Communicated by Wulfram Gerstner
ARTICLE
Lower Bounds for the Computational Power of Networks of Spiking Neurons

Wolfgang Maass
Institute for Theoretical Computer Science, Technische Universitaet Graz, Klosterwiesgasse 32/2, A-8010 Graz, Austria

We investigate the computational power of a formal model for networks of spiking neurons. It is shown that simple operations on phase differences between spike trains provide a very powerful computational tool that can in principle be used to carry out highly complex computations on a small network of spiking neurons. We construct networks of spiking neurons that simulate arbitrary threshold circuits, Turing machines, and a certain type of random access machine with real-valued inputs. We also show that relatively weak basic assumptions about the response and threshold functions of the spiking neurons are sufficient to employ them for such computations.

1 Introduction and Basic Definitions
There is substantial evidence that timing phenomena such as temporal differences between spikes and frequencies of oscillating subsystems are integral parts of various information processing mechanisms in biological neural systems (for a survey and references see, e.g., Kandel et al. 1991; Abeles 1991; Churchland and Sejnowski 1992; Aertsen 1993). Furthermore, simulations of a variety of specific mathematical models for networks of spiking neurons have shown that temporal coding offers interesting possibilities for solving classical benchmark problems such as associative memory, binding, and pattern segmentation (for an overview see Gerstner et al. 1993). Very recently one has also started to build artificial neural nets that model networks of spiking neurons (see, e.g., Murray and Tarassenko 1994; Watts 1994). Some aspects of these models have also been studied analytically (see, e.g., Gerstner and van Hemmen 1994; Gerstner 1995), but almost nothing is known about their computational complexity (see Judd and Aihara 1993, for some first results in this direction). In this article we investigate a simple formal model SNN for networks of spiking neurons that allows us to model the most important timing phenomena of neural nets, and we prove lower bounds for its computational power. Quite a number of different mathematical models for networks of spiking neurons have previously been introduced within the frameworks

Neural Computation 8, 1-40 (1996)
© 1995 Massachusetts Institute of Technology
of theoretical physics and theoretical neurobiology (see, e.g., Lapicque 1907; Buhmann and Schulten 1986; Crair and Bialek 1990; Gerstner 1991; Gerstner et al. 1993; for a survey and the relationship between these and related models see, e.g., Tuckwell 1988; and Gerstner 1995). The computational model SNN that we consider in this article is most closely related to the spike response model of Gerstner (1991) and Gerstner et al. (1993). Similarly as in Buhmann and Schulten (1986), we consider in this article only the deterministic case (which corresponds to the limit case β → ∞ for the inverse temperature β in the spike response model, and respectively, the noise-free case). We refer to Maass (1995d) for results about the computational power of the noisy version of this model. However, in contrast to these preceding models we do not fix particular (necessarily somewhat arbitrarily chosen) response and threshold functions in our model SNN. Instead, we want to be able to use the SNN model as a framework for investigating the computational power of various different response and threshold functions. In addition, we would like to make sure that various different response and threshold functions observed in specific biological neural systems are in fact special cases of the response and threshold functions in the formal model SNN.

1.1 Definition of a Spiking Neuron Network (SNN). An SNN N consists of
- a finite directed graph (V, E) (we refer to the elements of V as "neurons" and to the elements of E as "synapses")
- a subset V_in ⊆ V of input neurons
- a subset V_out ⊆ V of output neurons
- for each neuron v ∈ V − V_in a threshold function Θ_v : R+ → R ∪ {∞} (where R+ := {x ∈ R : x ≥ 0})
- for each synapse (u, v) ∈ E a response function ε_{u,v} : R+ → R and a weight function w_{u,v} : R+ → R.
We assume that the firing of the input neurons v ∈ V_in is determined from outside of N, i.e., the sets F_v ⊆ R+ of firing times ("spike trains") for the neurons v ∈ V_in are given as the input of N. Furthermore we assume that a set T ⊆ R+ of potential firing times has been fixed (we will consider only the cases T = R+ and T = {i · ρ : i ∈ N} for some ρ > 0). For a neuron v ∈ V − V_in one defines its set F_v of firing times recursively. The first element of F_v is inf{t ∈ T : P_v(t) ≥ Θ_v(0)}, and for any s ∈ F_v the next larger element of F_v is inf{t ∈ T : t > s and P_v(t) ≥ Θ_v(t − s)}, where the potential function P_v : R+ → R is defined by

P_v(t) := 0 + Σ_{u : (u,v) ∈ E} Σ_{s ∈ F_u : s < t} w_{u,v}(s) · ε_{u,v}(t − s)

[the trivial summand 0 makes sure that P_v(t) is well-defined even if F_u = ∅ for all u with (u, v) ∈ E].
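The recursive firing rule above can be made concrete by evaluating the potential on the discrete grid T = {i · ρ : i ∈ N}. The following minimal sketch is my own illustration; the function names and data layout are not from the paper.

```python
# Minimal sketch (my own illustration; names and data layout are not from the
# paper) of the recursive firing rule above, evaluated on the discrete grid
# T = {i*rho : i in N}.

def simulate_snn(edges, theta, firing_times_in, t_max, rho=0.01):
    """edges: dict v -> list of (u, w_uv, eps_uv) for the synapses (u, v) in E;
    theta: dict v -> threshold function Theta_v (argument: time since last firing);
    firing_times_in: dict u -> given spike train F_u for each input neuron u."""
    F = {v: [] for v in edges}                       # F_v for non-input neurons
    F.update({u: list(ts) for u, ts in firing_times_in.items()})
    for i in range(int(round(t_max / rho)) + 1):
        t = i * rho
        for v in edges:
            # P_v(t) = 0 + sum over (u, v) in E and spikes s in F_u with s < t
            P = sum(w * eps(t - s)
                    for (u, w, eps) in edges[v]
                    for s in F.get(u, []) if s < t)
            # Theta_v(0) before the first firing, Theta_v(t - s) afterwards
            x = t - F[v][-1] if F[v] else 0.0
            if P >= theta[v](x):
                F[v].append(t)
    return {v: F[v] for v in edges}
```

With a single input spike at time 0 and an EPSP that crosses Θ_v(0), the neuron fires once as soon as the response function reaches the threshold, and the refractory part of Θ_v then suppresses immediate refiring.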
The firing times ("spike trains") F_v of the output neurons v ∈ V_out that result in this way are interpreted as the output of N. Regarding the set T of potential firing times we consider in this article primarily the case T = R+ (SNN with continuous time), and only in Corollary 2.5 the case T = {i · ρ : i ∈ N} for some ρ > 0 (SNN with discrete time). Our subsequent assumptions about the threshold functions Θ_v will imply that for each SNN N there exists a bound τ_N ∈ R with τ_N > 0 such that Θ_v(x) = ∞ for all x ∈ (0, τ_N) and all v ∈ V − V_in (τ_N may be interpreted as the minimum of all "refractory periods" τ_ref of neurons in N). Furthermore we assume that all "input spike trains" F_v with v ∈ V_in satisfy |F_v ∩ [0, t]| < ∞ for all t ∈ R+. On the basis of these assumptions one can also in the continuous case easily show that the firing times are well-defined for all v ∈ V − V_in (and occur in distances of at least τ_N).

In models for biological neural systems one assumes that if x time-units have passed since its last firing, the current threshold Θ_v(x) of a neuron v is "infinite" for x < τ_ref (where τ_ref = refractory period of neuron v), and then approaches quite rapidly from above some constant value. A neuron v "fires" (i.e., it sends an "action potential" or "spike" along its axon) when its current membrane potential P_v(t) at the axon hillock exceeds its current threshold Θ_v. P_v(t) is the sum of various postsynaptic potentials w_{u,v}(s) · ε_{u,v}(t − s). Each of these terms describes an excitatory (EPSP) or inhibitory (IPSP) postsynaptic potential at the axon hillock of neuron v at time t, as a result of a spike that had been generated by the "presynaptic" neuron u at time s, and which has been transmitted through a synapse between both neurons. Recordings of an EPSP typically show a function that has a constant value c (c = resting membrane potential; e.g., c = −70 mV) for some initial time interval (reflecting the axonal and synaptic transmission time), then rises to a peak value, and finally drops back to the same constant value c. An IPSP tends to have the negative shape of an EPSP (see Fig. 3). For the sake of mathematical simplicity we assume in the SNN model that the constant initial and final value of all response functions ε_{u,v} is equal to 0 (in other words, ε_{u,v} models the difference between a postsynaptic potential and the resting membrane potential c).

Different presynaptic neurons u generate postsynaptic potentials of different sizes at the axon hillock of a neuron v, depending on the size, location, and current state of the synapse (or synapses) between u and v. This effect is modeled by the weight factors w_{u,v}(s). The precise shapes of threshold, response, and weight functions may vary among different biological neural systems, and even within the same system. Fortunately one can prove significant upper bounds for the computational complexity of SNNs N without any assumptions about the specific shapes of these functions of N. Instead, for such upper bounds one only has to assume that they are of a reasonably simple mathematical structure (see Maass 1994b, 1995c).
To prove lower bounds for the computational complexity of an SNN N one is forced to make more specific assumptions about these functions. However, we show in this article that significant (and in some cases optimal, see Section 3) lower bounds can be shown under some rather weak basic assumptions about these functions, which will be further relaxed in Section 4. These basic assumptions (see Section 2) mainly require that EPSPs have an arbitrarily small time segment where they increase linearly, and some arbitrarily small time segment where they decrease linearly. Since the computational power of SNNs may potentially increase through the use of time-dependent weights, lower bounds for their computational power are more significant if they do not involve the use of time-dependent weights. Hence we will assume throughout this article that all weight functions w_{u,v}(s) have a constant value w_{u,v}, which does not depend on the time s. Apart from the abovementioned condition on the existence of linear segments in EPSPs, the basic assumptions that underlie the lower bound results of this article involve no other significant conditions on the shape of response and threshold functions. Hence one may argue that these basic assumptions are biologically plausible. In addition, we will show in Section 4 that the same lower bounds can be shown if also phenomena such as "adaptation" of neurons, or a "reset" of the potential after a firing are taken into account. Thus the more critical points with regard to the biological interpretation of these lower bound results appear to be the relatively simple firing mechanism of the SNN model, which, for example, ignores for the sake of simplicity nonlinear interactions among postsynaptic potentials such as integration of potentials within the dendritic tree of a neuron, and various possible sources of "imprecision" in the determination of the firing times.
The latter issue can partially be taken into account by considering the variation of the SNN model with discrete firing times as in Corollary 2.5 (although the implicit global synchronization of this version is not completely satisfactory). In this variation of the SNN model with discrete firing times i · ρ for i ∈ N one can view a firing of a neuron at time i · ρ as representing a somewhat imprecise firing time in a small interval around time i · ρ.

The computational complexity of another neural network model where timing plays an important role has previously been investigated by Judd and Aihara (1993). Their model PPN is also motivated by biological spiking neurons, but it employs a quite different firing mechanism. There are no response functions in this model, and instead of integrating all incoming EPSPs and IPSPs in order to determine whether it should "fire," a neuron in a PPN randomly selects a single one of the incoming "stimulations" of maximal size, and determines on the basis of that stimulation whether it should fire. Consequently, computations in this model PPN proceed quite differently from computations in models of spiking neurons such as the spike response model of Gerstner and van Hemmen (1994), or the model SNN considered here. Judd and Aihara (1993) construct PPNs that can simulate Turing machines that use at most a constant number s of cells on their tapes, where s is bounded by the number of neurons in the simulating PPN. However a Turing machine with a constant bound s on its number of tape cells is just a special case of a finite automaton, and hence this result does not show that a PPN of finite size can have the computational power of an arbitrary Turing machine.

In contrast to the quoted result about PPNs, it is shown in Theorem 2.1 of this article that with arbitrary response and threshold functions that satisfy the basic assumptions of Section 2, one can construct for any given Turing machine M an SNN N_M of finite size that can simulate any computation of M in real-time (even if the number of tape cells that M uses is much larger than the number of neurons in N_M). In addition, at the end of Section 4 we will describe a way in which a simulation of arbitrary Turing machines can also be accomplished by finite SNNs whose response and threshold functions are piecewise constant. If we understand the model of Judd and Aihara (1993) correctly (their description is somewhat unclear), then our method for proving this (see also Maass and Ruf 1995) can also be used to show that with the help of a module that decides whether two neurons have fired simultaneously, one can simulate (although not in real-time) any Turing machine M (where M may use an unbounded number of tape cells) by some PPN P_M of finite size, thereby improving the lower bound for the computational power of PPNs due to Judd and Aihara (1993), from finite automata to Turing machines.

The focus in the investigation of computations in biological neural systems differs in two essential aspects from that of classical computational complexity theory.
First, the timing constraints for computations in biological neural systems are often tighter than for computations in digital computers, and many complex computations have to be carried out in "real-time" with relatively slow "switching elements." Secondly, one is not only interested in separate computations on unrelated inputs, but also in the ability of the system to learn to react appropriately to a sequence of related tasks. Hence the custom of evaluating the computational power in terms of "complexity classes" such as P or P/poly appears to be less suitable for the investigation of models for biological neural systems, and we therefore resort to an analysis in terms of refined concepts such as "real-time computations" and "real-time simulations." In this way we get not only information about the relationship between the "large scale" complexity classes (e.g., polynomial time) for these models for biological neural systems, but also about their behavior in terms of common notions of "low-level complexity" such as sublinear or real-time. Furthermore, with the help of our refined analysis of real-time simulations one also gets information about the "adaptive" or "learning" abilities of the considered models. Assume for example that ((x(i), y(i)))_{i∈N} is the protocol of some real-time "learning process" of a system M, where the y(i) are the "responses" of M to a sequence x(i) of "stimuli." If one
has shown that another model M' can simulate M in real-time, then this entails that the same "learning process" can also be carried out in real-time by M'.

1.2 Definition of Real-Time Computation and Real-Time Simulation. Fix some arbitrary (finite or infinite) input alphabet A_in and output alphabet A_out (for example they can be chosen to be {0,1}, {0,1}* or R). We say that a machine M processes a sequence ((x(i), y(i)))_{i∈N} of pairs (x(i), y(i)) ∈ A_in × A_out in real-time r, if M outputs y(i) for every i ∈ N within r computation steps after having received input x(i) [for i > 0 we assume that x(i) is presented at the next step after M has given output y(i − 1)]. We say that a machine M' simulates a machine M in real-time (with delay factor Δ) if for every r ∈ N and every sequence that is processed by M in real-time r, M' can process the same sequence in real-time Δ · r. In the case of SNNs M we count each spike in M as a computation step.
We first would like to point out that these notions contain the usual notions of a computation and simulation as special cases. Let {0,1}* be the set of all binary sequences of finite length. If M computes a Boolean function F : {0,1}* → {0,1} in time t(n) (in the usual sense of computational complexity theory), then one can identify each input (z_1, ..., z_n) ∈ {0,1}* with an infinite sequence (x(i))_{i∈N} where x(i) = z_i for i ≤ n and x(i) = B for i > n (assume that M gets one input bit per step, B := "blank"). Furthermore one can set y(i) = B for those steps i where M's computation is not yet finished, and y(i) = F((z_1, ..., z_n)) for all later i [in particular for all i ≥ t(n)]. Obviously M processes this sequence ((x(i), y(i)))_{i∈N} in real-time 1. Hence, if another machine M' can simulate M in real-time with delay factor Δ, then M' can compute the same function F : {0,1}* → {0,1} in time Δ · t(n). This implies that a real-time simulation is a special case of a linear-time simulation. In particular, every computational problem that can be solved by M within a certain time complexity can be solved by M' within the same time complexity (up to a constant factor). In addition, the remarks before the definition imply that when we show that M' can simulate M in real-time, we may conclude that any adaptive behavior (or learning algorithm) of M can also be implemented on M'. Finally we would like to point out that for the investigation of specific computational and learning problems on specific models for biological neural nets one would like to also eventually get estimates for the size of the constant r in real-time processing and the size of the delay factor Δ in a real-time simulation.
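The identification of a time-t(n) computation of F with a real-time-1 sequence can be written out directly. The following sketch is my own unrolling of that convention (with 0-indexed steps, unlike the 1-indexed text).

```python
# Illustrative sketch (my own unrolling of the convention in the text) of how
# a time-t(n) computation of F : {0,1}* -> {0,1} is viewed as processing a
# sequence ((x(i), y(i))) in real-time: one input bit per step, blanks while
# the machine is still working, then the value F(z) forever after.

B = "B"  # blank symbol

def as_realtime_sequence(z, F, t_of_n, horizon):
    """z: input bit string; t_of_n: time bound t(n); horizon: steps to unroll.
    Returns [(x(0), y(0)), ..., (x(horizon-1), y(horizon-1))], 0-indexed."""
    n = len(z)
    deadline = t_of_n(n)
    seq = []
    for i in range(horizon):
        x_i = z[i] if i < n else B           # x(i) = z_i while input lasts, blank after
        y_i = F(z) if i >= deadline else B   # the answer is available from step t(n) on
        seq.append((x_i, y_i))
    return seq
```

For example, with F = parity and t(n) = n, the output column stays blank for the first n steps and then repeats the parity bit.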
Such a refined analysis (which will not be carried out in this paper) also appears to be of interest, since it is likely to throw some light on the specific advantages and disadvantages of different models for biological neural systems (e.g., networks of spiking
neurons versus analog neural nets), which are shown in Maass (1994b, 1995c) to be equivalent with regard to the preceding notion of a real-time simulation.

In contrast to the usual notion of a simulation, a real-time simulation of another computational model M by an SNN implies that the simulation of each computation step of M requires only a fixed number of spikes in the SNN. In particular, the required number of spikes does not become larger for the simulation of later computation steps of M.

1.3 Input and Output Conventions. For simulations between SNNs and Turing machines one may either assume that the SNN gets an input (or produces an output) from {0,1}* in the form of a spike train (i.e., one bit per unit of time), or that the input (output) of the SNN is encoded into the phase difference of just two spikes. The former convention is suitable for comparisons with Turing machines that receive a single input bit and produce a single output bit at each computation step. For comparisons with Turing machines that start with the whole input written on a specified tape, and have their whole output written on another tape when the machine halts, it is more adequate to assume that the SNN receives at the beginning of a computation the whole tape content of the input tape encoded into the time difference φ between two spikes (using the same encoding as we will use in Section 2 to represent the content of a stack), and that the SNN also provides the final content of the output tape in the same form. Real-valued input or output for an SNN is always encoded into the phase difference of two spikes.
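As an illustration of such a phase encoding, the following sketch maps a bit string to a time difference between two spikes. The paper's actual stack encoding is given in Section 2; the base-4 fractional code used here is my own assumption, chosen so that distinct strings stay a guaranteed gap apart.

```python
# Hedged illustration of encoding a bit string into the time difference between
# two spikes. The paper's exact stack encoding appears in Section 2; here we
# assume (as an example) a base-4 fractional code: bit b contributes digit b+1,
# so every encoded string keeps a positive distance from every other one.

def encode_phase(bits, T=1.0):
    """Map a bit string to a time difference phi in [0, T): spikes at t0 and t0 + phi."""
    frac = sum((int(b) + 1) / 4 ** (k + 1) for k, b in enumerate(bits))
    return T * frac

def decode_phase(phi, n, T=1.0):
    """Recover the first n bits from the time difference phi."""
    frac, bits = phi / T, []
    for _ in range(n):
        frac *= 4
        digit = int(frac + 1e-9)   # digit in {1, 2} encodes bit 0 or 1
        bits.append(str(digit - 1))
        frac -= digit
    return "".join(bits)
```

Since the digits 1 and 2 are used (never 0 or 3), two encodings of strings that differ in some bit differ by at least one quarter of the corresponding place value, which is what makes such a code readable by spiking hardware with bounded precision.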
1.4 Notation. We employ in this article the following common notation: We write N for the set of natural numbers (including 0), Q for the set of rational numbers, and R for the set of real numbers. R+ is defined as {x ∈ R : x ≥ 0}. For any x ∈ R+ we write ⌈x⌉ for the least n ∈ N with n ≥ x. {0,1}* denotes the set of all binary strings of finite length. For any set S we write ∃x ∈ S(...) instead of ∃x(x ∈ S and ...), and ∀x ∈ S(...) instead of ∀x(x ∈ S → ...). For two functions f, g : N → N we write f = O(g) if there is some constant c such that f(n) ≤ c · g(n) for all except possibly finitely many n ∈ N.

1.5 Structure of This Article. In Section 2 we specify our basic assumptions about the response and threshold functions of an SNN, and we construct SNNs that can simulate in real-time arbitrary threshold circuits and Turing machines. In Section 3 we relate the computational power of SNNs for real-valued inputs to a specific type of random access machine. In Section 4 we discuss variations of the preceding constructions
for related models of spiking neurons, and in Section 5 we outline some conclusions from the results in this article.

2 Simulation of Threshold Circuits and Turing Machines by Networks of Spiking Neurons
To carry out computations on an SNN, some assumptions have to be made about the structure of the response and threshold functions of its neurons. It is obvious that for example neurons with response functions ε_{u,v} such that ε_{u,v}(s) = 0 for all s ≥ 0 cannot carry out any computation. We will specify in the following a set of basic assumptions, which suffice for the constructions in this article. Some variations of these conditions will be discussed in Section 4.

We assume that there exist some arbitrary given constants Δ_min, Δ_max ∈ R with 0 ≤ Δ_min < Δ_max so that we can choose for each "synapse" (u, v) ∈ E an individual "delay" Δ_{u,v} ∈ [Δ_min, Δ_max] with ε_{u,v}(x) = 0 for all x ∈ [0, Δ_{u,v}]. This parameter Δ_{u,v} corresponds in biology to the time span between the firing of the presynaptic neuron u and the moment when its effect reaches the trigger zone (axon hillock) of the postsynaptic neuron v. This time span is known to vary for individual neurons in biological neural systems, depending on the type of synapse and the geometric constellation. The constants Δ_min and Δ_max can be interpreted as biological constraints on the possible lengths of such time spans. No requirements about Δ_min and Δ_max are needed for our construction, except that Δ_min < Δ_max.

We assume that except for their individual delays the response functions ε_{u,v} (as well as the threshold functions Θ_v) are stereotyped, i.e., that their shape is determined by some general functions ε^E, ε^I, and Θ, which do not depend on u or v. More precisely, we assume that we can decide for any pair (u, v) ∈ E whether ε_{u,v} should represent an excitatory "EPSP response function," or an inhibitory "IPSP response function." In the EPSP case we assume that
ε_{u,v}(Δ_{u,v} + x) = ε^E(x) for all x ∈ R+,

and in the IPSP case we assume that

ε_{u,v}(Δ_{u,v} + x) = ε^I(x) for all x ∈ R+.

In either case we assume that ε_{u,v}(x) = 0 for all x ∈ [0, Δ_{u,v}]. Furthermore, we assume for all neurons v ∈ V − V_in that

Θ_v(x) = Θ(x) for all x ∈ R+.
Figure 1: Illustration of our notation for the basic assumptions on Θ, ε^E, ε^I (the functions shown are quite arbitrary and complicated, but nevertheless they satisfy our basic assumptions).

We assume that the three functions ε^E : R+ → R+, ε^I : R+ → {x ∈ R : x ≤ 0}, and Θ : R+ → R+ ∪ {∞} are some arbitrary functions with the following properties: There exist some arbitrary strictly positive real numbers τ_ref, τ_end, σ_1, σ_2, σ_3, τ_1, τ_2, τ_3, L, s_up, s_down with 0 < τ_ref < τ_end, σ_1 < σ_2 < σ_3, τ_1 < τ_2 < τ_3 (see Fig. 1 for an illustration), which satisfy the following five conditions:
1. Θ(x) ≥ Θ(0) > 0 for all x ∈ R+, Θ(x) = ∞ for all x ∈ (0, τ_ref), and Θ(x) = Θ(0) < ∞ for all x ∈ [τ_end, ∞).

2. ε^E(0) = ε^E(x) = 0 for all x ∈ [σ_3, ∞), and there exists some ε_max so that ∃x ∈ R+ [ε^E(x) = ε_max] and ∀y ∈ R+ [ε^E(y) ≤ ε_max].

3. ε^E(σ_1 + z) = ε^E(σ_1) + s_up · z for all z ∈ [−L, L].

4. ε^E(σ_2 + z) = ε^E(σ_2) − s_down · z for all z ∈ [−L, L].

5. ε^I(0) = ε^I(x) = 0 for all x ∈ [τ_3, ∞), and ε^I(x) < 0 for all x ∈ (0, τ_3). ε^I is nonincreasing in [0, τ_1] and nondecreasing in [τ_2, τ_3].
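To make the five conditions concrete, here are piecewise-linear example shapes in the spirit of Figure 2. The specific functions and parameter values are my own choices, not read off the figure; the final assertions witness conditions (3) and (4) with σ_1 = 0.5, σ_2 = 1.5, L = 0.25.

```python
# Example piecewise-linear shapes in the spirit of Figure 2 (my own choices,
# not read off the figure), with parameter values witnessing the conditions.

import math

TAU_REF, TAU_END = 0.5, 1.0   # refractory period / end of the elevated threshold

def theta(x):
    """Threshold: Theta(0) > 0, infinite on (0, tau_ref), back to Theta(0) by tau_end."""
    if x <= 0:
        return 1.0
    if x < TAU_REF:
        return math.inf
    if x < TAU_END:
        return 1.0 + 2.0 * (TAU_END - x)   # decreasing back toward Theta(0) = 1
    return 1.0

def eps_E(x):
    """EPSP: 0 at 0, linear rise to the peak at x = 1, linear fall, 0 from sigma_3 = 2 on."""
    if x <= 0 or x >= 2.0:
        return 0.0
    return x if x <= 1.0 else 2.0 - x      # slopes s_up = 1, s_down = 1

def eps_I(x):
    """IPSP: 0 at 0 and from tau_3 = 2 on, strictly negative in between (mirror of eps_E)."""
    return -eps_E(x)

# Witnesses for conditions (3) and (4): sigma_1 = 0.5 and sigma_2 = 1.5 lie on
# the linear segments, with L = 0.25.
for z in (-0.25, 0.0, 0.25):
    assert eps_E(0.5 + z) == eps_E(0.5) + 1.0 * z      # condition 3
    assert eps_E(1.5 + z) == eps_E(1.5) - 1.0 * z      # condition 4
```

Condition (1) holds with τ_ref = 0.5 and τ_end = 1; condition (2) with ε_max = 1 attained at x = 1; condition (5) with τ_1 = τ_2 = 1 and τ_3 = 2.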
We assume in addition that Θ(0), ε^E(σ_1), ε^E(σ_2), s_up, s_down ∈ Q. It should be pointed out that no conditions about the smoothness, the continuity, or the number of extrema of the functions Θ, ε^E, ε^I are made in the preceding basic assumptions. However, if one demands in addition that ε^E is piecewise linear and continuous, then conditions (3) and (4) become redundant. The assumption that Θ(0), ε^E(σ_1), ε^E(σ_2), s_up,
Figure 2: Examples for mathematically very simple functions ε^E, ε^I, and Θ that satisfy the basic assumptions.
s_down are rationals will be needed only to ensure that certain weights can be chosen to be rationals (see Section 2.9). Examples of mathematically particularly simple (piecewise linear) functions ε^E, ε^I, and Θ that satisfy all of the above conditions are exhibited in Figure 2. The subsequent construction shows that neurons with the very simple response and threshold functions from Figure 2 can, in principle, be used to build an artificial neural network with some finite number m of spiking neurons that can simulate in real-time any other digital computer (even computers that employ many more than m memory cells or computational units).

We have formulated the preceding basic assumptions on the response and threshold functions in a rather general fashion to make sure that they can in principle be satisfied by a wide range of EPSPs, IPSPs, and threshold functions that have been observed in a number of biological neural systems (see, e.g., Fig. 3). The currently available findings about biological neural systems (see, e.g., Kandel et al. 1991, and the discussions in Valiant 1994) indicate that in general a single EPSP alone cannot cause a neuron to fire. In fact, it is commonly reported that 30 to 100 EPSPs have to arrive within a short time span at a neuron to trigger its firing. These reports indicate that the weights w_{u,v} in our model should be assumed to be relatively small, since they cannot amplify a single EPSP to yield an arbitrarily high potential P_v. Hence for the sake of biological plausibility one should
Figure 3: Inhibitory and excitatory postsynaptic potentials at a biological neuron. [After Schmidt (1978). Fundamentals of Neurophysiology. Springer-Verlag, Berlin.]
assume that the values of all weights w_{u,v} in an SNN belong to some bounded interval [0, w_max]. For simplicity we assume in the following that w_max = 1. This convention just amounts to a certain scaling of the values of the response functions in relation to the threshold functions. In any version of this model where a single neuron is not able to cause the firing of another neuron, one necessarily has to assume that each input spike is simultaneously received by several neurons (since otherwise it cannot have any effect).

In spite of this convention we will occasionally assign much larger values to certain weights w_{u,v}. We will then (silently) assume that u does in fact represent an assembly of ⌈w_{u,v}⌉ neurons that all fire concurrently (⌈x⌉ is defined as the least natural number ≥ x). Furthermore, we assume in those situations that all edges from neurons in this assembly to neuron v have the same delay, and the same weight w_{u,v}/⌈w_{u,v}⌉ ∈ [0, 1]. The main difference between this type of construction and a construction with arbitrarily large weights is that in our setup the (virtual) use of large weights blows up the number of neurons that are needed.

Theorem 2.1. If the response and threshold functions of the neurons satisfy the previously described basic assumptions, then one can build from such neurons for any given d ∈ N an SNN N_TM(d) of finite size that can simulate with a suitable assignment of rational values from [0, 1] to its weights any Turing machine with at most d tapes in real-time. Furthermore N_TM(2) can compute any function F : {0,1}* → {0,1}* with a suitable assignment of real values from [0, 1] to its weights.

The proof of Theorem 2.1 is rather complex. Therefore we have divided it into Sections 2.1 to 2.10, which are devoted to different aspects of the modules of the construction. Several of these modules are also useful for other constructions.
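The assembly convention for large weights described above amounts to a small calculation; the following sketch is my own illustration (the helper name is mine).

```python
# Sketch of the assembly convention described above: a "virtual" weight w > 1
# on an edge (u, v) is realized by an assembly of ceil(w) copies of u firing
# concurrently, each connected to v with the same delay and the admissible
# per-edge weight w / ceil(w) in [0, 1].

import math

def assembly_for_weight(w):
    """Return (number of neurons in the assembly, per-edge weight in [0, 1])."""
    n = max(1, math.ceil(w))
    return n, w / n

# The total drive to v is unchanged: n * (w / n) = w; only the neuron count grows.
```

This makes explicit the trade-off stated in the text: large virtual weights cost neurons, not weight magnitude.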
The global construction of N_TM(d) (with the properties claimed in Theorem 2.1) is described in Section 2.10.
We will discuss in Section 4 some methods for alternative constructions of N_TM(d) that are based on different assumptions about response and threshold functions.
2.1 Conditions on the Neurons. We assume that we can decide for any pair (u, v) of neurons whether there should be a "synapse" between both neurons (i.e., (u, v) ∈ E). Self-referential edges of the form (v, v) will not be needed. In this proof the weights w_{u,v} on edges (u, v) are always assumed to be time invariant, and they are only assigned values from [0, 1]. We assume that the response and threshold functions satisfy the previously described basic assumptions.
2.2 Delay- and Inhibition Modules. We will construct in this section two very simple modules that will be used frequently (and often silently) in the subsequent constructions. From the general point of view the existence of these two modules shows that our very weak assumptions about Δ_min and Δ_max (we have only required that 0 ≤ Δ_min < Δ_max) as well as our very weak assumptions about the shape of ε^I in condition (5) are in fact sufficient to create in an SNN arbitrarily long delays, and arbitrarily fast appearing or arbitrarily fast disappearing inhibitions of arbitrarily long duration.

A "delay-module" is simply a chain u_1, ..., u_{k+1} of neurons so that (u_i, u_{i+1}) ∈ E, ε_{u_i,u_{i+1}} is an EPSP response function, and w_{u_i,u_{i+1}} := Θ(0)/ε_max for i = 1, ..., k. Since each delay Δ_{u_i,u_{i+1}} can be chosen arbitrarily from [Δ_min, Δ_max], the total "delay" between the firing of u_1 and the arrival of an EPSP at u_{k+1} can be chosen to assume any value in a certain interval of length k · (Δ_max − Δ_min). It will cause no problem that the total transmission time from u_1 to u_{k+1} grows along with k, since in the subsequent constructions time will essentially be considered only modulo a certain constant T_PM.

We next construct for any given real numbers δ, Δ > 0 and κ < 0 "inhibition modules" I_{δ,κ,Δ} and I^{δ,κ,Δ}. I_{δ,κ,Δ} can be used to transmit to any desired neuron v a volley of IPSPs that sum up to a potential which changes from its initial value 0 to some value ≤ κ within a time interval of length δ, and then maintains a value ≤ κ for at least the following time interval of length Δ. I_{δ,κ,Δ} consists of a neuron u that transmits EPSPs simultaneously to several "relay neurons" u_1, ..., u_l, which are triggered by this EPSP to send an IPSP to some given neuron v. If l and the delays between the neurons are chosen appropriately [as a function of δ, κ, Δ, ε^I, and the parameter τ_1], this module will have the desired effect on neuron v.
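The delay-module calculation above can be sketched as follows. This is my own illustration; the bounds D_MIN, D_MAX stand for Δ_min, Δ_max, and the greedy choice of individual delays is just one of many valid ones.

```python
# Sketch of the delay-module calculation: with k synapses, each with an
# individual delay in [D_MIN, D_MAX], the achievable total delays fill an
# interval of length k * (D_MAX - D_MIN). Given a feasible target, we pick
# per-synapse delays greedily.

D_MIN, D_MAX = 0.1, 0.3   # assumed biological bounds Delta_min < Delta_max

def chain_delays(k, target):
    """Choose k delays in [D_MIN, D_MAX] summing to `target`, if feasible."""
    if not (k * D_MIN <= target <= k * D_MAX):
        raise ValueError("target delay not reachable with k synapses")
    delays, remaining = [], target
    for i in range(k, 0, -1):
        # keep the remainder achievable by the i - 1 synapses still to come
        d = min(D_MAX, max(D_MIN, remaining - (i - 1) * D_MIN))
        delays.append(d)
        remaining -= d
    return delays
```

For example, a total delay of 0.7 over three synapses is realized as roughly 0.3 + 0.3 + 0.1, while a target outside [k·Δ_min, k·Δ_max] is rejected, matching the interval stated in the text.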
Dually, one can also build for any δ, Δ > 0 and κ < 0 an inhibition module I^{δ,κ,Δ} that sends IPSPs to any specified neuron v whose sum stays ≤ κ for a time interval of length ≥ Δ, and then returns to 0 within a time
Figure 4: Graph structure of an oscillator consisting of one neuron (a) and two neurons (b). interval of length 5 6. Here we exploit the fact that according to condition (5) the function ~ ' ( xis) nondecreasing and strictly negative for x E [7*.n ) . 2.3 Oscillators. Consider subgraphs of an SNN of the structure shown in Figure 4. Both types of subgraphs can be used to build an oscillator. The first one is somewhat simpler, but we will not use it in our construction since it would require a self-referential edge (ZI.11) E E. In the second type of oscillator (Fig. 4b) we assume that w ~ , , ~U ,I ,~ ~2. ~ , O(O)/E,,,~~, and that both E,,,,, and E,,,,, are EPSP response functions. Thus after an initial EPSP through edge a both neurons will fire periodically. More precisely, z) will fire at times t o i. 7r for i = 1.2. . . ., until it is halted by an IPSP through edge 11. We refer to 7r as the oscillation period of this oscillator. We will distinguish one such oscillator as the "pacemnker" for the constructed SNN, which we denote by PM. We write T ~ Mfor its oscillation period. We assume that the oscillation of PM is started at "time 0 by the first input spike to the SNN, and that it continues without interruption throughout the computation of the SNN. PM emits EPSPs through edge e, which will then be broadcast as a timing standard throughout the SNN. We will say in the following that some other neuron ZJ in the SNN fires "at unit fimr" or "synchronously" if the considered firing of z, occurs at a time point t of the form i . T P M for some i E N. In N T M ( d ) we will use oscillators in two ways as storage devices. First we use them as "registers" for storing a bit (via their two states dormant/oscillating), for example in the control of h $ ~ ( d ) .Second we
use oscillators O with oscillation period π_PM to store arbitrary numbers ϕ ∈ [0, π_PM] via their phase difference to PM (i.e., neuron v of oscillator O fires at time points of the form i · π_PM + ϕ with i ∈ N). In this way oscillators can for example store the time difference between two input spikes to the SNN, and the program and tape content of a simulated Turing machine, respectively.

2.4 Synchronization Modules. A characteristic feature of a computation on a feedforward Boolean circuit of the usual type is that the timing of its computation steps is independent of the values of the bits that occur in the computation. For example, the timing of the output signal of an OR gate does not depend on the values of its input bits. This feature is very useful, since with its help one can arrange that all input bits for Boolean gates on higher levels of the circuit arrive simultaneously, and therefore it allows us to build complex circuits from simple modules. If one wants to carry out computations on an SNN with single spikes, one would like to interpret the firing of a neuron at a certain time as the bit "1" and nonfiring as "0." Thus one might, for example, want to simulate an OR gate by a neuron v that fires whenever it receives at least one EPSP. However, when that neuron receives two EPSPs simultaneously (corresponding to two input bits being 1) it would in general fire slightly earlier than in a situation where it receives just a single EPSP. This effect is a consequence of having EPSP response functions ε_{u,v}(x) that are not piecewise constant. In addition, if v has already fired just before, then the fact that Θ(x) is in general not piecewise constant also contributes to this effect.
Unfortunately this effect makes it impossible to simulate on an SNN in a straightforward manner a multilayer Boolean circuit (where the bit "1" is signaled by a spike, and "0" by the absence of a spike at the corresponding time): the input "bits" for neurons that simulate Boolean gates on higher layers of the circuit will in general not arrive at the same time. Furthermore it is not possible to correct this problem by employing delay modules of the type that we had constructed in Section 2.2, since the required length of the delays depends on the current values of the input bits. We will solve this problem with the help of the synchronization module constructed here. In fact, we will show in the next section that with the help of this module an SNN suddenly gains the full computational power of a Boolean feedforward threshold circuit, and therefore is able to carry out within a small number of "cycles" substantially more complex computations than a regular Boolean circuit. On first sight it appears to be impossible to build a synchronization module without postulating the existence of an EPSP response function that has segments of length ≥ π_PM where it is constant, or increases or decreases linearly. However the following "double-negation trick" allows us to build a synchronization module without any additional assumptions.
Figure 5: Structure of a synchronization module.
Consider the graph of an SNN on the left-hand side of Figure 5. We arrange that as long as no EPSP is transmitted through its "input edge" e, the neuron u fires regularly with period π_PM as a result of EPSPs from the pacemaker PM. The firings of u induce the inhibition module I_2 to send IPSPs to neuron v that "cancel out" the EPSPs that arrive at v directly from PM. Therefore in the absence of an input through edge e the neuron v does not fire. Assume now that at some arbitrary time point an (unsynchronized) EPSP arrives through edge e. This EPSP triggers the inhibition module I_1, which then sends out IPSPs that prevent neuron u from firing for a time interval of some fixed length > π_PM. Therefore at least one of the EPSPs that arrive at neuron v from PM is not cancelled out by IPSPs from the inhibition module I_2, and neuron v emits at least one synchronized spike (i.e., v fires at least once, and with a proper choice of delays only at unit times of the form i · π_PM with i ∈ N). A closer look shows that the mechanism of this module is in fact a bit more delicate. It can, in principle, happen that at neuron u the beginning or the end of a negative potential from I_1 coincides with an EPSP from PM in such a way that it leads to a small shift in some firing time of u (besides canceling other firings of u). This could shift the time interval of the activity of I_2 by a certain amount ρ. One has to make sure that this shift cannot lead to a competition at neuron v between the negative
potential from I_2 and the EPSP from PM that results in an unsynchronized firing of v. One can solve this technical problem by designing I_1 and I_2 so that their output is the superposition of the outputs of two modules of the form I^{δ,λ,K}. In this way their strongly negative output potential (of value ≤ K) both builds up and disappears at neuron v within time intervals of length δ. This parameter δ provides then an upper bound for the length ρ of the possible time shifts of these negative potentials. By choosing δ sufficiently small (and by arranging the lengths and delays of these inhibitions appropriately), for any arrival time of an input spike through edge e and for any EPSP from PM the resulting inhibition from I_2 either cancels the corresponding firing of v, or it lets v fire without shifting its firing time (canceling some other firings of v instead). For that purpose one chooses the weight w ∈ [0, 1] on the edge from PM to v so that the resulting function w · ε^E crosses Θ(0) while it is in the middle of its linearly increasing segment [see condition (3) of our basic assumptions]. The timing of this synchronization module can be specified with more precision as soon as one selects concrete response and threshold functions that satisfy our basic assumptions. However, the preceding analysis shows that it will do its job in any case. One should keep in mind that our basic assumptions are relatively weak. For example, they do not even prescribe the relationships between the sizes of the parameters σ₃, τ₃, and T_end that denote the lengths of the nontrivial segments of the response and threshold functions. It turns out that the previously described module may output not just one, but a larger finite number of synchronized spikes as a result of one unsynchronized input spike.
This effect causes no serious problem in our subsequent applications of this module (and it might occasionally be helpful for speeding up a computation), but it is easier to verify a construction if this module never outputs more than one synchronized spike for each input spike. This additional requirement can be satisfied by adding after neuron v a device with three neurons v_1, v_2, v_3 as indicated in the right-hand side of Figure 5. With suitably chosen delays and parameters for its inhibition module I_3, this device removes all except the first spike from any sequence of successive synchronized spikes. It lets the first one of these spikes emerge from neuron v_3 as a single synchronized output spike.
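The control flow of this double-negation trick can be illustrated with a toy discrete-time simulation. Everything here is an illustrative abstraction, not part of the construction: time is collapsed into one slot per pacemaker period π_PM, and the potentials, delays, and modules I_1, I_2 are reduced to set membership.

```python
# Toy sketch of the double-negation trick of Section 2.4. One abstract time
# slot corresponds to one pacemaker period pi_PM; all dynamics are abstracted.
def synchronize(input_slot, n_slots=10):
    """Return the unit-time slots at which the output neuron v fires, given an
    (unsynchronized) input spike arriving during slot `input_slot`."""
    # I1 silences neuron u for a fixed interval longer than one period pi_PM
    u_blocked = {input_slot, input_slot + 1}
    fires = []
    for t in range(n_slots):
        # I2 cancels the PM EPSP at v except when u was silenced one slot earlier
        if (t - 1) not in u_blocked:
            continue          # cancelled by I2: v stays silent
        fires.append(t)       # PM EPSP survives: v emits a synchronized spike
    return fires

# v fires only at unit times, regardless of when the input arrived; it may
# emit more than one spike, matching the remark in the text.
assert synchronize(3) == [4, 5]
assert synchronize(7, n_slots=10) == [8, 9]
```

Note that, as in the text, the raw module can emit more than one synchronized spike per input; the three-neuron device of Figure 5 would keep only the first.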
2.5 Simulation of Boolean Threshold Circuits by SNNs. If one just wants to simulate in a straightforward manner the control of a Turing machine on an SNN, one can reserve one neuron for each possible state of the control, and simulate state transitions with the help of neurons that simulate Boolean AND and OR gates. However Horne and Hush (1994) have pointed out that many fewer neurons are needed if one simulates the control with the help of a Boolean feedforward threshold circuit with gates of unbounded fan-in (see Section 2.8). In addition, the ability of
SNNs to simulate threshold circuits in an efficient manner is of substantial interest for various other reasons (see Corollary 2.4 and the lower bound for the VC dimension of SNNs in Maass 1994b). Therefore we describe the simulation of a threshold circuit on an SNN, rather than considering first the simulation of the special case of a Boolean circuit with gates of bounded fan-in (which would suffice for the proof of Theorem 2.1). A feedforward Boolean threshold circuit (threshold circuit for short) consists of a directed acyclic graph with nodes of arbitrary fan-in, which correspond to linear threshold gates (threshold gates for short) with arbitrary weights. A threshold gate with fan-in m computes a threshold function of the form

T_α : {0,1}^m ∋ (x_1, ..., x_m) ↦ T_α(x_1, ..., x_m) = 1 if Σ_{i=1}^m α_i · x_i ≥ α_0, and 0 otherwise,

with arbitrary parameters α_0, ..., α_m ∈ R (or equivalently: α_0, ..., α_m ∈ Z). It is obvious that the common Boolean operations AND, OR, NOT are special cases of threshold functions. Therefore the common types of feedforward Boolean circuits (even with ANDs and ORs of arbitrarily large fan-in) are special cases of threshold circuits. Hence, since every Boolean function can be defined by a Boolean formula in disjunctive normal form (see, e.g., Lewis and Papadimitriou 1981), it is clear that every Boolean function can be computed by a threshold circuit of depth 2 (i.e., with one "hidden" layer). There are several different possibilities for simulating a threshold circuit on an SNN, providing subtle tradeoffs between the amount of demands imposed on the response functions, the noise robustness of the construction, and the number of neurons needed for the simulation. We describe one simple construction based on our basic assumptions, and we will indicate a variation in Section 4.

Consider first a "monotone" threshold function, i.e., a threshold function T_α with α_i > 0 for all "weights" α_1, ..., α_m. If α_0 ≤ 0 then T_α always outputs "1," and is therefore superfluous. Hence we may assume that α_0 > 0. By condition (2) each EPSP response function ε_{u,v} has some maximal value ε_max > 0 that does not depend on u or v. We employ for the computation of T_α on an SNN m + 1 neurons u_1, ..., u_m and v with {u : ⟨u, v⟩ ∈ E} = {u_1, ..., u_m}. We assume that all response functions ε_{u_i,v} are EPSPs and that the weights w_{u_i,v} are chosen so that w_{u_i,v} · ε_max = α_i · Θ(0)/α_0 (slightly larger values should be chosen if one has to deal with imprecision). Furthermore, we assume that the "delays" Δ_{u_i,v} are chosen to be the same for i = 1, ..., m. Consider then some arbitrary set S ⊆ {1, ..., m}. Assume that the neurons u_i with i ∈ S fire simultaneously at some time t_0, and that the neurons u_i with i ∈ {1, ..., m} − S never fire in
the time interval [t_0 − σ₃, t_0 + σ₃], and that neuron v did not fire in the time interval [t_0 + Δ_{u_i,v} − T_end, t_0 + Δ_{u_i,v}]. Then v fires at some point in the time interval (t_0 + Δ_{u_i,v}, t_0 + Δ_{u_i,v} + σ₃] if and only if Σ_{i∈S} w_{u_i,v} · ε_max ≥ Θ(0). The latter inequality is equivalent to Σ_{i∈S} α_i · Θ(0)/α_0 ≥ Θ(0), hence to Σ_{i∈S} α_i ≥ α_0. Thus we have constructed a module of an SNN that computes an arbitrary monotone Boolean threshold function T_α. This module has the disadvantage that its proper functioning is guaranteed only if all u_i with i ∈ S fire at a common time t_0. On the other hand the firing time of v depends not only on t_0, but also on S (i.e., on its "input bits"). In general a larger set S gives rise to a slightly earlier firing time of v (because the function ε^E does not jump immediately from 0 to ε_max). Obviously these two facts together cause problems if one wants to use compositions of the previously constructed module to simulate a multilayer monotone threshold circuit (i.e., a threshold circuit where all gates compute monotone threshold functions). Therefore one has to use synchronization modules between any two layers of modules to simulate a monotone threshold circuit on an SNN. We will now describe the simulation of an arbitrary threshold circuit C, where threshold functions T_α with "weights" α_i of arbitrary sign are computed by the gates of C. It is well-known (see Hajnal et al. 1993) that such a circuit C can be simulated by a monotone threshold circuit C_m of the same depth, provided that C_m also receives for each Boolean input variable x_i its negation 1 − x_i. Proceeding from the input layer to the output layer one can then replace each threshold gate g of C by two gates that both compute monotone threshold functions: one of them provides the same output as g, and the other one provides the negation of that output.
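The weight scaling behind this monotone-gate module can be checked with a plain numeric sketch. The values of Θ(0) and ε_max below are illustrative (the paper does not prescribe them), and the continuous EPSP dynamics are reduced to the peak of the summed potentials:

```python
# Hedged numeric sketch of the monotone threshold gate of Section 2.5: with
# w_i * eps_max = alpha_i * Theta0 / alpha_0, the summed EPSPs reach the
# threshold Theta0 exactly when sum_{i in S} alpha_i >= alpha_0.
Theta0 = 1.0      # threshold value Theta(0) (illustrative)
eps_max = 0.5     # maximal value of the EPSP response function (illustrative)

def fires(alphas, alpha0, S):
    """True iff neuron v fires when exactly the neurons u_i, i in S, spike at t_0."""
    w = [a * Theta0 / (alpha0 * eps_max) for a in alphas]  # w_i*eps_max = alpha_i*Theta0/alpha_0
    potential = sum(w[i] * eps_max for i in S)             # peak of the summed EPSPs
    return potential >= Theta0

# T_alpha(x1, x2, x3) = [2*x1 + x2 + x3 >= 2]
assert fires([2, 1, 1], 2, S={0}) is True       # x = (1,0,0)
assert fires([2, 1, 1], 2, S={1, 2}) is True    # x = (0,1,1)
assert fires([2, 1, 1], 2, S={2}) is False      # x = (0,0,1)
```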
Thus in order to simulate C on an SNN, one needs in addition to the preceding construction a preprocessing device that computes the negation 1 − x for each input bit x ∈ {0, 1} under the considered bit encoding (where "x = 1" is encoded by a firing of a neuron u at a certain time t, and "x = 0" by the nonfiring of u within a certain time interval around t). For that purpose one connects u to an inhibition module whose outputs cancel out an EPSP from PM at another neuron u' (similarly as in Section 2.4). Then u' will fire if and only if it is not inhibited via a firing of u; hence u' computes "1 − x."
2.6 Modules for Comparisons and Multiplication of Phases with Arbitrary Constants. We will construct in this section a module for an SNN that can compare the phase difference ϕ of an oscillator O with some given constant α [COMPARE(≥ α)], and a module that can multiply ϕ with some given constant β [MULTIPLY(β)]. Such modules [more precisely: modules for the operation COMPARE(≥ 2^{−1−c}) for a certain constant c, as well as modules for MULTIPLY(2) and MULTIPLY(1/2)] will be needed in the next section to simulate a stack on an SNN.
Figure 6: Mechanism of the module for COMPARE(≥ α).
Let α ∈ [0, L/2] be some arbitrary real constant. We construct a module that can decide whether the phase difference ϕ ∈ [0, L/2] between PM and some oscillator O with oscillation period π_PM is ≥ α. More precisely, this module for the operation COMPARE(≥ α) will send out a spike within some time interval of some given length 2D if and only if ϕ ≥ α. Consider neurons u_1, u_2, and v with ⟨u_i, v⟩ ∈ E for i = 1, 2. Assume that u_1 is induced to fire at a certain time t_1 by a spike from the pacemaker PM. Furthermore, assume that u_2 is induced to fire at a certain time t_2 by a spike from the oscillator O. Finally we assume that the delays Δ_{u_1,v} and Δ_{u_2,v} have been chosen so that in the case ϕ = α one has for t̂_i := t_i + Δ_{u_i,v} that there exists some t* ≥ max(t̂_1, t̂_2) so that t* − t̂_1 = σ_1 and t* − t̂_2 = σ_2. We choose weights w_{u_i,v} > 0 so that w_{u_1,v} · s_up = w_{u_2,v} · s_down and w_{u_1,v} · ε^E(σ_1) + w_{u_2,v} · ε^E(σ_2) = Θ(0) (see Fig. 6). According to our general convention at the beginning of this section we actually have to replace in the case w_{u_i,v} > 1 the neuron u_i by an assembly of ⌈w_{u_i,v}⌉ neurons with weights from [0, 1] on their edges to v. However, for the sake of simplicity, we will ignore this trivial complication in the following. We arrange that for an arbitrarily given parameter D > 0 two inhibition modules of the form I^{δ,λ,K} (with suitable values of their parameters) are triggered by spikes from PM to send IPSPs to v so that v is not able to fire within the time intervals [t* − L/2 − D, t* − L/2) and (t* + L/2, t* + L/2 + D] even if the firing time t_2 of neuron u_2 is arbitrarily shifted, but so that
these inhibition modules have no effect on the potential P_v at neuron v during the time interval [t* − L/4, t* + L/4]. Consider now what happens if the phase difference ϕ of the oscillator O is not fixed at ϕ = α, but assumes any value in [0, L/2]. Then by the choice of the parameters w_{u_1,v}, w_{u_2,v}, and t*, and by conditions (3) and (4) of our basic assumptions, the sum of the EPSPs from u_1 and u_2 at neuron v has in any case a constant value within the time interval [t* − L/2, t* + L/2]. Furthermore, this constant value is ≥ Θ(0) if and only if ϕ ≥ α. Hence the neuron v will fire within the time interval [t* − L/2, t* + L/2] if and only if ϕ ≥ α. Furthermore, by the choice of the inhibition modules the neuron v fires within the time interval [t* − L/2, t* + L/2] if and only if it fires within the time interval [t* − D, t* + D].

We now assume that some arbitrary real number β > 0 is given, and we construct a module that carries out the operation MULTIPLY(β). This module also consists of neurons u_1, u_2, v with ⟨u_i, v⟩ ∈ E for i = 1, 2, so that u_1 is triggered to fire at some time t_1 by a spike from the pacemaker PM, and u_2 is triggered to fire at some time t_2 by a spike from an oscillator O that has oscillation period π_PM and some phase difference ϕ ∈ [0, min(L/2, L/(2β))] to PM. We want to achieve that for any value ϕ ∈ [0, min(L/2, L/(2β))] of this phase difference the "output neuron" v of this module fires at a time t + β · ϕ, where t does not depend on ϕ. The construction of the module for the operation MULTIPLY(β) is slightly different for the two cases β > 1 and β ∈ (0, 1]. We consider first the case β > 1. Assume for the moment that the phase difference ϕ ∈ [0, L/(2β)] between O and PM has value 0, and choose delays Δ_{u_i,v} so that there exists for t̂_i := t_i + Δ_{u_i,v} some t* ≥ max(t̂_1, t̂_2) with t* − t̂_1 = σ_2 and t* − t̂_2 = σ_1. Furthermore, we choose weights w_{u_i,v} > 0 so that

w_{u_1,v} · ε^E(t* − t̂_1) + w_{u_2,v} · ε^E(t* − t̂_2) = Θ(0)     (2.1)

and

w_{u_2,v} · s_up = β · (w_{u_2,v} · s_up − w_{u_1,v} · s_down).     (2.2)

Since β > 1, equation 2.2 implies that 0 < w_{u_1,v} · s_down < w_{u_2,v} · s_up. Hence we have

w_{u_1,v} · ε^E(t* − t̂_1 + x) + w_{u_2,v} · ε^E(t* − t̂_2 + x) < Θ(0)  for all x ∈ [−L, 0).     (2.3)

We would like to arrange that v does not fire during the time interval [t* − T_end, t*), where T_end has the property that Θ(x) = Θ(0) for all x ∈ [T_end, ∞) [according to condition (1)]. Furthermore, we would like to make sure that this property holds even if the firing of u_2 is delayed by some arbitrary amount ϕ ∈ [0, L/(2β)]. However, even if one assumes that only the considered EPSPs from u_1 and u_2 are influencing P_v(t), this assumption allows us to derive this fact only with the help of equation 2.3 for the interval [t* − L/2, t*), since we did not make more detailed assumptions about the shape of the function ε^E. Therefore we arrange that
Figure 7: Multiplication of a phase ϕ with β > 1 (i.e., t_ϕ − t* = β · ϕ).
at a suitable time an inhibition module I^{δ,λ,K} sends IPSPs to v, which make it impossible for v to fire during the time interval [t* − T_end, t* − L/2) (no matter at what time u_2 fires), but which do not influence the potential P_v(t) at times t ≥ t*. Furthermore, we arrange that no other EPSPs or IPSPs contribute to P_v(t) for t ∈ [t* − L/2, t*]. In this way v cannot fire during the time interval [t* − T_end, t*) (even if the firing of u_2 is delayed by some ϕ ∈ [0, L/(2β)]). Therefore in the case ϕ = 0 our assumption (equation 2.1) implies that neuron v will fire at time t*. We now consider what will change if the firing of u_2 at time t_2 is replaced by a slightly later firing at time t_2 + ϕ, whereas the firing times of u_1 and of the inhibition module remain unchanged. We will show that for any ϕ ∈ (0, L/(2β)] this delay will cause a somewhat delayed firing of v (see Fig. 7). Consider the time point t_ϕ, which is defined by the equation

w_{u_1,v} · ε^E(t_ϕ − t̂_1) + w_{u_2,v} · ε^E[t_ϕ − (t̂_2 + ϕ)] = Θ(0)     (2.4)
By equation 2.1 and conditions (3) and (4) of our basic assumptions we have for t_ϕ − t* ∈ [−L, L]

w_{u_1,v} · ε^E(t_ϕ − t̂_1) = w_{u_1,v} · ε^E(t* − t̂_1) − w_{u_1,v} · s_down · (t_ϕ − t*)     (2.5)

and for ϕ, t_ϕ with t_ϕ − t* − ϕ ∈ [−L, L] we have that

w_{u_2,v} · ε^E[t_ϕ − (t̂_2 + ϕ)] = w_{u_2,v} · ε^E(t* − t̂_2) + w_{u_2,v} · s_up · (t_ϕ − t* − ϕ).     (2.6)
Figure 8: Multiplication of a phase ϕ with β ∈ (0, 1).

These two equations in conjunction with 2.1, 2.2, and 2.4 imply that
t_ϕ − t* = β · ϕ. It is obvious that for ϕ ∈ [0, L/(2β)] one has that β · ϕ, β · ϕ − ϕ ∈ [−L, L]. Furthermore, it is clear from our construction that v cannot fire during the time interval [t*, t* + β · ϕ). Therefore t_ϕ := t* + β · ϕ is in fact the firing time of v if u_2 fires at time t_2 + ϕ. Hence the described module carries out the operation MULTIPLY(β) in the case that β > 1.

To carry out the operation MULTIPLY(β) for some arbitrarily given β ∈ (0, 1) we just change the delay Δ_{u_1,v} in the previously described module so that t* − t̂_1 = σ_1 (instead of t* − t̂_1 = σ_2) (see Fig. 8). We choose weights w_{u_i,v} > 0 so that equation 2.1 holds and

w_{u_2,v} · s_up = β · (w_{u_1,v} · s_up + w_{u_2,v} · s_up).     (2.7)

As before, we consider the time point t_ϕ that is defined by equation 2.4. Then equation 2.6 holds, but instead of 2.5 we have

w_{u_1,v} · ε^E(t_ϕ − t̂_1) = w_{u_1,v} · ε^E(t* − t̂_1) + w_{u_1,v} · s_up · (t_ϕ − t*).

The latter two equations in conjunction with 2.1, 2.4, and 2.7 imply that

t_ϕ − t* = β · ϕ.
Hence the described module carries out the operation MULTIPLY(β) for an arbitrarily given β ∈ (0, 1).
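The algebra behind MULTIPLY(β) for β > 1 can be checked numerically. In the sketch below the slopes s_up, s_down and the factor β are illustrative values; the weights are chosen according to equation 2.2, and firing_shift solves equations 2.4–2.6 for t_ϕ − t*:

```python
# Sketch of the MULTIPLY(beta) mechanism for beta > 1 with piecewise-linear
# EPSP segments: u_1's EPSP falls with slope -s_down around t*, u_2's rises
# with slope s_up. Solving equations 2.4-2.6 gives t_phi - t* = beta * phi.
s_up, s_down = 1.0, 0.5     # illustrative slopes of the linear EPSP segments
beta = 2.0

# choose weights according to equation 2.2: w2*s_up = beta*(w2*s_up - w1*s_down)
w2 = 1.0
w1 = w2 * s_up * (1 - 1 / beta) / s_down

def firing_shift(phi):
    """Solve -w1*s_down*(t - t*) + w2*s_up*(t - t* - phi) = 0 for t - t*."""
    return w2 * s_up * phi / (w2 * s_up - w1 * s_down)

for phi in (0.0, 0.05, 0.1):
    assert abs(firing_shift(phi) - beta * phi) < 1e-12
```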
2.7 Simulation of a Stack with Unlimited Capacity by an SNN of Fixed Size. The simulation of a stack (also called a pushdown store, or a first-in last-out list) is the most delicate part of the construction of N_TM(d), since it requires the construction of a module in which the lengths ℓ of the bit-strings ⟨b_1, ..., b_ℓ⟩ that are stored and manipulated are in general much larger than the number of neurons in this module (in fact, ℓ can be arbitrarily large). Of course N_TM(d) needs to have a component with this property, since otherwise the SNN N_TM(d) (which will consist of a fixed finite number of neurons) cannot simulate the computations of Turing machines that involve tape inscriptions of arbitrary finite length. The content ⟨b_1, ..., b_ℓ⟩ ∈ {0,1}* of a stack S (where b_1 is the symbol on top of the stack) will be stored in the form of the phase difference

ϕ_S = Σ_{i=1}^{ℓ} b_i · 2^{−i−c}

of a special oscillator O_S. More precisely, we assume that O_S fires with the same oscillation period π_PM as the pacemaker PM, but with a delay ϕ_S. The parameter c ∈ R⁺ is some arbitrary constant that is sufficiently large so that 2^{−c} ≤ min(L/2, π_PM). We will now describe the mechanisms for simulating the stack operations POP and PUSH on a bit string ⟨b_1, ..., b_ℓ⟩ that is stored in ϕ_S. The stack operation POP determines the value of the top-bit b_1, and then replaces the stack content ⟨b_1, ..., b_ℓ⟩ by ⟨b_2, ..., b_ℓ⟩. In an SNN one can determine the value of b_1 from ϕ_S by testing whether ϕ_S ≥ 2^{−1−c}. For that purpose one employs a module that carries out the operation COMPARE(≥ 2^{−1−c}) (see the preceding section). To change the phase difference ϕ_S from Σ_{i=1}^{ℓ} b_i · 2^{−i−c} to Σ_{i=1}^{ℓ−1} b_{i+1} · 2^{−i−c} one first replaces ϕ_S by ϕ_S − b_1 · 2^{−1−c}. For the case b_1 = 1 this can be carried out by directing an EPSP from O_S through a suitable delay module, by halting simultaneously the oscillation of O_S with the help of an inhibition module, and by restarting the oscillation of O_S with an EPSP from the considered delay module. Note that we can employ at this point a simple delay module as described in Section 2.2, because in the case b_1 = 1 the length of the desired shift of the phase difference does not depend on its current value. It remains to carry out a SHIFT-LEFT operation, which replaces the phase difference Σ_{i=2}^{ℓ} b_i · 2^{−i−c} by
2 · Σ_{i=2}^{ℓ} b_i · 2^{−i−c} = Σ_{i=2}^{ℓ} b_i · 2^{−(i−1)−c} = Σ_{i=1}^{ℓ−1} b_{i+1} · 2^{−i−c}.

This operation cannot be implemented by a delay module, since it has to shift the phase difference by an amount that depends on the values of ℓ and b_2, ..., b_ℓ. Instead, we have to employ a module that carries out the operation MULTIPLY(2) (see Section 2.6).
To simulate the stack operation PUSH one has to replace for a given b_0 ∈ {0, 1} the current phase difference ϕ_S = Σ_{i=1}^{ℓ} b_i · 2^{−i−c} of the oscillator O_S by Σ_{i=1}^{ℓ+1} b_{i−1} · 2^{−i−c}. Our simulation of PUSH consists of two separate parts: a SHIFT-RIGHT operation that changes the current phase difference to Σ_{i=2}^{ℓ+1} b_{i−1} · 2^{−i−c}, and a subsequent ADD(γ) operation that adds γ := b_0 · 2^{−1−c} to this phase difference. Obviously ADD(γ) can be implemented in an analogous way as the subtraction of b_1 · 2^{−1−c} from ϕ_S in the previously described simulation of POP. Thus it just remains to simulate a SHIFT-RIGHT operation, i.e., to replace the phase difference ϕ_S = Σ_{i=1}^{ℓ} b_i · 2^{−i−c} of size ≤ L/2 by ϕ_S/2 = Σ_{i=2}^{ℓ+1} b_{i−1} · 2^{−i−c}. For that purpose we employ a module for the operation MULTIPLY(1/2), as constructed in the preceding section.
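The arithmetic that these POP and PUSH mechanisms perform on ϕ_S can be summarized in a plain rational-arithmetic sketch (not a spiking simulation; the precision constant c = 2 and the use of exact fractions are illustrative choices):

```python
# Sketch of the phase-coded stack of Section 2.7: the stack content
# <b_1,...,b_l> is stored as phi_S = sum_i b_i * 2^(-i-c).
from fractions import Fraction

C = 2  # precision constant c; assumed large enough that 2**-C <= min(L/2, pi_PM)

def encode(bits):
    """Stack content <b_1,...,b_l> -> phase difference phi_S."""
    return sum(Fraction(b, 2 ** (i + C)) for i, b in enumerate(bits, start=1))

def pop(phi):
    """COMPARE(>= 2^-1-c), subtract b_1 * 2^-1-c, then MULTIPLY(2) (SHIFT-LEFT)."""
    b1 = 1 if phi >= Fraction(1, 2 ** (1 + C)) else 0
    return b1, (phi - Fraction(b1, 2 ** (1 + C))) * 2

def push(phi, b0):
    """MULTIPLY(1/2) (SHIFT-RIGHT), then ADD(b_0 * 2^-1-c)."""
    return phi / 2 + Fraction(b0, 2 ** (1 + C))

phi = encode([1, 0, 1])
b, phi = pop(phi)        # b == 1, remaining stack is <0, 1>
phi = push(phi, 1)       # stack is <1, 0, 1> again
assert b == 1 and phi == encode([1, 0, 1])
```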
2.8 Simulation of an Arbitrary Fixed Turing Machine by an SNN. We will show in this section that the previously constructed modules suffice to construct for any given Turing machine M an SNN N_M (whose structure may depend on M) that can simulate M in real-time. According to the notion of a real-time computation (see Section 1) we assume that the given Turing machine M processes a sequence ⟨⟨x(j), y(j)⟩⟩_{j∈N} with x(j), y(j) ∈ {0,1}* in real-time. We assume that the inputs x(j) are presented to M on a read-only input tape, and the outputs y(j) are written by M on some write-only output tape. We will assume that the simulating SNN N_M receives each input x(j) ∈ {0,1}* in the form of a time difference ϕ between two input spikes, with ϕ = Σ_{i=1}^{ℓ} b_i · 2^{−i−c} for x(j) = ⟨b_1, ..., b_ℓ⟩. We will arrange that N_M delivers its outputs y(j) in the same form (as a time difference between two output spikes). It is easy to see that any Turing machine M, with any finite number d of two-way infinite read/write tapes, can be simulated in real-time by a similar machine that has 2d stacks, but no tapes (see, e.g., Hopcroft and Ullman 1979). We will call the latter type of machine also a Turing machine. In this simulation one uses two stacks for the simulation of each tape: one stack for simulating the part of the tape that lies to the left of the current position of the tape-head, and another stack for simulating the part of the tape to the right of the tape-head. In principle it would suffice to consider a Turing machine with 1 tape (or 2 stacks), since this type of Turing machine can simulate any other Turing machine (although not in real-time). However, it is known that various concrete problems (especially several pattern-matching problems) can be solved faster on a Turing machine that has more than one tape (see, e.g., Hopcroft and Ullman 1979; Maass 1985; and Maass et al. 1987).
Therefore, and because it does not cause any extra work, we simulate an arbitrary Turing machine M with any number k of stacks by an SNN N_M. At any computation step the Turing machine M may POP or PUSH a symbol on each of its k stacks. We assume for simplicity that the stack alphabet of M is binary (i.e., M can push 0 or 1 on each stack, and pop
a binary symbol, or receive the signal "bottom-of-stack" if the stack is empty). Furthermore, we assume that the input for the computation of M is given as the initial content of the first one of the k stacks, and that the output of M consists of the final content of the last one of the k stacks (at the moment when the machine halts). If Q is the (finite) set of states of M, then after assigning a number in binary notation to each state in Q the transition function of M can be encoded by a function F_M : {0,1}^{⌈log|Q|⌉+k} → {0,1}^{⌈log|Q|⌉+k}. We assume here that the state of M indicates on which of the stacks a POP or PUSH has to be carried out. Thus to simulate the finite control of M by an SNN, it suffices to employ a module that can compute an arbitrary given function from {0,1}^{⌈log|Q|⌉+k} into itself. We assume here that the ⌈log|Q|⌉ + k input and output bits of this function are stored in a corresponding number of oscillators with two states (dormant/oscillating). According to Lupanov (1973), one can compute any function F : {0,1}^{⌈log|Q|⌉+k} → {0,1}^{⌈log|Q|⌉+k} on a feedforward threshold circuit with O(|Q|^{1/2} · 2^{k/2}) gates. In addition, Horne and Hush (1994) have shown that any such function F can be computed by a threshold circuit of depth 4 with O[|Q|^{1/2} · 2^{k/2} · (log|Q| + k)] gates, using only weights and thresholds from {−1, 0, 1}. Hence our previously described simulation of an arbitrary threshold circuit on an SNN in Section 2.5 allows us to simulate in N_M the finite control of M with a module of O[|Q|^{1/2} · 2^{k/2}] neurons (provided the SNN may use arbitrarily large weights). Furthermore, the quoted result by Horne and Hush in conjunction with our construction in Section 2.5 implies that with O[|Q|^{1/2} · 2^{k/2} · (log|Q| + k)] neurons one can implement in N_M the finite control of M in such a way that only very simple weights from [0, 1] are needed in N_M, and that the simulation of each computation step of M requires only O(1) "machine cycles" of N_M. More precisely, each computation step of M is simulated by N_M in a time interval in which the pacemaker PM fires ≤ κ times, where κ is some absolute constant that is independent of |Q|, k, the length of the current input of M, and the number of the previously simulated computation steps of M. Apart from the finite control component, the SNN N_M consists of a module of O(1) neurons for each of the k stacks, and O(1) neurons that implement the pacemaker PM. In addition N_M uses O(log|Q| + k) neurons for other oscillators that serve as temporary registers for bits. Thus N_M consists altogether of at most O[|Q|^{1/2} · 2^{k/2} · (log|Q| + k)] neurons, and the simulation of any computation step of M involves at most O[|Q|^{1/2} · 2^{k/2} · (log|Q| + k)] firings of neurons in N_M. After N_M has simulated every computation step of M on the current input x(j) ∈ {0,1}*, it has generated on an oscillator O_S, which corresponds to the stack S on which M writes its output y(j) = ⟨b_1, ..., b_ℓ⟩, a phase difference ϕ_S = Σ_{i=1}^{ℓ} b_i · 2^{−i−c} with regard to the pacemaker PM. N_M outputs two spikes, where one is generated by PM and the other one by O_S, before receiving its next input. Since for fixed M the parameters |Q| and k can be viewed as constants,
Figure 9: Mechanism of the weight-to-phase transformation module.
N_M just uses O(1) spikes for the simulation of each computation step of M. Hence N_M simulates M in real-time.
2.9 Weight-to-Phase Transformation. At this point the only missing link for the construction of the desired SNN N_TM(d) is a module that allows us to generate from suitable weights of an SNN the encoding of arbitrarily long (even infinitely long) bit strings, which may, for example, represent the program of a Turing machine, or an infinitely long "lookup table." The weight-to-phase transformation module constructed here will be able to generate within a fixed number of "machine cycles" any given phase difference ϕ = Σ_{i=1}^{ℓ} b_i · 2^{−i−c} of an oscillator (for arbitrary ℓ ∈ N ∪ {∞} and b_i ∈ {0, 1}) from suitable weights between 0 and 1. Furthermore, these weights can be chosen to be rational if ℓ ∈ N. This module will exploit effects of the firing mechanism of a neuron in an SNN that are closely related to those that we had used in Section 2.6 to multiply the phase of an oscillator with a constant factor. To allow a unique decoding of infinitely long bit sequences from phase differences ϕ we adopt the convention that b_i = 0 for infinitely many i ∈ N in case that ℓ = ∞. We consider the same configuration with neurons u_1, u_2, v, and an inhibition module as for MULTIPLY(β) in Section 2.6. However, instead of shifting the firing time of u_2, we are now interested in the consequences of multiplying the weight on the edge from u_1 to v with some factor w ∈ [0, 1] (see Fig. 9). We choose values for the delays so that for t̂_i := t_i + Δ_{u_i,v} there exists some t* ≥ max(t̂_1, t̂_2) with t* − t̂_1 = σ_2 and t* − t̂_2 = σ_1. Furthermore, we choose positive weights w_{u_i,v} so that w_{u_2,v} · s_up = 2 · w_{u_1,v} · s_down and w_{u_1,v} · ε^E(t* − t̂_1) + w_{u_2,v} · ε^E(t* − t̂_2) = Θ(0). To analyze the consequences of multiplying the weight w_{u_1,v} with some w ∈ [0, 1], we consider for arbitrary w ∈ [0, 1] the point t_w ≥ t* that satisfies

w · w_{u_1,v} · ε^E(t_w − t̂_1) + w_{u_2,v} · ε^E(t_w − t̂_2) = Θ(0).
Together with the preceding equations and conditions (3) and (4) from our basic assumptions on ε this yields

w · w_{u_1,v} · [ε(t* − t̂_1) + s_up · (t_w − t*)] + w_{u_2,v} · [ε(t* − t̂_2) − s_down · (t_w − t*)] = Θ(0),

or equivalently

t_w − t* = ((1 − w) · ε(t* − t̂_1)) / ((w − 1/2) · s_up).
Then analogous arguments as in Section 2.6 show that if w ∈ [0,1] is chosen so that the right-hand side of this equation has a value in [0, L/2], then the value for t_w that results from this equation is, in fact, the uniquely determined firing time of v in [t*, t* + L/2] if the weight on the edge ⟨u_1, v⟩ is multiplied with w. In particular, the value t_w − t* = L/2 of the shift in the firing time of v is achieved for

w_L = (ε(t* − t̂_1) + s_up · L/4) / (ε(t* − t̂_1) + s_up · L/2).
Thus w_L ∈ [0,1), and the function w ↦ t_w − t* maps [w_L, 1] one-one onto [0, L/2]. The inverse of this map is defined by

w = (ε(t* − t̂_1) + s_up · (t_w − t*)/2) / (ε(t* − t̂_1) + s_up · (t_w − t*)).
One can derive from the basic assumptions on Θ and ε_{u,v} that w_{u_2,v} ∈ Q. Hence the preceding formula in combination with these basic assumptions implies that one can achieve any rational phase shift t_w − t* ∈ [0, L/2] with a rational weight w · w_{u_1,v} on the edge ⟨u_1, v⟩. Finally, by our choice of c one has Σ_{i=1}^{ℓ} b_i · 2^{-i-c} ∈ [0, L/2] for any values of ℓ ∈ N ∪ {∞} and b_i ∈ {0,1}. Hence in a preprocessing phase of an SNN any given finite or infinite bit sequence (b_1, b_2, ...) can be "loaded" [with only O(1) spikes involved] from the value of a certain weight of the SNN into the form of a phase difference φ_S = Σ_{i=1}^{∞} b_i · 2^{-i-c} of an oscillator O_S. For that purpose one has to ensure that the considered firings of neurons u_1 and u_2 (as well as of the involved inhibition module; see the corresponding construction in Section 2.6) are triggered by EPSPs from the pacemaker PM. Thus we have shown that the weights of an SNN can essentially play the role of a "read-only memory" of unlimited capacity.
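The arithmetic of this encoding is easy to check numerically. The following sketch is purely illustrative (the concrete value of the constant c and the function names are assumptions, not part of the construction): it converts a bit string into the phase value φ = Σ_i b_i · 2^{-i-c} and decodes it again, with zeros forced at the even positions as in the convention above.

```python
# Illustrative check of the phase encoding phi = sum_i b_i * 2^(-i-c).
# The value of c and the helper names are assumptions for this sketch.
C = 2  # hypothetical choice of the constant c

def bits_to_phase(bits, c=C):
    """Encode (b_1, ..., b_n) as phi = sum_{i=1..n} b_i * 2^(-i-c)."""
    return sum(b * 2.0 ** (-(i + 1) - c) for i, b in enumerate(bits))

def phase_to_bits(phi, n, c=C):
    """Greedily recover the first n bits; works since b_{2i} = 0 bounds the tail."""
    bits = []
    for i in range(1, n + 1):
        weight = 2.0 ** (-i - c)
        bit = 1 if phi >= weight else 0
        bits.append(bit)
        phi -= bit * weight
    return bits

bits = [1, 0, 1, 0, 0, 0]          # data at odd positions, zeros at even ones
phi = bits_to_phase(bits)
assert phase_to_bits(phi, len(bits)) == bits
```

All quantities here are exact dyadic rationals, so floating-point arithmetic introduces no rounding error in this example.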
2.10 Construction of N_TM(d). In this last part of the proof of Theorem 2.1 we construct an SNN N_TM(d) that has those properties that are claimed in Theorem 2.1. Let d ∈ N be any given constant. Let M_U be a "universal Turing machine" with d + 1 tapes that can simulate any Turing machine with d tapes in real-time. More precisely, M_U is a Turing machine that receives two finite binary strings x and e on two different tapes as input, and which simulates for any e ∈ {0,1}* the d-tape Turing machine whose program is encoded by e in real-time on input x (with some suitable default convention for the case that e is not the encoding of some Turing machine program). The construction of such universal Turing machines M_U is a standard part of the proof of the time hierarchy theorem for Turing machines (see, e.g., Hopcroft and Ullman 1979, or Lewis and Papadimitriou 1981). The desired SNN N_TM(d) will basically be the SNN that one gets by applying the construction from Section 2.8 to the Turing machine M := M_U, but with 2d + 2 stacks instead of the d + 1 tapes. The only additional work that remains to be done to satisfy the claim of Theorem 2.1 is to change the way in which N_{M_U} receives its input. Ordinarily N_{M_U} would expect to get its second input e = (e_1, ..., e_ℓ) ∈ {0,1}* in the same way as its first input x ∈ {0,1}*, in the form of two input spikes with time distance Σ_{i=1}^{ℓ} e_i · 2^{-i-c}. In contrast to that, the constructed SNN N_TM(d) receives only a single input x in the form of a time difference between two input spikes. On the other hand its weights may depend on the simulated Turing machine M. Thus we may choose a rational weight w ∈ [0,1] that can be transformed with the help of the module from Section 2.9 into a phase difference t_w − t* = Σ_{i=1}^{ℓ} e_i · 2^{-i-c}. This transformation can be carried out in a preprocessing phase within O(1) firings of PM. After that, the computation of N_TM(d) proceeds exactly like that of N_{M_U}.
To prove the second part of the claim of Theorem 2.1, one exploits the obvious fact that any function F : {0,1}* → {0,1}* can be computed by a Turing machine M_F with infinitely many bits of "advice," i.e., by a Turing machine M_F that has at the beginning of each computation on one of its tapes the same infinite sequence (e_i)_{i∈N} of bits e_i ∈ {0,1} as initial tape inscription. This sequence (e_i)_{i∈N} may for example encode a lookup table for all pairs ⟨x, F(x)⟩, x ∈ {0,1}*. We may assume that (e_i)_{i∈N} also encodes the program of the Turing machine M_F, and that M_F altogether has only 2 tapes. As usual, the Turing machine M_F receives on another tape the input x ∈ {0,1}*. To simulate this Turing machine M_F on the SNN N_TM(d), we just have to equip N_TM(d) with a suitable real weight w ∈ [0,1] that can be transformed (as described in Section 2.9) in a preprocessing phase within O(1) firings of PM into the phase difference t_w − t* = Σ_{i=1}^{∞} e_i · 2^{-2i-c} of an oscillator. After that, N_TM(d) will simulate the computation of the Turing machine M_F (with initial tape content (e_i)_{i∈N} on one of its tapes) in the usual manner. Thus N_TM(d) will output F(x) for any given input x ∈ {0,1}*. Hence N_TM(d) can compute the
(arbitrarily given) function F : {0,1}* → {0,1}*. This concludes the proof of Theorem 2.1. □
An important measure for the complexity of a neural network in the context of learning is its Vapnik-Chervonenkis dimension (VC-dimension); see Vapnik and Chervonenkis (1971). Various results from statistical theory suggest that the VC-dimension of a neural net is proportional to the number of examples needed to train that neural net (for references and a brief survey see, e.g., Maass 1995a).
Corollary 2.2. One can construct with any type of neurons whose response and threshold functions satisfy our basic assumptions an SNN N of finite size, so that the VC-dimension of the class of Boolean functions that are computable on N (with different assignments of rational values from [0,1] to its weights) is infinite.
A proof of the following result is contained as a special case in the proof of Theorem 2.1 (see especially Section 2.8).
Corollary 2.3. Any deterministic finite automaton with q states can be simulated in real-time (both for decision problems, or with intermediate output as a Mealy or Moore machine) by an SNN with O(q^{1/2}) neurons [or with O(q^{1/2} · log q) neurons if only weights from [0,1] are permitted].
The following corollary exhibits another result of independent interest that was shown in the preceding proof (Section 2.5).
Corollary 2.4. One can construct with any type of neurons whose response and threshold functions satisfy our basic assumptions, for any given feedforward Boolean threshold circuit C with arbitrary weights, s gates, and d hidden layers, an SNN S_C with O(s) neurons that simulates any computation of C within a time interval of length O(d). Furthermore, one can also simulate C within time O(d) by an SNN S'_C with poly(s) neurons that uses only weights w ∈ [0,1].
Finally, we observe that an application of the techniques from the proof of Theorem 2.1 to SNNs with discrete time (see the definitions in Section 1) yields the following result.
Corollary 2.5. One can construct for any Turing machine M with any type of neurons whose response and threshold functions satisfy our basic assumptions an SNN N_M so that for any s ∈ N the SNN N_M with discrete firing times from {i · ρ : i ∈ N} for some ρ with 1/ρ = 2^{s+O(1)} and Δ_max − Δ_min ≥ 2ρ can simulate in real-time arbitrary computations of M that involve at most s tape cells of M.
For the proof of Corollary 2.5 one exploits the fact that because of the condition Δ_max − Δ_min ≥ 2ρ the same construction as in Section 2.2 yields modules that achieve any given real-valued (!) delay ≥ Δ_min. With the help of such delay modules one can then arrange that the time points t* in the subsequent constructions of other modules, as well as the time points when the EPSPs reach their maximal value ε_max (for the simulation of a
threshold circuit), all belong to the set {i · ρ : i ∈ N}. For the simulation of a stack of a Turing machine M the construction from the proof of Theorem 2.1 works without changes for SNNs with discrete time steps of length ρ, provided that 2^{-ℓ_0-c} ∈ {i · ρ : i ∈ N} for the maximal length ℓ_0 of any bit string that is stored in a stack of M.

3 Beyond Turing Machines
We have shown in Theorem 2.1 that one can build from arbitrary neurons, whose response and threshold functions satisfy certain basic assumptions, an SNN that can simulate any Turing machine. However, SNNs are strictly more powerful than Turing machines for two reasons:
1. An SNN can receive real numbers as input, and give real numbers as output (in the form of time differences between pairs of spikes).

2. We had constructed in Section 2.6 modules for an SNN that can carry out the operations COMPARE(≥ α) and MULTIPLY(β), for a wide range of constants α and β, applied to arbitrary real-valued arguments φ from a certain interval. If one applies, for example, such an operation to a phase of the form φ = Σ_{i=1}^{ℓ} b_i · 2^{-i-c}, then such a module executes with O(1) spikes an operation that involves the whole bit string (b_1, ..., b_ℓ) of arbitrary length ℓ ∈ N ∪ {∞}. In contrast to that, any Turing machine operation can affect at best a constant number of its stored bits.

In this section we will show that in addition one can construct modules for an SNN that ADD, SUBTRACT, or COMPARE any two real-valued phase differences φ_1, φ_2 ∈ [0, L/4] of two different oscillators. This result turns out to be quite important, since in combination with (1) and (2) it implies that one can simulate in real-time on an SNN any RAM with finitely many registers that stores in its registers arbitrary real numbers of bounded absolute value, and that uses arbitrary instructions of the form COMPARE, MULTIPLY(β), ADD, SUBTRACT. Furthermore, such an SNN can be built with any type of neurons whose response and threshold functions satisfy the basic assumptions from the beginning of Section 2. On the other hand, according to Maass (1994b, 1995c), any SNN with arbitrary piecewise linear response and threshold functions can be simulated in real-time by the same type of RAM. Hence the computational power of these RAMs (which we will call N-RAMs because of their close relationship to neural networks) matches exactly that of SNNs whose response and threshold functions are piecewise linear and satisfy our basic assumptions.
One can also show through mutual real-time simulations (see Maass 1995b,c) that the computational power of N-RAMs (and hence of the above-mentioned SNNs) matches exactly that of recurrent analog neural
nets with discrete time and piecewise linear activation functions (see Siegelmann and Sontag 1992). More precisely, any analog neural net with any piecewise linear activation functions can be simulated in real-time by an N-RAM; for the simulation of N-RAMs by analog neural nets one can employ, for example, the linear saturated activation function together with the Heaviside activation function in the analog neural net. This result implies as a side-result that these two activation functions together are "universal" for all piecewise linear activation functions in recurrent analog neural nets (since they allow such a net to simulate in real-time any other recurrent analog neural net with arbitrary piecewise linear activation functions). Hence N-RAMs also provide a very useful intermediate link for the comparison of SNNs (modeling spike coding) and analog neural nets (modeling frequency coding). We defer the detailed discussion of N-RAMs [which are somewhat weaker than the well-known model of Blum et al. (1989) and also related to the computational model considered in Koiran (1993)] and the proofs of the above-mentioned results to a subsequent article (Maass 1995c). However, we will describe in this section the construction of SNN modules for the operations ADD, SUBTRACT, and COMPARE, since those constructions are closely related to the preceding constructions in this article. These constructions provide the tools for the real-time simulation of N-RAMs by SNNs.
Consider two oscillators O_1 and O_2 of an SNN, both with oscillation period T_PM. Let φ_i be the phase difference between O_i and the pacemaker PM, i = 1, 2. We construct a module that receives a spike from each of the oscillators O_1 and O_2, and which is then able to kick off a third oscillator O_3 with oscillation period T_PM in such a way that it will have phase difference φ_1 + φ_2 to PM. This module for the operation ADD employs a similar arrangement of three neurons u_1, u_2,
and v, as the modules for COMPARE(≥ α) and MULTIPLY(β) that were constructed in Section 2.6. We assume that neuron u_i is triggered by a spike from oscillator O_i to fire at a certain time t_i. We choose delays Δ_{u_i,v} in such a way that for t̂_i := t_i + Δ_{u_i,v} there exists some t* ≥ max(t̂_1, t̂_2) so that t* − t̂_1 = t* − t̂_2 = σ_1 in case that φ_1 = φ_2 = 0. We choose w > 0 so that 2w · ε(σ_1) = Θ(0), and we set w_{u_1,v} = w_{u_2,v} = w. We also add an inhibition module, which makes it impossible for v to fire within the time interval [t* − τ_end, t* − L/2) for any values of φ_1, φ_2, and which has no influence on P_v(t) for t ≥ t* [as in the construction for MULTIPLY(β) in Section 2.6]. Then for arbitrary values φ_1, φ_2 ∈ [0, L/2] the neuron v fires at a time t_a with t_a − t* ∈ [0, L/2] such that

w · s_up · [t_a − (t* + φ_1)] + w · s_up · [t_a − (t* + φ_2)] = 0,

or equivalently

t_a − t* = (φ_1 + φ_2)/2
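The firing-time equation of the ADD module can be checked numerically: on the linear segments the summed deviation from the threshold grows with slope 2 · w · s_up and vanishes exactly at the average of the two arrival offsets. A sketch with assumed concrete values:

```python
# Solve the ADD firing condition numerically and compare with (phi1+phi2)/2.
# All concrete values (w, s_up, t*, phi1, phi2) are assumptions.
w, s_up, t_star = 1.0, 1.0, 10.0
phi1, phi2 = 0.12, 0.30

def deviation(t):
    # summed deviation of the potential from the threshold at time t
    return w * s_up * (t - (t_star + phi1)) + w * s_up * (t - (t_star + phi2))

lo, hi = t_star, t_star + 1.0        # deviation is negative at lo, positive at hi
for _ in range(60):                  # bisection for the zero crossing
    mid = (lo + hi) / 2
    if deviation(mid) < 0:
        lo = mid
    else:
        hi = mid

assert abs(hi - (t_star + (phi1 + phi2) / 2)) < 1e-9
```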
Figure 10: Mechanism of the module for ADD.
(see Fig. 10). The factor 1/2 of φ_1 + φ_2 can be removed with the help of a subsequent module for MULTIPLY(2) (see Section 2.6). In this way the module for ADD constructed here can generate an output spike at time l · T_PM + (φ_1 + φ_2) for some l ∈ N.
The construction of a module that computes the difference φ_1 − φ_2 of the phase differences φ_1, φ_2 of two oscillators O_1 and O_2 with φ_1 ≥ φ_2 is quite similar. For arbitrary given values φ_1, φ_2 ∈ [0, L/2] with φ_1 ≥ φ_2 we first employ a module MULTIPLY(1/2) that replaces φ_1 by φ̂_1 := φ_1/2. For an arrangement of neurons u_1, u_2, v as for ADD we choose delays Δ_{u_i,v} so that t* − t̂_1 = σ_1 and t* − t̂_2 = σ_2 in case that φ_1 = φ_2 = 0, and weights w_{u_1,v}, w_{u_2,v} so that w_{u_1,v} · s_up = 2 · w_{u_2,v} · s_down and w_{u_1,v} · ε(σ_1) + w_{u_2,v} · ε(σ_2) = Θ(0). Furthermore, we employ an inhibition module that makes it impossible for neuron v to fire within the time interval [t* − τ_end, t* − L/2] for any values of φ_1, φ_2 ∈ [0, L/2], but that has no influence on P_v(t) for t ≥ t*. Then for any phase differences φ_1, φ_2 ∈ [0, L/2] with φ_1 ≥ φ_2 the phase difference φ_1 is first transformed to φ̂_1 = φ_1/2. Neuron u_1 receives a spike with phase difference φ̂_1, and u_2 receives a spike with phase difference φ_2. The resulting firing time t_a of neuron v is determined by (see Fig. 11)

w_{u_1,v} · s_up · [t_a − (t* + φ̂_1)] − w_{u_2,v} · s_down · [t_a − (t* + φ_2)] = 0.
This yields

t_a − t* = 2φ̂_1 − φ_2 = φ_1 − φ_2.
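The SUBTRACT race can be checked the same way: with the slope condition w_{u_1,v} · s_up = 2 · w_{u_2,v} · s_down, the zero of the deviation lies at t* + 2φ̂_1 − φ_2. A numerical sketch with assumed values:

```python
# Check that the SUBTRACT firing condition yields t_a - t* = phi1 - phi2.
# All concrete values are assumptions; the slope condition is the one in the text.
s_up, s_down, w2, t_star = 1.0, 1.0, 1.0, 10.0
w1 = 2 * w2 * s_down / s_up           # enforce w1 * s_up = 2 * w2 * s_down
phi1, phi2 = 0.4, 0.1
phi1_half = phi1 / 2                  # MULTIPLY(1/2) applied to phi1

t_a = t_star + 2 * phi1_half - phi2   # claimed firing time
lhs = (w1 * s_up * (t_a - (t_star + phi1_half))
       - w2 * s_down * (t_a - (t_star + phi2)))
assert abs(lhs) < 1e-12               # the firing condition is satisfied
assert abs((t_a - t_star) - (phi1 - phi2)) < 1e-12
```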
Finally, it is easy to see that the module for COMPARE(≥ α) from Section 2.6 in combination with the preceding module for SUBTRACT
Figure 11: Mechanism of the module for SUBTRACT.
allows us to build a module for the test COMPARE, i.e., a module that decides for any two given phase differences φ_1, φ_2 ∈ [0, L/4] of two oscillators O_1 and O_2 with oscillation period T_PM whether φ_1 ≥ φ_2. For that purpose one first transforms φ_1 with the help of a delay module to φ'_1 := φ_1 + L/4. It is then clear that φ'_1 ≥ φ_2, and the module for SUBTRACT can be employed to compute φ'_1 − φ_2 = φ_1 − φ_2 + L/4. With the help of a subsequent module for COMPARE(≥ L/4) we can then decide whether φ_1 − φ_2 + L/4 ≥ L/4, i.e., whether φ_1 ≥ φ_2. Of course one can also build directly a module for COMPARE by using a variation of the construction for COMPARE(≥ α) in Section 2.6.
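The reduction just described is small enough to state as code. The following sketch assumes a concrete value for L and uses ordinary real arithmetic in place of the SNN modules:

```python
# Sketch of COMPARE built from a delay shift, SUBTRACT, and COMPARE(>= L/4).
# The value of L is an assumption for this illustration.
L = 1.0

def compare(phi1, phi2):
    """Decide phi1 >= phi2 for phi1, phi2 in [0, L/4]."""
    shifted = phi1 + L / 4    # delay module: phi1' = phi1 + L/4 >= phi2
    diff = shifted - phi2     # SUBTRACT is applicable since phi1' >= phi2
    return diff >= L / 4      # COMPARE(>= L/4)

assert compare(0.2, 0.1) and not compare(0.1, 0.2) and compare(0.15, 0.15)
```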
4 Variations of the Constructions for Related Models of Spiking Neurons

We have assumed for the constructions in the preceding two sections that the response and threshold functions are stereotyped, i.e., that apart from their individual delays Δ_{u,v} the functions ε_{u,v} and Θ_v all have the same shape. This assumption is convenient, but not really necessary for the preceding constructions. The same constructions can also be carried out if these functions are different for different edges ⟨u,v⟩ ∈ E and different v ∈ V. More precisely, it suffices to assume that the response functions ε_{u,v} are defined with the help of individual delays Δ_{u,v} and individual functions ε^E_{u,v} and ε^I_{u,v}, so that ε_{u,v}(x) = 0 for x ∈ [0, Δ_{u,v}] and ε_{u,v}(Δ_{u,v} + x) = ε^E_{u,v}(x) in the case of an EPSP, and, respectively, ε_{u,v}(Δ_{u,v} + x) = ε^I_{u,v}(x) in the case of an IPSP, where the functions ε^E_{u,v}, ε^I_{u,v}, Θ_v satisfy the basic assumptions from the beginning of Section 2. However, these functions ε^E_{u,v}, ε^I_{u,v}, and Θ_v may be arbitrarily different, with different values of the parameters τ_end, Θ(0), σ_1, L, s_up, and s_down for different neurons u, v (in fact one may assume that these functions are chosen by an "adversary"). Under these relaxed conditions we have to assume, however, that we can choose arbitrarily large delays Δ_{u,v} and weights w_{u,v} after the individual functions ε^E_{u,v}, ε^I_{u,v}, and Θ_v are given to us. Of course one can trade off parts of the latter condition against some quite reasonable conditions on the individual functions ε^E_{u,v}, ε^I_{u,v}, and Θ_v.
One can also replace the basic assumptions at the beginning of Section 2 by some alternative assumptions about ε^E_{u,v}, ε^I_{u,v}, and Θ_v. For example, one can postulate the existence of suitable linear segments of ε^I_{u,v} or Θ_v, and then exploit at the neuron v in the module constructions of Sections 2 and 3 a "timing race" between an EPSP and an IPSP, or between an EPSP and the declining part of Θ_v (instead of the race between two EPSPs). Without a "reset" at each firing of neuron v (see below) one needs, however, for the latter option (EPSPs versus Θ_v) more specific assumptions about these functions to control undesired side effects that may result from the end segments of EPSPs that caused the preceding firing of v.
We also would like to point out that the full power of the module COMPARE(≥ α) from Section 2.6 is actually not needed if one just wants to simulate Turing machines on an SNN. If one employs a less concise encoding of bit strings by assuming also that b_{2i} = 0 for all i ≤ ℓ/2 for all finite bit strings (b_1, ..., b_ℓ) that are encoded in the phase difference φ = Σ_{i=1}^{ℓ} b_i · 2^{-i-c} of an oscillator, it is guaranteed that φ ≥ 2^{-1-c} or φ ≤ 2^{-2-c} (independently of ℓ and of the values of the b_i ∈ {0,1}). This "gap" of fixed length between the possible values of φ allows us to determine whether b_1 = 1 just with the help of delay and inhibition modules [instead of using the more subtle mechanism of COMPARE(≥ α)]. But the module for COMPARE(≥ α) is of independent interest, since it shows in the context of Section 3 that discontinuous real-valued functions can also be computed on an SNN.
The implicit assumptions about the firing mechanism of neurons in the version of the SNN model from Section 1 ignore the well-known "reset" and "adaptation" phenomena of neurons. However, one can easily adjust the definition of the SNN model so that it also takes these features into account. To model a reset of a neuron at its moment of firing, one can adjust the definition of the set F_v of firing times of a neuron v by deleting (or modifying) in the definition of P_v(t) those EPSPs and IPSPs from presynaptic neurons u that had already arrived at v before the most recent firing of v. Adaptation of a neuron v refers to the observation that the firing rate of a biological neuron may decline after a while even if the incoming
excitation [i.e., P_v(t)] remains at a constant high level (see, for example, Kandel et al. 1991). This effect can be reflected in the SNN model by replacing the term Θ_v(t − s) in the definition of the set F_v of firing times by a sum over Θ_v(t − s) for several recent firing times s ∈ F_v [and by assuming that Θ_v(x) returns only relatively slowly to its initial value Θ(0)].
We would like to point out that all of our constructions in Sections 2 and 3 are compatible with our above-mentioned changes in the SNN model for modeling the reset and adaptation of neurons. The reason for this is that we can arrange in the constructions of Sections 2 and 3 that all "relevant" firings of a neuron v are spaced so far apart that reset and adaptation of v have no effect on those critical firing times.
Regarding the simulation of threshold circuits by SNNs (see Section 2.5) we would like to point out that the corresponding SNN module can be constructed with fewer neurons if one makes further assumptions about the shape of EPSP and IPSP response functions. For example, one can simulate directly a threshold gate with weights α_i of different sign in a similar way as we have simulated monotone threshold gates in Section 2.5, provided that the EPSPs (modeling inputs with positive weights) and IPSPs (modeling inputs with negative weights) move linearly within the same time span from 0 to their extremal values.
Finally, we would like to point out that the class of piecewise constant functions (i.e., the class of step functions) provides an example for a class of response and threshold functions that do not satisfy our basic assumptions from Section 2, but that can still be used to build for any Turing machine M an SNN N'_M that can simulate M (although not in real-time). We assume here that the response functions are piecewise constant (but not identically zero), and that the threshold functions are arbitrary functions (e.g., piecewise constant) that satisfy condition (1) of our basic assumptions. One can then build oscillators, as well as delay, inhibition, and synchronization modules, in the same way as in Section 2, and one can also simulate arbitrary threshold circuits in the same way. Furthermore, one can use the phase difference between an oscillator O with the same oscillation period T_PM as the pacemaker PM to simulate a counter.
For that purpose one employs a delay module D with a suitable delay ρ > 0 (so that k · ρ = l · T_PM for any k, l ∈ N implies that k = l = 0). One can then use the phase difference between O and PM to record how often the "spike" in O has been directed in the course of the computation through this delay module D. Hence one can store in the SNN an arbitrary natural number k, which can be incremented and decremented by suitable modules. To decide whether k = 0, one needs a module that can carry out a special case of the operation COMPARE. Such a module cannot be built in the same way as in Sections 2 and 3, but one can employ directly the "jump" in the piecewise constant response functions considered here to test whether two neurons fire exactly at the same time.
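The incommensurability condition is what makes the phase a faithful counter: if k · ρ = l · T_PM only for k = l = 0, then distinct counts k give distinct phases k · ρ mod T_PM, and k = 0 exactly when O is in phase with PM. A numerical sketch (the concrete values of ρ and T_PM are assumptions):

```python
# Counter stored as a phase difference: each pass through the delay module D
# shifts the phase of O by rho relative to PM. Values chosen so that rho and
# T_PM are incommensurable (rho irrational, T_PM = 1).
import math

T_PM = 1.0
rho = math.sqrt(2) - 1          # k*rho = l*T_PM forces k = l = 0

def phase_after(k):
    """Phase of oscillator O after the spike passed k times through D."""
    return (k * rho) % T_PM

phases = [phase_after(k) for k in range(50)]
assert len(set(phases)) == 50   # distinct phases: the count k is recoverable
assert phase_after(0) == 0.0    # k = 0 iff O is in phase with PM
```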
It is well known (see Hopcroft and Ullman 1979) that any Turing machine M can be simulated (although not in real-time) by a machine M' that has no tapes or stacks, but two counters. The preceding argument shows that such an M' (in fact, a machine with any finite number of counters) can be simulated in real-time by some finite SNN N_{M'} with piecewise constant response and threshold functions.
The effect of the shape of postsynaptic potentials on the computational power of networks of spiking neurons is investigated more thoroughly in Maass and Ruf (1995). It is shown there that computations with single spikes in networks of spiking neurons become substantially slower if they cannot make use of increasing and decreasing linear segments of EPSPs.

5 Conclusion
We have analyzed the computational power of a simple formal model SNN for networks of spiking neurons. In particular we have shown that if the response and threshold functions of the SNN satisfy some rather weak basic assumptions (see Section 2), then one can build modules that can synchronize the spiking of different network parts, as well as modules that can multiply the phase difference between two oscillators with any given constant, and add, subtract, or compare the phase differences of different oscillators (see the constructions in Sections 2 and 3). With the help of these quite powerful computational operations an SNN can simulate in real-time for Boolean-valued input any Turing machine, and for real-valued input any N-RAM (a slightly weaker version of the model of Blum et al. 1989; see Section 3 of this article). On the side we would like to mention that these results also yield lower bounds for the VC-dimension of networks of spiking neurons, hence for the number of training examples needed for learning by such networks (see Maass 1994b, 1995c). One immediate consequence of this type is indicated in Corollary 2.2 of this article.
The version of the model SNN with unlimited timing precision (i.e., T = R+ in the definition in Section 1) is not biologically realistic, insofar as it does not take the effects of noise into account. From that point of view our alternative version of this model with discrete firing times from {i · ρ : i ∈ N} for some ρ > 0 is preferable (since it allows us to represent an imprecise firing anywhere in the time interval (i · ρ − ε, i · ρ + ε) in the biological system by a "symbolic" firing at time i · ρ).
Therefore it is important to note that our results about SNNs with unlimited timing precision induce corresponding results for the computational power of SNNs with discrete firing times, as we have indicated in Corollary 2.5 (see Theorem 5 in Maass 1994b, as well as Maass 1995c, for further consequences of our results for SNNs with limited timing precision). In addition, our constructions of SNN modules for the operations ADD, SUBTRACT, and MULTIPLY(β) on time differences between spikes appear to be quite
robust, in the sense that they provide approximate implementations of these operations on time differences between spikes in various models for real-valued computations in networks of spiking neurons with noise. We refer to Maass (1995d) for further results about the computational power of SNNs with noise.
The results of this article have two interesting consequences. One is that in order to show that a network of spiking neurons can carry out some specific task (e.g., in pattern recognition or pattern segmentation, or solving some binding problem; see, e.g., von der Malsburg and Schneider 1986, or Gerstner et al. 1993) it now suffices to show that a threshold circuit, a finite automaton, a Turing machine, or an N-RAM (see Section 3) can carry out that task in an efficient manner. Furthermore, the simulation results of this article allow us to relate the computational resources that are needed on the latter more convenient models (e.g., the required work space on a Turing machine) to the required resources of the SNN (e.g., the timing precision of the SNN; see Corollary 2.5). In other words, one may view N-RAMs and the other mentioned common computational models as "higher programming languages" for the construction of networks of spiking neurons. The real-time simulation methods of this article exhibit automatic methods for translating any program that is written in such a higher programming language into the construction of a corresponding SNN. In this way the "user" of an SNN may choose to ignore all worrisome implementation details on SNNs such as timing (potentially at the cost of some efficiency). Furthermore, the matching upper bound result for N-RAMs (see Maass 1995b,c) shows that the corresponding "higher programming language" is able to exploit all computational abilities of SNNs.
Second, in combination with the corresponding upper bound results for SNNs with quite arbitrary response and threshold functions (and time-dependent weights) in Maass (1995b,c), the lower bounds of this article provide for a large class of response and threshold functions exact characterizations (up to real-time simulations) of the computational power of SNNs with real-valued inputs, and for SNNs with bounded timing precision. As a consequence of these results, one can then also relate the computational power of SNNs to that of recurrent analog neural nets with various activation functions (see Section 3), thereby throwing some light on the relationships between the computational power of models of neurons with spike coding (SNNs) and models of neurons with frequency coding (analog neural nets). Furthermore, the combination of these lower and upper bound results shows that extremely simple response and threshold functions (such as, for example, those in Fig. 2 in Section 2) are universal in the sense that with these functions an SNN can simulate in real-time any SNN that employs arbitrary piecewise linear response and threshold functions. Equivalence results of this type induce some structure in the "zoo" of response and threshold functions that are mathematically interesting or occur in biological neural systems,
and they allow us to focus on those aspects of these functions that are essential for the computational power of spiking neurons.
Finally, we would like to point out that since we have based all of our investigations on the rather fine notion of a real-time simulation (see Section 1), our results provide information not just about the relationships between the computational power of the previously mentioned models for neural networks, but also about their capability to execute learning algorithms (i.e., about their adaptive qualities).
Acknowledgments
I would like to thank Wulfram Gerstner, John G. Taylor, and three anonymous referees for helpful comments.
References
Abeles, M. 1991. Corticonics: Neural Circuits of the Cerebral Cortex. Cambridge University Press, Cambridge, England.
Aertsen, A., ed. 1993. Brain Theory: Spatio-Temporal Aspects of Brain Function. Elsevier, Amsterdam.
Blum, L., Shub, M., and Smale, S. 1989. On a theory of computation and complexity over the real numbers: NP-completeness, recursive functions and universal machines. Bull. Am. Math. Soc. 21(1), 1-46.
Buhmann, J., and Schulten, K. 1986. Associative recognition and storage in a model network of physiological neurons. Biol. Cybern. 54, 319-335.
Churchland, P. S., and Sejnowski, T. J. 1992. The Computational Brain. MIT Press, Cambridge, MA.
Crair, M. C., and Bialek, W. 1990. Non-Boltzmann dynamics in networks of spiking neurons. Advances in Neural Information Processing Systems, Vol. 2, pp. 109-116. Morgan Kaufmann, San Mateo, CA.
Gerstner, W. 1991. Associative memory in a network of "biological" neurons. Advances in Neural Information Processing Systems, Vol. 3, pp. 84-90. Morgan Kaufmann, San Mateo, CA.
Gerstner, W. 1995. Time structure of the activity in neural network models. Phys. Rev. E 51, 738-758.
Gerstner, W., and van Hemmen, J. L. 1994. How to describe neuronal activity: Spikes, rates, or assemblies? Advances in Neural Information Processing Systems, Vol. 6, pp. 463-470. Morgan Kaufmann, San Mateo, CA.
Gerstner, W., Ritz, R., and van Hemmen, J. L. 1993. A biologically motivated and analytically soluble model of collective oscillations in the cortex. Biol. Cybern. 68, 363-374.
Hajnal, A., Maass, W., Pudlak, P., Szegedy, M., and Turan, G. 1993. Threshold circuits of bounded depth. J. Comput. System Sci. 46, 129-154.
Hopcroft, J. E., and Ullman, J. D. 1979. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, MA.
Horne, B. G., and Hush, D. R. 1994. Bounds on the complexity of recurrent neural network implementations of finite state machines. Advances in Neural Information Processing Systems, Vol. 6, pp. 359-366. Morgan Kaufmann, San Mateo, CA.

Judd, K. T., and Aihara, K. 1993. Pulse propagation networks: A neural network model that uses temporal coding by action potentials. Neural Networks 6, 203-215.

Kandel, E. R., Schwartz, J. H., and Jessel, T. M. 1991. Principles of Neural Science. Prentice-Hall, Englewood Cliffs, NJ.

Koiran, P. 1993. A weak version of the Blum, Shub, Smale model. In Proceedings of the 34th Annual IEEE Symposium on Foundations of Computer Science, pp. 486-495. IEEE Computer Society Press, Los Alamitos, CA.

Lapicque, L. 1907. Recherches quantitatives sur l'excitation electrique des nerfs traitee comme une polarisation. J. Physiol. Pathol. Gen. 9, 620-635.

Lewis, H. R., and Papadimitriou, C. H. 1981. Elements of the Theory of Computation. Prentice-Hall, Englewood Cliffs, NJ.

Lupanov, O. B. 1973. On circuits of threshold elements. Dokl. Akad. Nauk SSSR, Vol. 202, 1288-1291; Engl. translation in Problemy Kibernetiki, Vol. 26, 109-140.

Maass, W. 1985. Combinatorial lower bound arguments for deterministic and nondeterministic Turing machines. Trans. Am. Math. Soc. 292, 675-693.

Maass, W. 1993. Bounds for the computational power and learning complexity of analog neural nets. Proc. 25th Ann. ACM Symp. Theory Computing, 335-344.

Maass, W. 1994a. Neural nets with superlinear VC-dimension. In Proceedings of the European Conference on Artificial Neural Networks 1994 (ICANN '94); journal version appeared in Neural Comp. 6, 875-882.

Maass, W. 1994b. On the Computational Complexity of Networks of Spiking Neurons (extended abstract). Tech. Rep. 393 from May 1994 of the Institutes for Information Processing Graz. Advances in Neural Information Processing Systems, Vol. 7, pp. 183-190. MIT Press, Cambridge, MA.

Maass, W. 1995a. Vapnik-Chervonenkis dimension of neural nets. In Handbook of Brain Theory and Neural Networks, pp. 1000-1003, M. A. Arbib, ed. MIT Press, Cambridge, MA.

Maass, W. 1995b. Analog computations on networks of spiking neurons (extended abstract); appears in Proc. 7th Italian Workshop on Neural Nets 1995, World Scientific Press.

Maass, W. 1995c. Upper bounds for the computational power of networks of spiking neurons (in preparation).

Maass, W. 1995d. On the computational power of noisy spiking neurons. Tech. Rep. 412 (May 1995) of the Institutes for Information Processing, Graz, Austria; appears in Advances in Neural Information Processing Systems, Vol. 8 (1996).

Maass, W., and Ruf, B. 1995. Consequences of the shape of postsynaptic potentials for the computational power of networks of spiking neurons; appears in Proc. International Conference on Artificial Neural Networks (ICANN '95), Paris.

Maass, W., Schnitger, G., and Szemeredi, E. 1987. Two tapes are better than one for off-line Turing machines. Proc. 19th Ann. ACM Symp. Theory Computing, 94-100.
Murray, A., and Tarassenko, L. 1994. Analogue Neural VLSI: A Pulse Stream Approach. Chapman & Hall, London.

Siegelmann, H. T., and Sontag, E. D. 1992. On the computational power of neural nets. Proc. 5th ACM Workshop Comput. Learning Theory, 440-449.

Tuckwell, H. C. 1988. Introduction to Theoretical Neurobiology, Vols. 1 and 2. Cambridge University Press, Cambridge, England.

Valiant, L. G. 1994. Circuits of the Mind. Oxford University Press, Oxford, England.

Vapnik, V. N., and Chervonenkis, A. Y. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 264-280.

von der Malsburg, C., and Schneider, W. 1986. A neural cocktail-party processor. Biol. Cybern. 54, 29-40.

Watts, L. 1994. Event-driven simulation of networks of spiking neurons. Advances in Neural Information Processing Systems, Vol. 6, pp. 927-934. Morgan Kaufmann, San Mateo, CA.
Received … 25, 1994; accepted April 5, 1995.
Communicated by Richard Lippmann
NOTE
A Short Proof of the Posterior Probability Property of Classifier Neural Networks

Raul Rojas
Institut für Informatik, Freie Universität Berlin, Takustr. 9, 14195 Berlin, Germany
It is now well known that neural classifiers can learn to compute a posteriori probabilities of classes in input space. This note offers a shorter proof than the traditional ones. Only one class has to be considered and straightforward minimization of the error function provides the main result. The method can be extended to any kind of differentiable error function. We also present a simple visual proof of the same theorem, which stresses the fact that the network must be perfectly trained and have enough plasticity.

It is now well known that neural networks trained to classify an n-dimensional input x in one out of M classes can actually learn to compute the Bayesian a posteriori probabilities that the input x belongs to each class. Several proofs of this fact, differing only in the details, have been published (Bourlard and Morgan 1993; Richard and Lippmann 1991), but they can be simplified. In this note we offer a shorter proof of the probability property of classifier neural networks. Figure 1a shows the main idea of the proof. Points in an input space are classified as belonging to a class A or its complement. This is the first simplification: we do not have to deal with more than one class. In classifier networks, there is one output line for each class C_i, i = 1, ..., M. Each output C_i is trained to produce a 1 when the input belongs to class i and otherwise a 0. As the expected total error is the sum of the expected individual errors of each output, we can minimize the expected individual errors independently. This means that we need to consider only one output line and when it should produce a 1 or a 0. Assume that input space is divided into a lattice of differential volumes of size dv, each one centered at the n-dimensional point v. If at the output representing class A the network computes the value y(v) ∈ [0, 1] for any point x in the differential volume V(v) centered at v, and if we denote by p(v) the probability p[A | x ∈ V(v)], then the total expected quadratic error is

E_A = Σ_v {p(v)[1 − y(v)]² + [1 − p(v)]y(v)²},
Neural Computation 8, 41-43 (1996)
@ 1995 Massachusetts Institute of Technology
Figure 1: The output y(v) in a differential volume.

where the sum runs over all differential volumes in the lattice. Assume that the values y(v) can be computed independently for each differential volume. This means that we can independently minimize each of the terms of the sum. This is done by differentiating each term with respect to the output y(v) and equating the result to zero:
−2p(v)[1 − y(v)] + 2[1 − p(v)]y(v) = 0

From this expression we deduce p(v) = y(v), that is, the output y(v) that minimizes the error in the differential region centered at v is the a posteriori probability p(v). In this case the expected error is
p(v)[1 − y(v)]² + [1 − p(v)]y(v)² = p(v)[1 − p(v)]
and E_A becomes the expected variance of the output line for class A. Note that extending the above analysis to other kinds of error functions is straightforward. For example, if the error at the output is measured by −log[y(v)] when the desired output is 1 and −log[1 − y(v)] when it is 0, then the terms in the sum of expected differential errors have the form

−p(v) log[y(v)] − [1 − p(v)] log[1 − y(v)]
Differentiating and equating to zero we again find y(v) = p(v). This short proof also strongly underlines the two conditions needed for neural networks to produce a posteriori probabilities, namely perfect training and enough plasticity of the network, so as to be able to approximate the patch of probabilities given by the lattice of differential volumes and the values y(v), which we optimize independently of each other.
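The minimization argument is easy to check numerically. A minimal sketch (the grid of candidate outputs and the sample probabilities are arbitrary choices, not part of the note):

```python
import numpy as np

# For each a posteriori probability p, minimize the per-volume expected error
# E(y) = p*(1 - y)**2 + (1 - p)*y**2 over a fine grid of candidate outputs y.
ys = np.linspace(0.0, 1.0, 100001)
for p in (0.1, 0.37, 0.5, 0.9):
    E = p * (1 - ys) ** 2 + (1 - p) * ys ** 2
    y_min = ys[np.argmin(E)]
    # The minimizer is the probability itself, and the minimal error is p(1 - p).
    assert abs(y_min - p) < 1e-3
    assert abs(E.min() - p * (1 - p)) < 1e-6
print("quadratic error is minimized by y(v) = p(v)")
```

The same grid search applied to the cross-entropy terms above yields the identical minimizer y(v) = p(v).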
It is still possible to offer a simpler visual proof "without words" of the Bayesian property of classifier networks, as is done in Figure 1b. When training to produce 1 for the class A and 0 for A', we subject the function produced by the network to an "upward force" proportional to the derivative of the error function, i.e., [1 − y(v)], and the probability p(v), and a downward force proportional to y(v) and the probability [1 − p(v)]. Both forces are in equilibrium when p(v) = y(v).
References

Bourlard, H., and Morgan, N. 1993. Connectionist Speech Recognition. Kluwer Academic, Boston.

Richard, M. D., and Lippmann, R. P. 1991. Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Comp. 3(4), 461-483.
Received August 19, 1994; accepted April 5, 1995.
Communicated by Anthony Zador
Coding of Time-Varying Signals in Spike Trains of Integrate-and-Fire Neurons with Random Threshold
Recently, methods of statistical estimation theory have been applied by Bialek and collaborators (1991) to reconstruct time-varying velocity signals and to investigate the processing of visual information by a directionally selective motion detector in the fly's visual system, the H1 cell. We summarize here our theoretical results obtained by studying these reconstructions starting from a simple model of H1 based on experimental data. Under additional technical assumptions, we derive a closed expression for the Fourier transform of the optimal reconstruction filter in terms of the statistics of the stimulus and the characteristics of the model neuron, such as its firing rate. It is shown that linear reconstruction filters will change in a nontrivial way if the statistics of the signal or the mean firing rate of the cell changes. Analytical expressions are then derived for the mean square error in the reconstructions and the lower bound on the rate of information transmission that was estimated experimentally by Bialek et al. (1991). For plausible values of the parameters, the model is in qualitative agreement with experimental data. We show that the rate of information transmission and mean square error represent different measures of the reconstructions: in particular, satisfactory reconstructions in terms of the mean square error can be achieved only using stimuli that are matched to the properties of the recorded cell. Finally, it is shown that at least for the class of models presented here, reconstruction methods can be understood as a generalization of the more familiar reverse-correlation technique.

1 Introduction
In many animals auditory and visual stimuli are detected by the peripheral sensory system and processed centrally with an appreciable temporal resolution, ranging from typical reaction times of 30 msec in free-flying flies down to the discrimination of time delays in the microsecond range for sound location in barn owls and for echolocating bats (Land and Collett 1974; Simmons 1979; Moiseff and Konishi 1981). The coding of such information in neuronal spike trains and its subsequent elaboration are
subjects that are still poorly understood. While single neurons might transmit information primarily through their mean firing rate (Adrian 1928), it has also been emphasized that additional information could be encoded in the temporal structure of neuronal spike trains (see, for example, Poggio and Viernstein 1964; Miller and Sachs 1983; Optican and Richmond 1987; Bialek et al. 1991; Softky and Koch 1993; Middlebrooks et al. 1994). Collective phenomena in large groups of neurons such as synchronization or temporal oscillations in the correlation of multiple-unit firing rates might as well contribute to the processing of information in central nervous pathways (Eckhorn et al. 1988; Gray et al. 1989). But the relation between single cell codes and neuronal population codes remains unclear. In this context, it is of interest to study how much information the spike train of a single cell is able to carry about a time-varying stimulus. One approach to investigate this problem consists in presenting repeatedly to the animal a stimulus drawn from an ensemble with known probability distribution while recording the evoked spike trains from a single neuron. By applying methods of stochastic estimation theory (Wiener 1949; Saleh 1978; Poor 1994), it is then possible to compute a temporal filter h(t) that, when convolved with the spike train of the neuron in response to a stimulus s(t), will produce an estimate s_est(t) of s(t). In other words, it is possible to reconstruct part of the time-course of the stimulus from the spike train. Such methods have been introduced into neurobiology by Bialek and collaborators to study the transmission of information by peripheral sensory neurons in a variety of preparations (Bialek et al. 1991; Rieke et al. 1993). These authors have computed a lower bound on the rate of information transmitted by s_est(t) about s(t).
In the case of the visual system of the fly, a wide-field horizontal cell of the lobula plate (in the third-order optic ganglia of the fly), the H1 cell, is capable of transmitting at least 32 bit/sec of information about a time-varying velocity stimulus presented to the fly. The reconstructions reported by Bialek et al. (1991) and Rieke et al. (1993) raise several interesting questions. One would first like to know the biological significance of the filter h: it has been suggested that similar algorithms might be implemented at the level of single neurons to decode presynaptic spike trains (Bialek et al. 1991). A related problem is to clarify the connection to neuronal encoding mechanisms as well as to the underlying biophysics of single cells. For example, it is not clear how the properties of the recorded neuron, such as its firing rate, affect estimates of the filter h(t). One would also like to know the relationship between the rate of information transmission of the neurons, as estimated in previous studies, and the measure quantifying the quality of reconstructions, the mean square error. Several issues related to these questions have been addressed in earlier theoretical work (Bialek 1989; Bialek and Zee 1990; Bialek 1992), but the models considered did not allow an explicit calculation of the reconstruction filter h. We report here
Fabrizio Gabbiani and Christof Koch
that by using different models, it is possible to obtain a closed formula for the reconstruction filter and to compare theoretical results directly with earlier experimental work. More technical details and further theoretical results are contained in Gabbiani (1995). The rest of this paper is organized as follows: in Section 2, we describe the linear reconstruction of signals from spike trains and derive formulas for the rate of information transmission and the mean square error in the reconstructions. Section 3 introduces the class of models that we are studying, while in Section 4 we present results obtained with a simplified version of these models. Finally, we discuss our results in Section 5.

2 Linear Estimation of Time-Varying Signals from Neuronal Spike Trains
Let s_0(t) be a gaussian random stimulus with finite variance that is presented to the animal. In the experiments performed by Bialek et al. (1991), s_0(t) was the velocity of a random pattern presented to the fly: it consisted in a zero mean gaussian white noise signal with cut-off frequency f_c = 1000 Hz and a standard deviation of 132 deg/sec. Let x_0(t) = Σ_i δ(t − t_i) be the spike train recorded from the cell. We assume that s_0(t) is bandwidth limited with angular cut-off frequency ω_c = 2πf_c and that s_0(t) and x_0(t) are jointly (weakly) stationary. Let s̄_0 and x̄_0 be the mean values of s_0(t) and x_0(t), respectively. In the following we will consider the stimulus and spike train with their mean value subtracted,¹ s(t) = s_0(t) − s̄_0 and x(t) = x_0(t) − x̄_0. We write down a linear estimate s_est(t) of the stimulus s(t) given the spike train by setting

s_est(t) = ∫ dt_1 h(t_1) x(t − t_1).
The linear filter h is to be chosen in such a way as to minimize the mean square error between the stimulus and the estimate of the stimulus,

ε² = ⟨[s(t) − s_est(t)]²⟩,   (2.1)
where the brackets ⟨·⟩ mean average over the presented stimuli and recorded spike trains. If we define the cross-correlation between the stimulus and spike train,

R_sx(τ) = ⟨s(t) x(t + τ)⟩,
the autocorrelation function of the spike train,

R_xx(τ) = ⟨x(t) x(t + τ)⟩,

¹If the stimulus or the spike train sample functions contain other deterministic components, these need to be subtracted as well (see Wiener 1949, sect. 2.4).
and their Fourier transforms² through

S_sx(ω) = ∫ dτ R_sx(τ) e^{iωτ},   S_xx(ω) = ∫ dτ R_xx(τ) e^{iωτ},
it follows from the orthogonality principle (Poor 1994, sect. V.D.1) that the Fourier transform of the optimal linear filter h(t),

ĥ(ω) = ∫ dt h(t) e^{iωt},

is given by³

ĥ(ω) = S_sx(ω)/S_xx(ω).   (2.2)
In practice, the Fourier transform ĥ(ω) is approximated numerically by the discrete Fourier transform and S_sx(ω), S_xx(ω) are computed by replacing averages over the stimulus ensemble by time-averaging over a single sample of the ensemble (Oppenheim and Schaffer 1975; Rabiner and Gold 1975). Since the filter h is derived by minimizing the mean square error between the stimulus and estimated stimulus (see equation 2.1), the natural measure for the quality of reconstructions is the mean square error, ε². Experimentally, a numerically accurate estimate of ε² is obtained from

ε² ≈ (1/N) Σ_{k=1}^{N} (s_k − s_est,k)²,
where s_k and s_est,k are the sample points of the stimulus and estimated stimulus, respectively. To gain theoretical insight into the dependence of ε² on the choice of the stimulus ensemble, it is convenient to define the "noise" contaminating the reconstructions as the difference between s_est(t) and s(t),
n(t) = s_est(t) − s(t)  ⟺  s_est(t) = s(t) + n(t).
The mean square error in the reconstructions is then given by

ε² = (1/2π) ∫_{−ω_c}^{ω_c} dω S_nn(ω),

²S_xx(ω) is the power spectrum of the spike train.
³We do not impose a causality constraint on the filter h, since this requires solving the causal Wiener-Hopf equation (see Wiener 1949; Poor 1994, sect. V.D.2). Practically, h has a finite support in the time domain and causality could also be implemented by introducing a delay in the reconstructions (Rieke 1991).
where S_nn(ω) is the power spectrum of the noise. From the definition of the noise,

S_nn(ω) = S_ss(ω) − |S_sx(ω)|²/S_xx(ω).

In this latter equation, S_ss(ω) is the power spectrum of the stimulus ensemble. If we define the signal-to-noise ratio as

SNR(ω) = S_ss(ω)/S_nn(ω),

we may rewrite

ε² = (1/2π) ∫_{−ω_c}^{ω_c} dω S_ss(ω)/SNR(ω).   (2.3)
It follows from this latter equation that the mean square error in the reconstructions depends on the signal-to-noise ratio as well as on the bandwidth of the signal. The larger the signal-to-noise ratio, the smaller the mean square error. If, on the other hand, the signal-to-noise ratio is equal to 1 in some frequency band Δ,

SNR(ω) = 1   for ω ∈ Δ,
then the entire power of the signal in this frequency band contributes to the mean square error:
ε²_Δ = (1/2π) ∫_Δ dω S_ss(ω).   (2.4)

In the extreme case where the spike train is completely unrelated to the signal,
SNR(ω) = 1   for all ω, |ω| ≤ ω_c,
it follows that the mean square error coincides with the variance σ² of the signal,

ε² = (1/2π) ∫_{−ω_c}^{ω_c} dω S_ss(ω) = ⟨s(t)²⟩ = σ².
From this we conclude that the relative mean error, defined as

ε_r = ε/σ,   (2.5)

is an appropriate measure of the quality of the reconstructions in the time domain, with ε_r = 1 if the reconstructions are not better than chance and ε_r → 0 in the limit of perfect reconstructions. In earlier work on the subject, the reconstructions were quantified by using a different measure, the rate of information transmitted by s_est(t)
about s(t) (Bialek et al. 1991; Rieke et al. 1993). The quantity computed in Bialek et al. (1991, equation 2) and Rieke et al. (1993, equation 3) is not an exact estimate of the mutual information rate I(s_est; s)⁴ between s(t) and s_est(t) but rather a lower bound, I_LB,

I(s_est; s) ≥ I_LB,

and the true rate of information transmission I(s_est; s) could take any value greater than or equal to I_LB. It is possible to show that I_LB can be computed by using the simple formula,
I_LB = (1/(4π log 2)) ∫_{−ω_c}^{ω_c} dω log[SNR(ω)]   (in bit/sec).   (2.6)
This avoids altogether the computation of the gain g(ω) introduced in Bialek et al. (1991) and Rieke et al. (1993) and thus simplifies and improves the numerical estimate of I_LB.⁵ Complete arguments regarding these two assertions can be found in Gabbiani (1995, section 3 and appendix I). As mentioned in the Introduction, in the case of the fly H1 neuron I_LB = 32 bit/sec. The lower bound I_LB depends on the signal-to-noise ratio, but does not directly depend on the bandwidth of the signal. The frequency band

Ω = {ω | SNR(ω) > 1}

contributes to information transmission whereas no information is transmitted if SNR(ω) = 1. It is hence clear from equations 2.3 and 2.6 that the mean square error and I_LB represent different measures of the reconstructions. This will be further illustrated in the numerical example of Section 4. A lower bound on the rate of information transmission per spike, I_s, can be obtained by dividing by the firing rate λ of the cell:
I_s = I_LB/λ   (in bit/spike).
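Equation 2.6 is straightforward to evaluate numerically. As a sanity check, consider a hypothetical flat signal-to-noise ratio over the band; for constant SNR the integral reduces to f_c log₂[SNR] (the cutoff and SNR values below are arbitrary choices):

```python
import numpy as np

fc = 100.0                          # cutoff frequency in Hz (hypothetical)
wc = 2.0 * np.pi * fc
w = np.linspace(-wc, wc, 100001)    # angular frequency grid over [-wc, wc]

snr = np.full_like(w, 2.0)          # toy flat SNR(w) = 2 across the band
f = np.log(snr)
# Equation 2.6: I_LB = (1 / (4*pi*log 2)) * integral of log SNR(w) over the band,
# here computed with the trapezoid rule.
integral = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(w))
ILB = integral / (4.0 * np.pi * np.log(2.0))
print(ILB)                          # fc * log2(2) = 100 bit/sec
```

In practice SNR(ω) would come from the estimated spectra of the previous section rather than from an assumed flat profile.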
3 A Simplified Model of Motion Encoding in H1 Neurons
The lobula plate is a third-order neuropil in the visual system of the fly, which is believed to be a major motion computation center involved in visual course stabilization as well as visual fixation and discrimination of objects (Hausen and Egelhaaf 1989). In addition to small retinotopic elements, the lobula plate contains giant tangential cells that integrate motion signals over a large portion of the visual hemifields (Hausen 1984). H1 is a directionally selective cell belonging to the subclass of horizontal tangential neurons and is sensitive to horizontal back-to-front motion presented to the ipsilateral eye.⁶ It is believed that H1 is part of the neural circuitry underlying the optomotor response of flies (Eckert 1980), although its role is not clear since H1 is a heterolateral element connecting the two lobula plates and since it does not directly contact descending neurons that project to the motor centers of the thoracic ganglia (Strausfeld and Bassemir 1985). Nevertheless, the mean response of H1 has been shown to encode reliably the velocity contrast (or equivalently, contrast frequency contrast) of sinusoidal moving luminance patterns, at least at low modulation frequencies (Jian and Horridge 1991). The stochastic activity of H1 in response to steady motion of sinusoidal gratings has been described by an integrate-and-fire model with random threshold under a wide range of velocities⁷ (Gestri et al. 1980). This is illustrated in Figure 1A: the stimulus s(t) is first passed through a filter F to yield the somatic current q_s(t) (> 0) (possibly after addition of a background current q_0, which describes the spontaneous activity of the cell). This current is then integrated to give the somatic voltage y(t). When y(t) reaches the threshold k_0, an action potential is fired and the voltage is reset to zero after an absolute refractory period δ. The threshold is also reset to a new random value k_1 drawn from a given probability density distribution p(k). In such models, the probability distribution p(k) coincides up to constant factors with the interspike interval distribution of the cell in response to steady stimuli. It has been shown by Gestri et al. (1980) that p(k) varies with the mean firing rate of the cell (see Fig. 1B) and these authors report an absolute refractory period in the range of 2-6 msec.

⁴I(s_est; s) is itself a lower bound on the rate of information transmitted by the spike trains about the stimulus; see for example Rieke et al. (1993).
⁵The power spectrum of the stochastic noise process used in Bialek et al. (1991) and Rieke et al. (1993) can also be computed without introducing g(ω) (see Gabbiani 1995, appendix I).
The assumption of a random threshold distribution represents a convenient phenomenological description accounting for the trial-to-trial variability of H1 spike trains in response to the same stimulus, but it is not meant to imply that variability is due to the spike generating mechanism. Indeed, it is likely that a substantial portion of this noise is presynaptic to H1 (Laughlin 1989; Bialek et al. 1991). The integrate-and-fire model with random threshold has two further properties of interest:

1. The standard deviation σ of the interspike interval distribution (after subtraction of the absolute refractory period δ, if δ ≠ 0) varies linearly with the mean interspike interval of the cell under steady stimuli (i.e., the coefficient of variation of the interspike interval distribution is independent of the mean firing rate). This fact is experimentally supported for H1 (Gestri et al. 1980; see also Fig. 1B, inset).

⁶Each fly has two H1 neurons, one in each lobula plate.
⁷The directionally selective movement detector studied by Gestri et al. (1980) was not identified anatomically as being H1. However, the response properties of H1 to visual stimuli are unique among horizontal tangential cells and are hence in principle sufficient to identify it.
Figure 1: (A) Schematic diagram illustrating the integrate-and-fire model with random threshold. The first box represents the filtering of the input signal, which yields the positive current q_s(t) injected into the cell. This current is then added to a current q_0 that represents the background activity of the cell and, when passed through an integrator, yields the somatic voltage y(t). When y(t) reaches threshold a spike is generated, the voltage is reset to zero after an absolute refractory period δ, and the threshold is reset to a new random value. (B) Experimentally measured change of the interspike interval probability density distribution of the H1 neuron with increase in the mean interspike interval (adapted from Gestri et al. 1980). The absolute refractory period δ = 4 msec has been subtracted. The inset shows the standard deviation σ of the interspike interval distribution as a function of the mean interspike interval μ.
2. The mean firing rate of an integrate-and-fire neuron with random threshold is proportional to the somatic current⁸ q_s(t) (Gestri et al. 1980), so that the model can, in principle, take into account the responses to velocity contrast observed in H1 (Jian and Horridge 1991).

⁸This theoretical result is valid for any mean interspike interval μ > δ. In practice one expects it to hold true only if μ ≫ δ.

We will make the following assumptions.

Assumption 1. The filter F consists in a linear filtering by some filter K of the velocity signal s(t) followed by half-wave rectification,

q_s^+(t) = [∫ dt_1 K(t_1) s(t − t_1)]⁺   (3.1)

and

q_s^−(t) = [−∫ dt_1 K(t_1) s(t − t_1)]⁺,
with the positive currents q_s^±(t) each driving one fictive H1 neuron; furthermore, we set q_0 = 0. This implements the directional selectivity of the H1 cells. Using an "opponency" principle, the output spike train of the two cells is

x(t) = Σ_i δ(t − t_i^+) − Σ_i δ(t − t_i^−),   (3.2)
with {t_i^+} encoding positive velocity and {t_i^-} encoding negative velocity. To gain further insight into the meaning of the filter K, we study the response of the model to a step displacement of a pattern (that is, to a delta velocity pulse). If s(t) = δ(t - t_0) then

q_s^+(t) = K(t - t_0),   (3.3)

and, according to property 2, the mean firing rate of one H1 model neuron will be proportional to K(t - t_0). Hence, we may identify K(t) with the mean firing rate of H1 in response to a step displacement of a pattern at time t = 0. Such experiments have been performed on H1 (Srinivasan 1983) and it turns out that the response of the cell can be approximated (up to a pattern-dependent scale factor ρ) by an exponential low-pass

8This theoretical result is valid for any mean interspike interval μ > δ. In practice one expects it to hold true only if μ >> δ.
filter,

K(t) = ρ e^(-t/τ)  if t ≥ 0,   K(t) = 0  if t < 0,   (3.4)
with a time constant τ ≈ 200 msec. However, it has been subsequently shown that if H1 is allowed to adapt to nonzero velocities (or more generally to temporal luminance modulation), the time constant τ of the filter K shortens by more than one order of magnitude (Maddess and Laughlin 1985; de Ruyter van Steveninck et al. 1986; Borst and Egelhaaf 1987). To take this phenomenon into account we assume in the following that τ = 20 msec. Since H1 integrates spatially the response of elementary motion detectors in response to a pattern moving in front of the animal, its mean firing rate will not in general be proportional to a low-pass filtered version of the horizontal component of the velocity vector, as in our model [see equation 3.1 and property 2 of the integrate-and-fire neurons with random threshold; Borst et al. 1993]. Hence, the model analyzed here is not a valid approximation for an arbitrary pattern presented to the fly. However, Reichardt and Schlogl (1988) showed that the output of horizontal elementary motion detectors (computed by assuming a small time delay between the response of two adjacent receptors and by considering only linear terms in the distance between the receptors) will be proportional to the horizontal component of the velocity for a sufficiently smooth pattern having independently distributed luminance along the horizontal and vertical axis (such as for the random pattern used by Bialek et al. 1991). Hence, in this approximation and for such patterns, the spatially integrated response of motion detectors will be proportional to the horizontal component of the velocity, with a pattern-dependent proportionality constant (see equation 3.4). As explained in equation 3.3, we also assume a low-pass filtering of the instantaneous velocity in our model.
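The encoding front end of Assumption 1 together with the exponential filter can be sketched in a few lines. All parameter values below are illustrative rather than fitted, and the spike generator draws at most one spike per time bin, a standard discrete-time approximation of an inhomogeneous Poisson process with rates proportional to q_s^± (property 2):

```python
import math, random

def encode(s, dt, tau=0.020, rho=1.0, seed=0):
    """Filter the velocity trace s with K(t) = rho*exp(-t/tau) (t >= 0),
    half-wave rectify into q+ and q-, and draw two Poisson spike trains
    (one per fictive H1 neuron)."""
    rng = random.Random(seed)
    q = 0.0
    plus, minus = [], []
    for i, v in enumerate(s):
        # recursive update of the first-order low-pass filter K * s
        q = q * math.exp(-dt / tau) + rho * v * dt
        qp, qm = max(q, 0.0), max(-q, 0.0)   # half-wave rectification
        # instantaneous Poisson rates proportional to q+ and q- (property 2)
        if rng.random() < qp * dt:
            plus.append(i * dt)
        if rng.random() < qm * dt:
            minus.append(i * dt)
    return plus, minus

# a step displacement at t0 = 0.1 sec (a delta velocity pulse, eq. 3.3)
dt, T = 0.001, 0.5
s = [0.0] * int(T / dt)
s[int(0.1 / dt)] = 200.0 / dt    # pulse of area 200 (arbitrary units)
plus, minus = encode(s, dt)
print(len(plus), "positive spikes,", len(minus), "negative spikes")
```

For this pulse every spike of the "+" neuron falls after t_0 and its rate decays as K(t - t_0), while the "-" neuron stays silent, as stated below equation 3.3.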
The range of velocities and velocity modulations for which this approximation is expected to hold will in general be pattern-dependent; it has been investigated by Egelhaaf and Reichardt (1987) for HS cells in the case of sinusoidal gratings. To the best of our knowledge, no equivalent experiments have yet been performed on H1.

Assumption 2. We neglect the absolute refractory period (and set δ = 0 in the following).
Assumption 3. We assume the threshold probability density distribution to be exponentially distributed,

p(k) ∝ e^(-k).

Assumptions 2 and 3 represent a convenient first approximation of the response of H1 to velocity signals, since they allow us to compute analytically the optimal decoding filter h for random velocity stimuli, as we report in the next section.
4 Results
Under Assumptions 1 to 3 of the preceding section, the spike trains of our two model H1 neurons in response to a given velocity sample of the stimulus ensemble are equivalent to nonhomogeneous Poisson processes (Gestri 1971). This important simplification allows us to compute analytically the decoding filter ĥ(ω) in the frequency domain. If x(t) is given by equation 3.2, we obtain

ĥ(ω) = K̂(-ω) S_ss(ω) / (λ_K + |K̂(ω)|² S_ss(ω)),   (4.1)

where S_ss(ω) is the power spectrum of the stimulus and

λ_K = (2/π)^(1/2) [∫ (dω/2π) |K̂(ω)|² S_ss(ω)]^(1/2)

is the mean firing rate of both neurons (mean firing rate per neuron: λ_K/2). The signal-to-noise ratio in the reconstructions (see Section 2) can be computed from these formulas,

SNR(ω) = 1 + (1/λ_K) |K̂(ω)|² S_ss(ω),   (4.2)

and I_LB is then obtained using equation 2.6. For an exponential low-pass filter K(t) (see last section) the Fourier transform K̂(ω) is given by

K̂(ω) = ρτ / (1 - iωτ).

A result similar to equation 4.1 has been obtained for the first-order estimation of time-varying signals by saddle-point approximation in a model of a single neuron with an exponential dependence of the mean firing rate on the signal (Bialek 1989; Bialek and Zee 1990). From the formulas for ĥ(ω) and SNR(ω) several conclusions can be drawn.

1. The optimal linear reconstruction filter and the signal-to-noise ratio depend on the statistics of the stimulus in a nontrivial way. This implies that in our model, a change of the stimulus ensemble in the range of frequencies encoded by the cell will lead to a different filter h and to a different signal-to-noise ratio.

2. The optimal linear filter h and the signal-to-noise ratio depend on the firing rate of the neuron pair. This is most easily seen by scaling
the filter K by a positive constant η. If K̂(ω) → K̂^(η)(ω) = η K̂(ω), η > 0, then

λ_K^(η) = η λ_K   (4.3)

and

ĥ^(η)(ω) = η K̂(-ω) S_ss(ω) / (η λ_K + η² |K̂(ω)|² S_ss(ω)).   (4.4)

We see from this last equation that the relative weighting of the two factors λ_K and |K̂(ω)|² S_ss(ω) in the denominator depends on η (or equivalently on the firing rate of the neurons, see equation 4.3). This implies that as the mean firing rate of the neurons changes, the shape of the decoding filter h will change as well. The second term in equation 4.2 for the signal-to-noise ratio can be shown to depend linearly on the mean firing rate by a similar argument.

3. In the limit of low firing rates, the optimal linear decoding filter h is given by the reverse-correlation of the spike train and the stimulus. This follows from equation 4.4 since as η tends to zero,

ĥ^(η)(ω) → (1/λ_K^(η)) K̂^(η)(-ω) S_ss(ω) = (1/λ_K^(η)) S_sx(-ω),

so that, Fourier transforming back,

h^(η)(t) → (1/λ_K^(η)) R_sx(-t)   (η → 0),

where R_sx denotes the cross-correlation of the stimulus and the spike train.
This corresponds, up to a constant factor, to the reverse-correlation of the stimulus and the spike train. As illustrated in the following example, these three effects are clearly seen for mean firing rates that are expected to be in the physiological range.
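Conclusions 1 and 2 can be made concrete by tabulating equations 4.1 and 4.2 for band-limited white noise. In this sketch the stimulus parameters follow the numerical example below; the closed form used for λ_K (obtained by integrating |K̂(ω)|²S_ss(ω) over the band, so that λ_K is linear in the gain ρ, cf. equation 4.3) is our own rearrangement, and ρ is treated as a free gain used to set the firing rate:

```python
import math

TAU = 0.020                     # filter time constant, 20 msec
WC = 2 * math.pi * 1000.0       # stimulus cut-off, 1000 Hz (rad/sec)
SIGMA = 132.0                   # stimulus standard deviation (deg/sec)

def lambda_K(rho):
    """Mean firing rate of both neurons, λ_K = sqrt(2/π)·σ_K, with
    σ_K² = ∫ dω/2π |K̂(ω)|² S_ss(ω) for band-limited white noise."""
    sig_K2 = rho**2 * TAU * SIGMA**2 * math.atan(WC * TAU) / WC
    return math.sqrt(2.0 / math.pi) * math.sqrt(sig_K2)

def h_and_snr(w, rho):
    """Modulus of ĥ(ω) (eq. 4.1) and SNR(ω) (eq. 4.2)."""
    if abs(w) >= WC:
        return 0.0, 1.0                          # no stimulus power outside band
    Sss = math.pi * SIGMA**2 / WC                # flat stimulus power spectrum
    K2 = (rho * TAU)**2 / (1.0 + (w * TAU)**2)   # |K̂(ω)|²
    lam = lambda_K(rho)
    return math.sqrt(K2) * Sss / (lam + K2 * Sss), 1.0 + K2 * Sss / lam

# choose rho so that each neuron fires at about 100 Hz (λ_K = 200 Hz)
rho = 200.0 / lambda_K(1.0)
for f in (0.0, 10.0, 100.0, 500.0):
    h, snr = h_and_snr(2 * math.pi * f, rho)
    print(f"f = {f:6.1f} Hz   |h| = {h:9.3e}   SNR = {snr:6.2f}")
```

Rescaling ρ (and hence λ_K) changes not only the magnitude but also the shape of the tabulated filter, which is the content of conclusion 2.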
Numerical example. In Figure 2, a gaussian white noise stimulus with cut-off frequency fc = 1000 Hz and a standard deviation σ = 132 deg/sec (Bialek et al. 1991) as well as sample spike trains generated by two H1 model neurons in response to the stimulus are shown. The mean firing rate of each neuron was 100 Hz and the time constant of the exponential low-pass filter K was τ = 20 msec (see Section 3). Figure 3A-C shows the optimal decoding filters computed numerically and from equation 4.1 for the same stimulus and for model neurons firing at 5, 50, and 100 Hz. As is clearly seen, the shape of the optimal decoding filter varies with the firing rate. For comparison, Figure 3D shows the decoding filter obtained for two model neurons firing at the same frequency as in Figure 3C (100 Hz per neuron), but in response to a white noise signal having a
Figure 2: Sample spike trains generated by the model neurons in response to white noise. (A) The upper part of the graph shows the white noise velocity signal and the lower part the corresponding action potentials of two simulated, half-wave rectified neurons. Notice the occurrence of closely spaced action potentials (thicker lines) resulting from the absence of a refractory period in our model. (B) A portion of the same signal and spike train as in A is shown at a magnified time scale. (C) Properties of the white noise stimulus s(t). The upper graph shows the gaussian distribution of velocity (mean value: 0 deg/sec, standard deviation: 132 deg/sec), the middle graph the autocorrelation of the white noise, and the lower graph its one-sided power spectrum (cut-off frequency: 1000 Hz).
Figure 3: Comparison of optimal theoretical filters h(t) with those obtained from numerical simulations at different firing rates. In simulations, the filters were calculated as explained in Section 2; 100 sweeps of 1 sec each were used to compute S_sx and S_xx. The smooth curves are the theoretical predictions and were obtained by numerically Fourier transforming equation 4.1 to the time domain. (A, B, and C) Filters obtained at mean firing rates (λ_K/2) of 5, 50, and 100 Hz per neuron (cut-off frequency of the white noise signal, fc = 1000 Hz). The high-frequency noise in the numerical calculations decreases as the number of collected action potentials increases. Notice the progressive shortening of the integration window at negative times and the appearance of a negative velocity peak for t > 0. (D) Filter obtained with a white noise signal having a cut-off frequency fc = 100 Hz for two neurons firing each at a frequency of 100 Hz. The high-frequency noise due to the poor encoding of high frequencies by the neuron models in A-C has disappeared.
cut-off frequency of 100 Hz (standard deviation of the signal in the time domain σ = 132 deg/sec). As predicted by equation 4.1, the change in statistics of the signal leads to a different filter than in Figure 3C. Furthermore, the high-frequency noise contaminating the numerical filters of Figure 3A-C has disappeared (an explanation will be given below). The lower bound I_LB (in bit/sec) on the rate at which the two neurons transmit information about the stimulus and the relative mean error (ε_r, see Section 2) can be estimated analytically and numerically (Gabbiani 1995, Sect. 6). The analytical calculation leads to

ε_r = [1 - (γ/(τω_c √(1+γ))) arctan(τω_c/√(1+γ))]^(1/2)   (4.5)

and

I_LB = (1/(2π log 2)) [ω_c log((1 + γ + τ²ω_c²)/(1 + τ²ω_c²)) + (2√(1+γ)/τ) arctan(τω_c/√(1+γ)) - (2/τ) arctan(τω_c)],   (4.6)

where

γ = π²τλ_K / (2 arctan(τω_c)).
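Equations 4.5 and 4.6 can be evaluated directly; the sketch below plugs in the parameters of the numerical example (τ = 20 msec, f_c = 1000 Hz, λ_K/2 = 100 Hz, the values used in Figure 4) and recovers the relative mean error and information rate quoted there:

```python
import math

def eps_r(lam, tau, wc):
    """Relative mean error, eq. 4.5."""
    g = math.pi**2 * tau * lam / (2.0 * math.atan(tau * wc))  # gamma
    r = math.sqrt(1.0 + g)
    return math.sqrt(1.0 - g * math.atan(tau * wc / r) / (tau * wc * r))

def i_lb(lam, tau, wc):
    """Lower bound on the information rate (bit/sec), eq. 4.6."""
    g = math.pi**2 * tau * lam / (2.0 * math.atan(tau * wc))  # gamma
    r = math.sqrt(1.0 + g)
    x = tau * wc
    val = (wc * math.log((1.0 + g + x * x) / (1.0 + x * x))
           + (2.0 * r / tau) * math.atan(x / r)
           - (2.0 / tau) * math.atan(x))
    return val / (2.0 * math.pi * math.log(2.0))

tau, wc = 0.020, 2 * math.pi * 1000.0   # tau = 20 msec, f_c = 1000 Hz
lam = 200.0                             # 100 Hz per neuron
print("eps_r  =", round(eps_r(lam, tau, wc), 3))                # ≈ 0.98
print("I_LB/2 =", round(i_lb(lam, tau, wc) / 2.0, 1), "bit/sec")  # ≈ 48
```

Note that σ does not appear: as stated below, I_LB (and ε_r) are independent of the standard deviation of the white noise signal.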
Notice that in contrast to the mean square error ε², the lower bound I_LB does not depend on the standard deviation of the white noise signal in the time domain. Both ε_r and I_LB vary with the mean firing rate (λ_K) of the two neurons as well as with the cut-off frequency (ω_c) of the stimulus. From equations 4.5 and 4.6 it is easy to see that ε_r is a monotonically decreasing function of λ_K, whereas I_LB is a monotonically increasing function of λ_K. In the limit of low firing rates, λ_K → 0,

ε_r → 1 and I_LB → 0,

whereas for large firing rates, λ_K → ∞,

ε_r → 0 and I_LB → ∞.
The latter limit is, however, not an appropriate approximation for our system: ε_r and I_LB/2 (the rate of information transmitted by a single neuron) are plotted as a function of the firing rate per neuron (λ_K/2) for a stimulus having a standard deviation σ = 132 deg/sec and a cut-off frequency
Figure 4: Comparison of signal and reconstructions before and after filtering with a 5 msec half-width gaussian filter. (A) Original signal. (B) Same signal as in A, but smoothed using a 5 msec half-width gaussian filter. (C) Reconstructed signal; relative mean error ε_r = 0.98. (D) Superposition of the filtered signal and filtered reconstruction; relative mean error ε_r = 0.73. (E, F) Relative mean error and rate of information transmission of each neuron (I_LB/2, in bit/sec) as a function of the mean firing frequency per neuron (λ_K/2) for the particular stimulus ensemble shown in A (filled dots) and for an ensemble having the same standard deviation as in A, but a cut-off frequency of 100 Hz (squares). The choice of an appropriate stimulus bandwidth strongly decreases the mean square error without changing significantly I_LB.
fc = 1000 Hz in Figure 4E and F (filled dots). We obtain I_LB/2 ≈ 50 bit/sec when each model neuron fires at a frequency of 100 Hz. The relative mean error in the reconstructions is, however, very high, ε_r = 0.98, indicating a poor reconstruction: only 1 - 0.98 = 2% of the stimulus is effectively encoded in the spike trains of the cells. This is further illustrated in Figure 4A and C. As observed by Bialek et al. (1991), if we smooth the stimulus and reconstructions with a 5 msec half-width gaussian filter the quality of the reconstructions (as measured by ε_r) increases considerably to ε_r = 0.73 (see Fig. 4B and D). In our model, this is a consequence of the low-pass filtering by K of the stimulus s prior to the spike-generating mechanism. A time constant τ = 20 msec for the filter K implies that frequencies in the signal above 100 Hz are strongly attenuated and not well encoded by the two model neurons. Hence, although the neurons transmit a significant amount of information in the frequency band from 0 to 100 Hz, the quality of the reconstructions as measured by ε_r cannot be significantly better than chance, since the neurons transmit almost no information in the frequency band between 100 and 1000 Hz. This is in agreement with equations 2.3, 2.5, and 2.6 for ε_r and I_LB presented in Section 2. By smoothing the stimulus and reconstructions with a 5 msec half-width gaussian filter we strongly suppress the high frequencies in the stimulus ensemble, thereby revealing the improved accuracy of the reconstructions at low frequencies. As discussed in Section 3, a similar low-pass filtering is expected in the fly H1 cell and it has indeed been reported that in experimental reconstructions, the H1 neuron provides information about the stimulus mainly for frequencies below 25 Hz (Bialek et al. 1991, Fig. 3).
Similar or lower temporal cut-off frequencies in neurons responding to visual stimuli have also been observed in cat striate cortex and monkey area V1 (DeAngelis et al. 1993; Richmond et al. 1994). Finally, we plotted for the same firing rates in Figure 4E and F the relative mean error and lower bound on the information transmission rate for a white noise stimulus having the same standard deviation in the time domain as the previous one but with a cut-off frequency fc = 100 Hz (open squares). In both cases (fc = 100 and fc = 1000 Hz), peak signal-to-noise ratios increase linearly with mean firing rate from a minimum of 1.5 at a firing rate λ_K/2 = 10 Hz up to a maximal value of 16 for λ_K/2 = 100 Hz (see equation 4.2). The rate of information transmission remains almost unchanged compared to the previous example (as expected from equation 2.6, since the signal still covers the frequency range in which the model neurons effectively transmit information) but the quality of the linear reconstructions as measured by ε_r improves significantly. Clearly, a reduction in the bandwidth of the stimulus ensemble always leads to a reduction of the relative mean error ε_r (see equation 4.5, ε_r → 0 as ω_c → 0). However, as soon as the bandwidth of the stimulus becomes smaller than the frequency range in which the cell transmits information, I_LB decreases in parallel with ε_r (in the limit ω_c → 0, I_LB → 0 as well, see equation 4.6); this is not observed in the example of Figure 4E and F.
From equations 2.6 and 4.6 it is also possible to study the properties of the lower bound on the rate of information transmitted per spike, I_s = I_LB/λ_K. In the example shown in Figure 4F, I_s ranges from ~1 bit/spike (when each neuron fires at a rate λ_K/2 = 10 Hz) down to ~0.5 bit/spike (for λ_K/2 = 100 Hz); these values are similar to those observed experimentally in H1 (~0.75 bit/spike; Rieke 1991). As can be seen in this example, I_s is a monotonically decreasing function of the firing rate of the neurons. Furthermore, it can be shown that

I_s → 0   (λ_K → ∞)

and

I_s → π/(4 log 2) ≈ 1.13 bit/spike   (λ_K → 0).

This latter limit is independent of the statistical properties of the stimulus and of the filter K describing the linear processing of the signal prior to half-wave rectification and the spike mechanism. Hence, the model of a linear and half-wave rectifying neuron with exponentially distributed random threshold studied here cannot reproduce the higher values of I_s (~3 bit/spike) observed experimentally in other preparations (Rieke et al. 1993).
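Both limits, and the monotonic decrease of I_s with firing rate, can be checked numerically from the closed form of equation 4.6; a sketch with the Figure 4F stimulus parameters (τ = 20 msec, f_c = 1000 Hz):

```python
import math

def i_per_spike(lam, tau=0.020, wc=2 * math.pi * 1000.0):
    """I_s = I_LB / lambda_K (bit/spike) from eq. 4.6."""
    g = math.pi**2 * tau * lam / (2.0 * math.atan(tau * wc))  # gamma
    r = math.sqrt(1.0 + g)
    x = tau * wc
    ilb = (wc * math.log((1.0 + g + x * x) / (1.0 + x * x))
           + (2.0 * r / tau) * math.atan(x / r)
           - (2.0 / tau) * math.atan(x)) / (2.0 * math.pi * math.log(2.0))
    return ilb / lam

rates = (0.01, 20.0, 200.0, 2000.0)     # lambda_K in Hz
vals = [i_per_spike(l) for l in rates]
print([round(v, 3) for v in vals])      # monotonically decreasing
print("low-rate limit:", round(math.pi / (4 * math.log(2)), 3), "bit/spike")
```

As the firing rate grows, I_LB increases only logarithmically, so the information per spike decays toward zero; at vanishing rates it saturates at π/(4 log 2).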
5 Discussion
In this work, we analyzed and refined various aspects of the linear reconstruction method pioneered by Bialek and collaborators using analytical and numerical techniques. In brief, this method allows one to reconstruct the time-varying input signal from the spike train of a single neuron. We here argue that two different measures should be used to quantify the performance of this signal reconstruction method: the relative mean error, ε_r, and the lower bound on the rate of information transmitted by the estimated stimulus on the true stimulus, I_LB. To apply the methods discussed in this paper, the following assumptions need to be made: (1) both the stimulus as well as the spike train must be (weakly) stationary, (2) ensemble averages must be replaced by time averages in numerical calculations (that is, the spike trains of the neuron and the sample functions of the stimulus must be jointly ergodic), and (3) deterministic components in the stimulus or spike train samples need to be subtracted prior to the reconstruction (see footnote 1). In addition, the lower bound I_LB can be computed only by further assuming that the stimulus ensemble is gaussian and bandwidth limited. No such assumption needs to be made for the rest of the least-square signal reconstruction procedure. However, the "effective noise" introduced in Section 2 needs to be neither gaussian nor independent of the stimulus. Using these assumptions, we derived a simplified model of motion encoding by H1 neurons based on the one developed by Gestri et al.
(1980) to fit their experimental data. We also derived closed analytical formulas for the reconstruction filter in the frequency domain, the mean square error, the signal-to-noise ratio, and the lower bound on the rate of information transmission, and showed that for a plausible choice of parameters, our model reproduces qualitatively several aspects of the reconstruction experiments performed by Bialek et al. (1991). Because of the simplifications that were made, our model can by no means reproduce all of their results. For example, a meaningful comparison of the noise in the reconstructions with the noise at the photoreceptor level (see Fig. 3 of Bialek et al. 1991) would require modeling the elementary motion detection circuitry presynaptic to H1 as well as including a more realistic distribution of the noise that was placed at the spiking threshold mechanism in Section 3. Nevertheless, the assumptions underlying our model have been clearly isolated, allowing it to be completely understood. In Sections 2 and 4, the quality of reconstructions was assessed by computing the relative mean error ε_r (see equations 2.2, 2.5, and 4.5) in addition to the lower bound I_LB used in earlier works (Bialek et al. 1991; Rieke et al. 1993). Computing the relative mean error, as well as the rate of information transmission in different cases, allows us to study the significance of a measured rate of information transmission in terms of the reconstruction performance of the cell. Why use ε_r to characterize the performance of the reconstruction method? Two important properties of the relative mean error are (1) it can be computed under far less restrictive assumptions on the stimulus than I_LB and (2) it compares directly stimulus and reconstructions in the time domain.
This is, of course, the primary reason why the least-square metric is commonly used in engineering and other applications, where a signal and a reconstructed or corrupted version of this signal need to be compared. In contrast, as shown by equation 2.6 and by the numerical example of Section 4, I_LB is able to reveal effective information transmission in a given frequency band but it does not compare directly the presented stimulus with the reconstructed one. It is for example possible for a model neuron to transmit 50 bit/sec of information about a time-varying stimulus, yet to reproduce less than 3% of that stimulus (see Fig. 4A and C as well as E and F for the parameters σ = 132 deg/sec, fc = 1000 Hz, and λ_K/2 = 100 Hz). If, on the other hand, the cut-off frequency of the stimulus is chosen differently, the same model neuron firing at the same rate can now reproduce more than 20% of the stimulus for a similar value of I_LB (see Fig. 4E and F for the parameters σ = 132 deg/sec, fc = 100 Hz, and λ_K/2 = 100 Hz). Hence, in these examples I_LB does not determine which portion of a time-varying stimulus is encoded in the spike train of a cell in the mean square sense. Furthermore, if the stimulus ensemble contains significant power in a frequency range that is not encoded by the cell, this part of the power spectrum contributes entirely to the mean square error (see Section 2, equation 2.4 and Section 4, numerical example, fc = 1000 Hz). To determine which portion of a time-varying signal can be encoded by a cell in the mean square sense, it is therefore necessary to compute the experimental mean square error using equation 2.2 and to choose stimuli whose bandwidth is matched to the encoding possibilities of the recorded cell. Finally, if more natural stimulus ensembles are used to drive our model neurons, the rates of information transmission I_LB obtained are lower than those obtained using white noise (Gabbiani 1995). In spite of this, the relative mean error in signal reconstructions decreases compared to white noise stimuli, showing that a better performance is achieved. This shows that both I_LB and ε_r depend on the stimulus ensemble chosen to drive the cell and that while I_LB is likely to be overestimated by the choice of white noise stimuli, the performance of the cell (ε_r) is likely to be underestimated. The closed formula for the reconstruction filter of our model depends on the statistics of the stimulus ensemble as well as on the mean firing rate of the cell (see Section 4, properties 1 and 2). This theoretical result was shown to translate into changes of the reconstruction filter when the mean firing rate of the model was varied in the physiological range (see Fig. 3). We therefore expect that by using different stimulus ensembles in the range of frequencies encoded by the cell (for example, different bandwidths), one can demonstrate changes in the shape of experimental reconstruction filters.9 While in a natural environment the ensemble distribution of stimuli (such as velocity signals) encoded by a cell might not change significantly over time, the firing rate of a neuron is expected to change (with the mean contrast of the visual scene for example). This also implies in our model a change of the reconstruction filter. Hence, our results do not lend support to the idea that single synapses might serve as decoders of presynaptic spike trains, as suggested by Bialek et al.
(1991), since the decoding algorithm might depend on additional parameters of the stimulus ensemble or on biophysical properties such as the firing rate of a cell. However, our results cannot be regarded as conclusive in this respect because effects that were not directly taken into account here (such as firing rate adaptation or saturation) might play an important role in determining the shape of h. In the case of the H1 neuron, for example, it will be of interest to relate the changes (or the invariance) of h to the biophysical properties of H1 and its presynaptic elements. As demonstrated by property 3 of Section 4, the optimal reconstruction filter in the class of models considered here coincides with the reverse-correlation function in the limit of low firing rates. This rigorous result is of interest since it establishes a clear connection between reconstructions and the more traditional reverse-correlation method that has been extensively used to study the auditory and visual systems. It confirms the intuition that (in the limit of low firing rates) the optimal estimate of the stimulus preceding a spike is given by reverse-correlation. Furthermore, it offers an explanation for the observation that reverse-correlation can be a successful method of reconstruction in certain cases (Gielen et al. 1988). Finally, we wish to point out that the reconstructions performed on H1 in the house fly and on cells in other animals, as well as this theoretical work, leave totally open the important problem of determining whether the information on a time-varying stimulus that can be encoded in a neuronal spike train is actually used by the organism, i.e., the problem of correlating measures such as I_LB and ε_r with the behavior of the animal.

9Up to now, only changes in the spatial characteristics of the stimulus ensemble have been tested (see Bialek et al. 1991).
We would like to thank J. Fröhlich and K. Hepp for very useful discussions on the subject treated here. The comments given by W. Bialek and R. de Ruyter van Steveninck on this manuscript are also gratefully acknowledged. This work was supported by a grant of the Roche Research Foundation and in part by the Center for Neuromorphic Systems Engineering as a part of the National Science Foundation Engineering Research Center Program, and by the California Trade and Commerce Agency, Office of Strategic Technology.
Adrian, E. 1928. The Basis of Sensation: The Action of the Sense Organs. Christophers, London.
Bialek, W. 1989. Theoretical physics meets experimental neurobiology. In Lectures in Complex Systems, SFI Studies in the Sciences of Complexity, Vol. II, E. Jen, ed., pp. 513-595. Addison-Wesley, New York.
Bialek, W. 1992. Optimal signal processing in the nervous system. In Princeton Lectures on Biophysics, pp. 321-401. World Scientific, Singapore.
Bialek, W., and Zee, A. 1990. Coding and computation with neural spike trains. J. Stat. Phys. 59(1), 103-115.
Bialek, W., de Ruyter van Steveninck, R., and Warland, D. 1991. Reading a neural code. Science 252, 1854-1857.
Borst, A., and Egelhaaf, M. 1987. Temporal modulation of luminance adapts time constant of fly movement detectors. Biol. Cybern. 56, 209-215.
Borst, A., Egelhaaf, M., and Seung, H. 1993. Two-dimensional motion perception in flies. Neural Comp. 5, 856-868.
DeAngelis, G., Ohzawa, I., and Freeman, R. 1993. Spatiotemporal organization of simple-cell receptive fields in the cat's striate cortex. I. General characteristics and postnatal development. J. Neurophysiol. 69, 1091-1117.
de Ruyter van Steveninck, R., Zaagman, W., and Mastebroek, H. 1986. Adaptation of transient responses of a movement-sensitive neuron in the visual system of the blowfly Calliphora erythrocephala. Biol. Cybern. 54, 223-236.
Eckert, H. 1980. Functional properties of the H1-neuron in the third optic ganglion of the blowfly Phaenicia. J. Comp. Physiol. A 135, 29-39.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121-130.
Egelhaaf, M., and Reichardt, W. 1987. Dynamic response properties of movement detectors: Theoretical analysis and electrophysiological investigation in the visual system of the fly. Biol. Cybern. 56, 69-87.
Gabbiani, F. 1995. Coding of time-varying signals in spike trains of linear and half-wave rectifying neurons (submitted).
Gestri, G. 1971. Pulse frequency modulation in neural systems, a random model. Biophys. J. 11, 98-109.
Gestri, G., Mastebroek, H. A. K., and Zaagman, W. H. 1980. Stochastic constancy, variability and adaptation of spike generation: Performance of a giant neuron in the visual system of the fly. Biol. Cybern. 38, 31-40.
Gielen, C., Hesselmans, G., and Johannesma, P. 1988. Sensory interpretation of neural activity patterns. Math. Biosci. 88, 15-35.
Gray, C., Konig, P., Engel, A., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Hausen, K. 1984. The lobula-complex of the fly: Structure, function and significance in visual behavior. In Photoreception and Vision in Invertebrates, M. Ali, ed., pp. 523-559. Plenum Press, New York.
Hausen, K., and Egelhaaf, M. 1989. Neural mechanisms of visual course control in insects. In Facets of Vision, D. Stavenga and R. Hardie, eds., pp. 391-424. Springer-Verlag, Berlin.
Jian, S., and Horridge, G. 1991. The H1 neuron measures changes in velocity irrespective of contrast frequency, mean velocity or velocity modulation frequency. Phil. Trans. R. Soc. London B 331, 205-211.
Land, M., and Collett, T. 1974.
Chasing behavior of houseflies (Fannia canicularis). A description and analysis. J. Comp. Physiol. 89, 331-357.
Laughlin, S. 1989. Coding efficiency and design in visual processing. In Facets of Vision, D. Stavenga and R. Hardie, eds., pp. 213-234. Springer-Verlag, Berlin.
Maddess, T., and Laughlin, S. B. 1985. Adaptation of the motion-sensitive neuron H1 is generated locally and governed by contrast frequency. Proc. R. Soc. London B 225, 251-275.
Middlebrooks, J., Clock, A., Xu, L., and Green, D. 1994. A panoramic code for sound location by cortical neurons. Science 264, 842-844.
Miller, M., and Sachs, M. 1983. Representation of stop consonants in the discharge patterns of auditory nerve fibers. J. Acoust. Soc. Am. 74(2), 502-517.
Moiseff, A., and Konishi, M. 1981. Neuronal and behavioral sensitivity to binaural time differences in the owl. J. Neurosci. 1, 40-48.
Oppenheim, A., and Schafer, R. 1975. Digital Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.
Optican, L., and Richmond, B. 1987. Temporal encoding of two-dimensional
patterns by single units in primate inferior temporal cortex. J. Neurophysiol. 57, 162-178.
Poggio, G., and Viernstein, L. 1964. Time series analysis of impulse sequences of thalamic somatic sensory neurons. J. Neurophysiol. 27, 517-545.
Poor, H. 1994. An Introduction to Signal Detection and Estimation. Springer-Verlag, New York.
Rabiner, L., and Gold, B. 1975. Theory and Application of Digital Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.
Reichardt, W., and Schlogl, W. 1988. A two dimensional field theory for motion computation. First order approximation: Translatory motion of rigid patterns. Biol. Cybern. 60, 23-35.
Richmond, B., Heller, J., and Hertz, J. 1994. Neural response structure and dynamics of single neuronal information transmission in the visual system. In Proceedings of the International Symposium on Dynamics of Neural Processing, pp. 89-92. Washington, DC, June 6-8.
Rieke, F. 1991. Physical principles underlying sensory processing and computation. Ph.D. thesis, University of California at Berkeley, Berkeley, CA.
Rieke, F., Warland, D., and Bialek, W. 1993. Coding efficiency and information rates in sensory neurons. Europhys. Lett. 22(2), 151-156.
Saleh, B. 1978. Photoelectron Statistics. Springer-Verlag, Berlin.
Simmons, J. 1979. Perception of echo phase information in bat sonar. Science 204, 1336-1338.
Softky, W., and Koch, C. 1993. The irregular firing of cortical cells is inconsistent with temporal integration of random EPSP's. J. Neurosci. 13, 334-350.
Srinivasan, M. 1983. The impulse response of a movement-detecting neuron and its interpretation. Vis. Res. 23(6), 659-663.
Strausfeld, N., and Bassemir, U. 1985. The organization of giant horizontal-motion-sensitive neurons and their synaptic relationships in the lateral deutocerebrum of Calliphora erythrocephala and Musca domestica. Cell Tissue Res. 242, 531-550.
Wiener, N. 1949.
Extrnpolntion, lnterpolntiori arid Smoothing of Stntioimry Tiirre Series. John Wiley & Sons, New York.
Received August 2, 1994; accepted March 31, 1995.
Communicated by William Bialek
A Simple Spike Train Decoder Inspired by the Sampling Theorem

David A. August and William B. Levy
Department of Neurosurgery, University of Virginia, Charlottesville, VA 22908 USA
Reconstructing a time-varying stimulus estimate from a spike train (Bialek's "decoding" of a spike train) has become an important way to study neural information processing. In this paper, we describe a simple method for reconstructing a time-varying current injection signal from the simulated spike train it produces. This technique extracts most of the information from the spike train, provided that the input signal is appropriately matched to the spike generator. To conceptualize this matching, we consider spikes as instantaneous "samples" of the somatic current. The Sampling Theorem is then applicable, and it suggests that the bandwidth of the injected signal not exceed half the spike generator's average firing rate. The average firing rate, in turn, depends on the amplitude range and DC bias of the injected signal. We hypothesize that nature faces similar problems and constraints when transmitting a time-varying waveform from the soma of one neuron to the dendrite of the postsynaptic cell.

1 Introduction
Recently, Bialek and colleagues have popularized the "decoding" approach for studying neuronal information processing (Bialek et al. 1991; Bialek and Rieke 1992; de Ruyter van Steveninck and Bialek 1988). To describe their method generically, let s(t) be a time-varying scalar representing a stimulus. In response to this stimulus, the neuron emits a sequence of impulses, {t_i}, which mark each time of spike generation. A decoding filter, h(t), is then applied to the spike train, to obtain a new signal, ŝ(t), which is a continuous, time-varying estimate of the stimulus. The closer ŝ(t) is to s(t), the better the spike train has preserved the stimulus information. This decoding approach has sparked a renewed interest in neural coding, in part because it offers an excellent way for the experimenter to extract much of the information from a spike train. Here, we propose another decoding technique. As an alternative to the methods of Bialek and colleagues, this method takes into account the relationship between cell

Neural Computation 8, 67-84 (1996)
© 1995 Massachusetts Institute of Technology
firing and signal bandpass, the role of short-term synaptic modifiability, and a limitation to simple postsynaptic filtering functions.

In previous experiments, s(t) was an environmental stimulus, external to the organism. However, in this paper, we asked how information might be transmitted between two monosynaptically connected neurons. Thus, s(t) represented a time-varying current injected into the presynaptic cell. At the postsynaptic site, this information would be decoded (by conductance events), and interpolated into a continuous (current or voltage) signal. Our estimated signal, ŝ(t), would thus correspond to the subsynaptic waveform.

Briefly, the decoding method consisted of three steps. First, a current waveform, s(t), was injected into a model spike generator, which produced a sequence of interspike intervals (ISIs). Next, these ISIs were translated into a sequence of estimated amplitudes ("samples") of s(t) by a decoding function based on a natural presynaptic process. Finally, these samples were linearly interpolated into a continuous reconstruction, ŝ(t).

The decoding method proposed here was quite successful, provided that s(t) was appropriately matched to the spike generator. To understand what is meant by appropriate matching, we must consider spikes as samples, as above. The Sampling Theorem (Shannon 1949; Nyquist 1928; Whittaker 1915) then implies that to avoid aliasing and to produce the best reconstructions, the stimulus bandwidth should be less than half the average sampling rate (firing rate) of the spike generator. Given a particular input signal bandwidth, appropriate adjustments of the signal's DC offset and its amplitude range could produce this matching.

In what follows, we first describe the reconstruction method. Next, we show how the quality of reconstructions is affected by departures from the ideal conditions of the Sampling Theorem, as well as by the stimulus bandwidth and amplitude range.
Finally, we discuss how this method relates to other studies, and the implications for experiments designed to measure information transmission between neurons.

2 Methods
2.1 Biological and Theoretical Considerations. The reconstruction technique proposed in this paper was based on linear interpolation, lowpass-filtered signals with restricted bandwidths, and nonlinear filtering. Linear interpolation was motivated by the desire to restrict memory to just the last spike. Low bandwidths (≤100 Hz) were used because somatic signals (which we assume arise from passive dendritic filtering of synaptic inputs) would be bandlimited to approximately 25-200 Hz by typical membrane time constants (5-10 msec). Nonlinear filtering was used in correspondence to the nonlinearity of paired-pulse facilitation (PPF) observed at some synapses (Zucker 1989; Katz and Miledi 1968). PPF, a decreasing function of ISI, acts as a nonlinear filter by decoding
each spike into a different postsynaptic conductance, depending on the position of the most recent spike.¹

With respect to the three points above, previous studies have differed from the method proposed here. First, previous work used interpolating filters with characteristic nonlinear (biphasic or triphasic) shapes, extending over several interspike intervals (Bialek et al. 1991; Theunissen 1993). To implement such functions, each spike would have to trigger stereotyped event patterns (e.g., fast-EPSP followed by slow-IPSP for a biphasic shape). It remains to be seen whether the proper, stable EPSP/IPSP patterns are generally available at appropriate synapses in the brain, although the implications of this possibility have been noted by several authors (Bialek et al. 1991; Sakuranaga et al. 1987). Second, previous studies have employed input bandwidths of 500-1000 Hz (Rieke et al. 1992, 1993; Bialek et al. 1991), which were several times higher than average firing rates in most brain regions. While these sensory signals would certainly be lowpass-filtered by (nonauditory) sensory receptors, the filtered bandwidths, in relation to average firing rates, have not been measured. Finally, most previous studies (with the exception of an analysis of bullfrog sacculus data by Rieke et al. 1992) have used linear, rather than nonlinear, filtering.

2.2 Reconstruction Method. We used GENESIS (Bower and Beeman 1995) to simulate the injection of random, time-varying current signals into a biophysically modeled soma. The model, a one-compartment cylindrical soma with sodium and delayed-rectifier potassium channels, is described in Tables 1 and 2. These channel parameters were based on the Hodgkin-Huxley squid axon model (Hodgkin and Huxley 1952). The simulation timestep was Δt = 30.51758 μsec, and the duration of each signal was 1 sec (or 2^15 = 32,768 points).
Current injection signals had the form s(t) = i_0 + a·i(t), where i_0 was the DC bias, a was the amplitude scaling factor, and i(t) was a bandlimited gaussian noise signal. The signal i(t) was created by lowpass-filtering 32,768 samples of uncorrelated zero-mean, unit-variance gaussian white noise, and then rescaling this signal to the [-1, 1] interval. Filtering was done in the discrete frequency domain, by setting Fourier coefficients above the desired bandwidth to zero. Since each signal was generated

¹Linear interpolation is like convolving each sample with a triangle-shaped filter, centered around each spike, and extending to the two nearest neighboring spikes. However, because of the PPF-like effect, each triangle has a different height. Thus, convolving the spike train with different-sized triangle functions is actually a nonlinear filtering operation. Thus, in the terminology of this study, the word "linear" has been used in two ways. The triangle-shaped decoding filter was a nonlinear filter. That is, its shape, when centered over the current spike, t_i, actually depended upon spikes located at t_{i-1} and t_{i+1}, violating the superposition property of linear systems. However, the resulting reconstruction, ŝ(t), was actually a piecewise linear function of time. Most other studies have used linear filtering of spike trains to produce nonlinear reconstruction functions (e.g., Bialek et al. 1991; Warland et al. 1992; Theunissen 1993).
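The stimulus-generation procedure just described (frequency-domain lowpass filtering, rescaling to [-1, 1], then applying the bias and scale) can be sketched in a few lines. This is our own illustration, not the authors' GENESIS code; the function name and defaults are ours, with the default values taken from the 35-435 nA experiments described later.

```python
import numpy as np

def make_stimulus(n=32768, dt=30.51758e-6, bandwidth=40.0,
                  i0=235e-9, a=200e-9, seed=0):
    """Bandlimited gaussian current s(t) = i0 + a*i(t), with i(t) in [-1, 1]."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n)        # zero-mean, unit-variance white noise
    spec = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n, d=dt)
    spec[freqs > bandwidth] = 0.0         # zero Fourier coefficients above cutoff
    i_t = np.fft.irfft(spec, n)
    i_t /= np.abs(i_t).max()              # rescale to the [-1, 1] interval
    return i0 + a * i_t
```

With the defaults above, the returned signal spans the 35-435 nA range, and a different `seed` yields a different realization of the same bandlimited gaussian process.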
Table 1: Model Parameters.ᵃ

  Parameter   Description                    Value
  d           Cylinder diameter              500 μm
  l           Cylinder length                500 μm
              Membrane resistivity           40,000 Ω·cm²
              Membrane capacitance           1 μF/cm²
              Resting potential              -60 mV
              Sodium reversal potential      55 mV
              Potassium reversal potential   -72 mV
  ḡ_Na        Sodium channel density         120 mS/cm²
  ḡ_K         Potassium channel density      36 mS/cm²

ᵃThe reversal potentials and channel densities correspond to the original Hodgkin-Huxley model for a squid motor axon.
Table 2: Channel Parameters for Na and K Channels in the Hodgkin-Huxley Spike Generator.ᵃ

  Parameter   A                B       C     D            F
  α_m         0.1(E_r + 25)   -0.1    -1    -(E_r + 25)  -10
  β_m         4                0       0    -E_r          18
  α_h         0.07             0       0    -E_r          20
  β_h         1                0       1    -(E_r + 30)  -10
  α_n         0.01(E_r + 10)  -0.01   -1    -(E_r + 10)  -10
  β_n         0.125            0       0    -E_r          80

ᵃTo be consistent with GENESIS, each equation for α or β is written with five parameters, indicating the functional form (A + BV)/(C + e^((V+D)/F)). Here, E_r is the resting potential, -60 mV. The sodium and potassium channel conductances are given by g_Na = ḡ_Na m³h and g_K = ḡ_K n⁴, respectively, where each gating variable u = m, h, or n obeys the equation du/dt = α_u(1 - u) - β_u u.
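The five-parameter rate function of Table 2 can be evaluated directly in code. The sketch below is our own illustration (the function names are ours); it reproduces a few rows of the table, which at V = E_r reduce to the classical Hodgkin-Huxley rates, e.g., β_m(E_r) = 4 and α_m(E_r) = 2.5/(e^2.5 - 1) ≈ 0.224.

```python
import math

E_R = -60.0  # resting potential (mV), from Table 1

def rate(V, A, B, C, D, F):
    """GENESIS-style rate function (A + B*V)/(C + exp((V + D)/F)), V in mV."""
    return (A + B * V) / (C + math.exp((V + D) / F))

# A few rows of Table 2 (rates in 1/msec)
def alpha_m(V): return rate(V, 0.1 * (E_R + 25), -0.1, -1.0, -(E_R + 25), -10.0)
def beta_m(V):  return rate(V, 4.0, 0.0, 0.0, -E_R, 18.0)
def alpha_h(V): return rate(V, 0.07, 0.0, 0.0, -E_R, 20.0)
def beta_h(V):  return rate(V, 1.0, 0.0, 1.0, -(E_R + 30), -10.0)
```

Because every voltage in Table 2 is expressed relative to E_r, shifting E_r shifts the rate curves rigidly along the voltage axis without changing their shapes.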
with a different random number seed, the signals s(t) (for a given i_0, a, and bandwidth W) were realizations of a bandlimited gaussian random process. We chose a gaussian distribution because, by the central limit theorem, the dendritic filtering of many synaptic inputs would be expected to approach a gaussian at the soma.

The decoding function, which translated ISIs into estimates of the stimulus, was constructed empirically in the following manner. First, 10 different gaussian signals were injected into the spike generator. From the 10 resulting spike trains, each ISI (ISI_i = t_i - t_{i-1}), as well as each instantaneous current injection, s(t_i), was recorded. These pairs of (ISI_i, s(t_i)) produced a scatterplot, like the one shown in Figure 2. The inverting function was obtained by fitting this scatter of points, using Mathematica, to a third-degree polynomial f(ISI) = c_0 + c_1/ISI + c_2/ISI² + c_3/ISI³.

The reconstruction method was tested by injecting novel signals, generated with different random seeds, into the model spike generator. The inverting function transformed each new spike train into a sequence of "samples" ŝ(t_i) = f(ISI_i) corresponding to the instantaneous current at each spike time. To create a continuous signal, ŝ(t), these samples were linearly interpolated. That is, for t_{i-1} ≤ t ≤ t_i,

    ŝ(t) = ŝ(t_{i-1}) + [ŝ(t_i) - ŝ(t_{i-1})] (t - t_{i-1})/(t_i - t_{i-1}).    (2.1)

The reconstruction error between the original and estimated signals was then quantified using relative root-mean-square error (rRMSE),

    rRMSE = [⟨(s(t) - ŝ(t))²⟩ / ⟨(s(t) - ⟨s(t)⟩)²⟩]^(1/2),    (2.2)

and, in dB, signal-to-error ratio (SER), defined as SER = -20 log₁₀(rRMSE), where ⟨ ⟩ denotes a time average.

2.3 Relationship to Sampling Theorem. The Sampling Theorem states that a signal bandlimited to W Hz, uniformly sampled at a rate of 1/T, can be exactly reconstructed from the samples, provided that (1/T) > 2W. The reconstruction is obtained from

    s(t) = Σ_n s(nT) sinc((t - nT)/T),    (2.3)

where sinc(x) = sin(πx)/(πx). The just-described neuronal ("scatterplot") method reconstructs a continuous signal from the samples ŝ(t_i) = f(t_i - t_{i-1}) according to the formula

    ŝ(t) = Σ_i ŝ(t_i) Λ(t - t_i; t_{i-1}, t_{i+1}),    (2.4)

where Λ( ) is the triangle function,

    Λ(t; t_{i-1}, t_{i+1}) = 1 + t/a for -a ≤ t < 0;  1 - t/b for 0 ≤ t < b;  0 otherwise,    (2.5)

with widths a = t_i - t_{i-1} and b = t_{i+1} - t_i.

Clearly, these two reconstruction techniques are conceptually similar. However, the neuronal ("scatterplot") method differed from the Sampling Theorem in three important ways: the scatterplot method used nonuniform, rather than uniform, sampling; linear, rather than nonlinear, interpolation; and estimated, rather than exact, sample amplitudes.
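Putting the inverting polynomial, the linear interpolation of equation 2.1, and the error measures of equation 2.2 together, the decoding step can be sketched as follows. This is our own minimal illustration; the polynomial coefficients passed in are placeholders, not the fitted values from the experiments.

```python
import numpy as np

def decode(spike_times, coeffs, t_grid):
    """Map each ISI to an estimated sample via the inverting polynomial
    f(ISI) = c0 + c1/ISI + c2/ISI**2 + c3/ISI**3, assign it to the spike
    ending that ISI, and linearly interpolate between samples (eq. 2.1)."""
    t = np.asarray(spike_times, dtype=float)
    isi = np.diff(t)                                   # ISI_i = t_i - t_{i-1}
    c0, c1, c2, c3 = coeffs
    samples = c0 + c1/isi + c2/isi**2 + c3/isi**3      # s_hat(t_i) = f(ISI_i)
    return np.interp(t_grid, t[1:], samples)

def rrmse(s, s_hat):
    """Relative RMS error (eq. 2.2): RMS error normalized by the RMS
    fluctuation of the stimulus about its mean."""
    s, s_hat = np.asarray(s, float), np.asarray(s_hat, float)
    return np.sqrt(np.mean((s - s_hat)**2) / np.mean((s - s.mean())**2))

def ser_db(s, s_hat):
    """Signal-to-error ratio in dB: SER = -20 log10(rRMSE)."""
    return -20.0 * np.log10(rrmse(s, s_hat))
```

For a perfectly regular spike train, f(ISI) is constant and the reconstruction is flat, which makes clear why a stimulus must modulate the ISIs to be recoverable at all.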
Table 3: Comparing Different Reconstruction Methods.ᵃ

  Method   Sampling     Interpolation   Sample       Difference(s) from
           interval     method          amplitudes   Sampling Theorem
  SC       Nonuniform   Linear          Estimated    NUS, LI, EST
  1        Uniform      Linear          Estimated    LI, EST
  2        Nonuniform   Linear          Exact        NUS, LI
  3        Nonuniform   Nonlinear       Estimated    NUS, EST
  4        Uniform      Linear          Exact        LI
  5        Uniform      Nonlinear       Estimated    EST
  6        Nonuniform   Nonlinear       Exact        NUS
  ST       Uniform      Nonlinear       Exact        none

ᵃThe scatterplot reconstruction method (SC) differed from the Sampling Theorem (ST) in three ways: nonuniformly spaced samples (NUS), linear interpolation (LI), and estimated sample amplitudes (EST). To evaluate the contribution of each of these error sources individually and in pairs, six additional reconstruction methods were employed. Methods 4, 5, and 6 differ from the Sampling Theorem in just one way, while methods 1, 2, and 3 differ in two ways.
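For reference, the uniform-sampling, sinc-interpolation step of method ST (eq. 2.3) can be sketched as follows. This is our own illustration, not the paper's implementation; note that NumPy's `sinc` is already the normalized sin(πx)/(πx).

```python
import numpy as np

def sinc_reconstruct(samples, T, t_grid):
    """Sampling Theorem reconstruction (eq. 2.3):
    s(t) = sum_n s(nT) sinc((t - nT)/T), with sinc(x) = sin(pi x)/(pi x)."""
    samples = np.asarray(samples, dtype=float)
    n = np.arange(len(samples))
    # One sinc kernel centered on each uniform sample time nT.
    kernel = np.sinc((np.asarray(t_grid, dtype=float)[None, :] - n[:, None] * T) / T)
    return samples @ kernel
```

Evaluated at the sample times themselves, the sinc kernels reduce to an identity and the samples are recovered exactly; between sample times, the sum interpolates. In practice, finite-length signals truncate the slowly decaying sinc tails, which is one of the error floors acknowledged for method ST.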
2.4 Analysis of Reconstruction Error. We wished to compare the error of our neuronally inspired reconstruction method to the theoretical lowest possible error predicted by the Sampling Theorem. However, because our procedure departed from the theorem in three respects, a direct comparison was not meaningful. Therefore, we implemented six other reconstruction methods, as shown in Table 3. Each of these methods differed from the Sampling Theorem in either one or two respects. Thus, for example, comparing methods 4, 5, or 6 to the Sampling Theorem (ST) demonstrated the individual effects of linear interpolation, estimated sample amplitudes, and nonuniform sampling, respectively. Comparing methods 1, 2, or 3 to the Sampling Theorem showed the combined effect of pairs of these error sources. Finally, comparing the scatterplot method (SC) to the Sampling Theorem illustrated the combined effect of all three error sources. The reconstruction methods are named and described as follows:

(ST) A direct implementation of the Sampling Theorem, in which the original signal was sampled uniformly (at the spike train's average firing rate) and these sample points were interpolated with sinc-functions. The resulting error was the lowest possible, given the limitations of finite-length signals, non-brickwall filters, and finite-precision calculations.

(1) Linear interpolation of uniformly spaced, estimated samples. In this method, the exact sample amplitudes from the method above were corrupted by additive, zero-mean, gaussian white noise. The variance of this noise, 72.25 nA², was estimated by averaging the variance from 25 different ISI bins along the scatterplot, each 0.1 msec in width.

(2) A method in which exact nonuniformly spaced samples, taken from the original signal, were linearly interpolated. As sample points, we could have chosen the actual spike times produced by a current injection. However, this would have led to oversampling of the waveform peaks (when the spike generator sped up) and undersampling of the waveform troughs. To avoid this bias, but still retain the same ISI distribution, we simply used the spike times produced by one current injection, s_1(t), to sample a different signal, s_2(t).

(3) A reconstruction technique for nonlinear interpolation of nonuniformly spaced samples based on Yen's method (Yen 1956), which used variable sinc-functions. In this case, estimated, rather than exact, sample amplitudes were used. Again, these corrupted amplitudes were produced by adding N(0, 72.25) noise to the original amplitudes.²

(4) A method in which exact, uniformly spaced samples were connected by straight lines, rather than sinc-functions.

(5) The Sampling Theorem applied to uniformly spaced sample amplitudes, corrupted by N(0, 72.25) noise.

(6) Yen's method applied to nonuniformly spaced, exact sample amplitudes, taken directly from the original signal at the specified times.

The rRMSE of the scatterplot method (see Table 3) was denoted E_SC; the rRMSE of the Sampling Theorem implementation was denoted E_ST; the other rRMSEs were denoted E_1, E_2, ..., E_6, corresponding to the six reconstruction methods above. To interpret the total error, E_SC, in the context of the Sampling Theorem, each of the above reconstruction errors was first normalized by E_SC. Then, the closer a given normalized error ratio was to 1, the more of E_SC was explained by that particular departure (or departures) from the Sampling Theorem.

3 Results

3.1 Steady-State Behavior.
Before injecting time-varying current signals, we studied the steady-state behavior of the Hodgkin-Huxley model by constructing a frequency-intensity (f/I) curve, shown in Figure 1. Consistent with previous reports (Agin 1964; Stein 1967), the model spike generator abruptly began repetitive firing at about 50 Hz, rose to a maximum of 170 Hz, and declined sharply.

²Although in theory Yen's third method produces perfect reconstructions, in practice it is extremely sensitive to amplitude errors. Indeed, these facts may have been known to Cauchy in 1841 (see Black 1953; Jerri 1977; Marks 1991), and were certainly known to Shannon, who wrote, "[t]he 2WT numbers used to specify the function need not be the equally spaced samples. ... For example, the samples can be unevenly spaced, although, if there is considerable bunching, the samples must be known very accurately to give a good reconstruction of the function" (Shannon 1949). We have found that the requirement for high accuracy renders Yen's third method useless in practice. However, Yen's fourth method, which applies the constraint of minimum energy, can produce reasonable reconstructions. Accordingly, all references to Yen's method hereafter denote his fourth method.
Figure 1: The steady-state frequency/intensity (f/I) curve. The model spike generator began firing in response to a 35 nA current injection and eventually reached its maximum response at 1210 nA.

3.2 The Inverting Curve. For the first reconstruction experiments, the firing rate was set at ≈100 Hz by confining the current injection signal to 35-435 nA. That is, s(t) = i_0 + a·i(t), with i_0 = 235 nA, a = 200 nA, and i(t) ∈ [-1, 1]. In order to meet the Nyquist criterion, i(t) was lowpass-filtered to a 40 Hz bandwidth. Ten such signals were injected into the model spike generator. Figure 2 shows the resulting scatterplot [obtained by plotting (ISI_i, s(t_i)) pairs] and the inverting function, f(ISI).
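Mapping a desired current range onto the DC bias and scale factor of s(t) = i_0 + a·i(t) is simple arithmetic. As a sketch (the helper name is ours):

```python
def range_to_bias_scale(lo, hi):
    """Return (i0, a) such that s(t) = i0 + a*i(t), with i(t) in [-1, 1],
    spans exactly the range [lo, hi] (here in nA)."""
    return (lo + hi) / 2.0, (hi - lo) / 2.0

i0, a = range_to_bias_scale(35.0, 435.0)   # the 35-435 nA range used above
```

This reproduces i_0 = 235 nA and a = 200 nA; the "middle" 135-335 nA range used later shares the same i_0 = 235 nA with half the scale.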
3.3 Reconstruction Error and the Sampling Theorem. Next, 10 novel 40 Hz gaussian signals were reconstructed using this inverting curve. Figure 3 shows one representative reconstruction (rRMSE = 0.3648). For all ten signals, the quality of reconstructions was similarly high, with rRMSE = 0.3623 ± 0.0121 and SER = 8.8636 ± 0.2950 dB. To estimate how much of this error was due to the various departures from the Sampling Theorem, the signals were also reconstructed using the techniques shown in Table 3. The average rRMSEs for these different reconstruction methods are given in Table 4.

Figure 2: Scatterplot and inverting curve. Pairs of points (ISI_i, s(t_i)) were plotted (open circles) to form a scatterplot. This scatter of 936 points contains data from 10 1-sec-long 40-Hz bandlimited gaussian noise signals. To obtain an inverting curve (solid line), this scatter of points was fit with the function f(ISI) = 183.565 - 0.433928/ISI - 0.0447669/ISI² + 0.000538129/ISI³.

Comparing methods 4, 5, and 6 to method ST shows the effect of individual departures from the Sampling Theorem. Since E_4/E_SC = 0.89 was the highest error ratio, linear interpolation was the largest single source of error. Less of a penalty was paid for nonexact sample amplitudes (E_5/E_SC = 0.65) and nonuniform sampling (E_6/E_SC = 0.64). As expected, errors for reconstruction methods that combined linear interpolation with still another departure from the Sampling Theorem (e.g., E_1 and E_2) accounted for almost all of the total error of E_SC.

3.4 Tuning the Stimulus to the Spike Generator. We investigated the role of matching the injected current with the spike generator in two ways. First, we held the stimulus bandwidth constant (40 Hz) and varied its amplitude range. We hypothesized that increasing the firing rate (sampling rate) would improve reconstructions. Since the average firing rate was controlled by the DC bias and amplitude of the stimulus, we compared reconstructions from signals restricted to different sections of the f/I curve. Signals were restricted to either the full (35-435 nA), low (35-335 nA), middle (135-335 nA), or high (135-435 nA) range of the f/I curve, and a different scatterplot was constructed for each range. The average firing rates for these ranges were ordered high (106.0 ± 0.5 Hz) > middle (100.0 ± 0.3 Hz) > full (95.2 ± 0.8 Hz) > low (88.1 ± 0.8 Hz). Figure 4
Figure 3: Reconstruction of a time-varying current injection. The original signal (dotted line) was W H z lowpass-filtered gaussian noise with an amplitude range of 35435 nA. The reconstructed signal (solid line), obtained from the inverting function in Figure 2, had an rRMSE of 0.3643 and an SER of 8.7587 LIB. (See Methods for a description of how to calculate rRMSE and SER.) This reconstruction \vas based on 92 pairs of IISI, $1 f , )) points (filled circles). shows that the reconstruction errors were ranked in the opposite order to the firing rates: Y M S E , , , ~< , , rRMSE,,,,,i,11, < rRMSEtt,l1 < YRMSEI,,,, . Thus, larger DC bias, which produced higher firing rates, also produced better reconstructions, in agreement with the Sampling Theorem idea. The second type of stiniulus/spike-generator matching was tested by holding the stimulus amplitude constant (135-435 nA), and varying its bandwidth. Since current injections in this range produced average firing rates of 106 Hz, we hypothesized that stimuli bandlimited to less than 53 Hz would be reconstructed better than those with larger bandwidths. To test this, we compared reconstructions from 20, 40, 60, and 80 Hz bandlimited gaussian noise signals. Again, a different scatterplot was constructed for each bandwidth. As shown in Figure 5, reconstruction error increased with the stimulus bandwidth. Aliasing, most pronounced in the Sampling Theorem method, was seen as a sharp increase in the error when the signal’s bandwidth was changed from 40 to 60 Hz. Interestingly, the reconstruction methods using linear interpolation degraded more gracefully than the methods using nonlinear interpolation.
Table 4: Reconstruction Errors.ᵃ

  Method   Difference(s) from   rRMSE (±SE)     E_x/E_SC
           Sampling Theorem
  SC       NUS, LI, EST         0.239 ± 0.007   1.00
  1        LI, EST              0.252 ± 0.005   1.05
  2        NUS, LI              0.218 ± 0.004   0.91
  3        NUS, EST             0.164 ± 0.005   0.69
  4        LI                   0.213 ± 0.004   0.89
  5        EST                  0.155 ± 0.006   0.65
  6        NUS                  0.153 ± 0.012   0.64
  ST       none                 0.009 ± 0.001   0.04

ᵃFor each of the reconstruction methods described in Table 3, the relative RMS error (rRMSE) is given here, along with the SE (n = 10). Also shown are errors normalized by the scatterplot method's error, E_SC. Note that linear interpolation (method 4) alone accounts for nearly 90% of the total error. The input signals were scaled to the 135-435 nA range.
4 Discussion
This paper has presented a simple technique for reconstructing a continuous, time-varying signal from a simulated spike train. Although this type of study has often been called "decoding," we emphasize our agreement with others (Perkel and Bullock 1968; Bialek and Rieke 1992) that neurons need not decode their incoming spike trains. Still, even without the presence of a literal decoder, information is present implicitly in spike trains. The question motivating the present research is how much information about a bandlimited waveform is preserved given a neuronal spike generator and a hypothesized synaptic decoding process that has memory no further back in time than the previous impulse. The reconstruction method described here is inspired by one biological process (PPF) and by what is, to us, the intuitively appealing restriction of limited temporal inference (no memory for spikes beyond the previous one). Linear interpolation, while no more or less biologically plausible than many other interpolation schemes, is one of the simplest interpolating functions that is consistent with the assumed temporal restriction. Thus, in addition to being a useful tool by which experimenters can decode spike trains, linear interpolation also represents a low-complexity, information-preserving computation that a synapse might be able to accomplish.
Figure 4: Increasing the firing frequency decreased the reconstruction error. Each bar represents the average rRMSE (± SE) for scatterplot reconstructions of 10 input signals that had the same (40 Hz) bandwidth but were rescaled to different amplitude ranges along the f/I curve. As the average firing rate increased (from ≈88 Hz on the left to ≈106 Hz on the right), the rRMSE decreased. Generally, the firing frequency could be predicted by the mean value (DC level) of the stimulus. However, the average firing rate of "middle" range signals was larger than that of the "full" (100.1 ± 0.3 Hz vs. 95.2 ± 0.8 Hz), and the rRMSE was smaller, even though they had the same DC bias. Numbers in parentheses below the amplitude ranges correspond to the average firing rate in Hz.

4.1 Other Spike Generators. It has been suggested, because the Hodgkin-Huxley spike generator has such a narrow dynamic range (Fig. 1) and a regular firing rate, that the reconstruction method described here may not be generally applicable. However, the same reconstruction technique (August and Levy 1994) has also been applied to spike trains from a retinal ganglion cell (RGC) model (Fohlmeister et al. 1990). The wider dynamic range of the RGC model was reflected in a larger coefficient of variation (CV) of the ISI histogram compared to the Hodgkin-Huxley model. For example, the CV of the RGC model, when stimulated with 50 Hz bandwidth gaussian noise, was 0.26. This was over twice the CV of the Hodgkin-Huxley spike generator when stimulated with 40 Hz
[Figure 5 plots reconstruction error against input signal bandwidth (20-80 Hz) for the Sampling Theorem (E_ST), the other nonlinear interpolation methods (E_3, E_5, E_6), and the scatterplot method (E_SC).]

Figure 5: Increasing stimulus bandwidth increased reconstruction error. Signals were scaled to the same (135-435 nA) amplitude range, but were filtered with lowpass cutoffs of 20, 40, 60, or 80 Hz. That is, while average firing frequency remained constant (≈106 Hz), the stimulus bandwidth increased. The markedly increased errors for 60 and 80 Hz bandwidths indicate aliasing. Aliasing was most apparent in the reconstruction methods using nonlinear interpolation (solid lines). For the methods using linear interpolation (dashed lines), the increased slope of the rRMSE curve was less apparent.
(CV = 0.10) or 60 Hz (CV = 0.12) noise. Thus, increasing the spike generator's dynamic range and the irregularity of spiking does not invalidate our approach.

4.2 The Sampling Theorem. This paper has related information transmission to communication theory by relating the average firing frequency of a neuron to the Sampling Theorem. However, while conceptually similar to a classical Sampling Theorem reconstruction, the method here differed by its use of nonuniformly spaced samples, nonexact sample amplitudes, and linear interpolation. A comparison of several different reconstruction methods (Table 3) revealed how much these three error sources contribute to the total overall error. For the particular signals studied, linear interpolation was the most significant single factor in increasing error (Table 4). However, the exact ranking of these errors may
David A. August and William B. Levy
change depending on the bandwidth and amplitude range of the input signals (data not shown). Aliasing, as defined for sinc-function reconstructions (Couch 1987), is a type of reconstruction error caused by high-frequency signal components being "folded" back into lower frequencies, due to an inadequate sampling rate. The Nyquist rate is the sampling rate below which this folding occurs. To investigate the relevance of the Nyquist rate to neuronal communication, we examined the relationship between spike rate and stimulus bandwidth. In varying the firing frequency for a constant stimulus bandwidth (Fig. 4), reconstruction error decreased as firing frequency increased. In varying the input signal bandwidth while maintaining a constant firing frequency (Fig. 5), reconstruction error increased as this bandwidth increased. Thus, relative to communications theory, spike frequency seemed to be the appropriate analog of sampling frequency. Also, the importance of matching this sampling (or spike) frequency to the input signal bandwidth was clearly apparent. In these aliasing studies, the reconstructions using linear interpolation degraded more gracefully than those using nonlinear interpolation as input bandwidth increased (Fig. 5). This observation may be biologically pertinent. Since neuronal signals are not strictly bandlimited at these low frequencies, aliasing will likely be present in vivo. Thus, it is notable that nature can combat aliasing by simplifying the interpolatory scheme from nonlinear to linear. The relationship between cell firing and the Nyquist rate has implications for the experimental design of future studies that deliver a time-varying current injection to a neuron, record a spike train, and then "decode" this spike train into an estimate of the stimulus. While much work has gone into determining the appropriate decoding filter, the question of how the stimulus itself should be chosen remains open.
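The Nyquist relationship described above can be illustrated with a toy calculation. This is a sketch using ordinary uniform sampling and linear interpolation rather than the paper's spike-train decoder; the 106 Hz "spike rate" and the 20 and 80 Hz bandwidths are taken from the figures:

```python
import numpy as np

def reconstruction_rmse(f_signal, sample_rate, duration=1.0):
    """Sample a sinusoid of frequency f_signal (Hz) at sample_rate (Hz),
    reconstruct it by linear interpolation, and return the RMS error
    evaluated on a fine grid."""
    t_fine = np.linspace(0.0, duration, 4001)
    x_true = np.sin(2.0 * np.pi * f_signal * t_fine)
    t_samp = np.arange(0.0, duration, 1.0 / sample_rate)
    x_samp = np.sin(2.0 * np.pi * f_signal * t_samp)
    x_hat = np.interp(t_fine, t_samp, x_samp)   # linear interpolation
    return float(np.sqrt(np.mean((x_hat - x_true) ** 2)))

# With a 106 Hz sample rate, a 20 Hz signal sits below the Nyquist
# frequency (53 Hz) and reconstructs well; an 80 Hz signal is aliased.
err_20 = reconstruction_rmse(20.0, 106.0)
err_80 = reconstruction_rmse(80.0, 106.0)
print(err_20 < err_80)  # → True
```

The error jump above the Nyquist frequency mirrors the increase in rRMSE between the 40 Hz and 80 Hz bandwidth conditions in Figure 5.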
Several researchers have already emphasized the complexity of naturalistic stimuli (Field 1987; Ruderman and Bialek 1994). Here, we propose a specific guideline for neurophysiologists. Gaussian current injection signals should be scaled so that the DC bias and amplitude range produce firing at rates comparable to those observed in vivo. Next, the stimulus bandwidth should be limited to half this average firing rate or less, if a maximal capacity measurement is the goal. The relationship of the reconstruction method proposed here, the Sampling Theorem, and other decoding studies can be understood as follows. Bialek and colleagues have shown the theoretical optimality of linear decoding filters when the firing rate, R, becomes very small compared to the stimulus bandwidth, W (Bialek et al. 1993), and this has been confirmed experimentally by decoding sensory system spike trains with linear filters (Bialek et al. 1991; Warland et al. 1992; Theunissen 1993). Thus, linear decoding filters can be used quite successfully in the R < 2W range of neural dynamics. However, the Sampling Theorem suggests that the R > 2W range of dynamics may also be of interest. In this case, linear decoding
may no longer be optimal, and the experimenter faces the more difficult task of constructing nonlinear filters. The present study shows, empirically, that a very simple (triangular-shaped) nonlinear filter, equivalent to linear interpolation, can still produce high-quality reconstructions. This method should prove useful to experimenters interested in decoding more slowly varying stimuli from spike trains with higher firing rates.

4.3 Limitations of the Model. The model of information transmission by ISIs applies to neural systems with a PPF decay similar to the average firing rate, a relatively small conduction jitter, and a relatively large quantal content. We hypothesize that if these conditions are not met, then the spike train is transmitting a frequency code. First, our reconstruction method requires that the average ISIs for the postsynaptic cell lie along a range of moderate slope on the PPF-like decoding curve (Fig. 2). If ISIs fall predominantly along the nearly flat region, then the EPSPs would all be the same size, as is the case for linear filtering methods (Bialek et al. 1991). At the squid giant synapse, the first component of PPF decays over 5-10 msec (Charlton and Bittner 1978), which is suitably matched to the ≈ 100 Hz firing rate that has been used here. Similarly, PPF at spinal interneuron-motoneuron and at corticorubral synapses decays over 50 msec (Murakami et al. 1977; Kuno and Weakly 1972), which would be appropriate for cells firing at 10-20 Hz. Second, poor reconstructions could result from ISIs being distorted by a large axonal conduction jitter. Many neural systems have relatively little jitter. For example, jitter is ≈ 4 μsec in the frog sciatic nerve (Rapoport and Horvath 1960), < 50 μsec in human motor axons (Salmi 1983; Stalberg et al. 1971), 100-200 μsec in various reflex arcs (Trontelj 1973; Trontelj and Trontelj 1978), < 50 μsec in the barn owl auditory system (Rose et al.
1967), < 40 μsec in the bat echolocation system (Simmons 1979), and < 1 μsec in the weakly electric fish (Bullock 1970). In fact, jitter as a noise source was already implicit in the reconstruction method here. Since the decoding function was fit to a scatterplot, the method must have been robust to at least the amount of scatter about this curve. For the 40 Hz signal shown in Figure 2, the widest scatter was approximately 200 μsec, which provides a lower bound on the maximum tolerable jitter. Third, the reconstruction technique will not work at synapses with low and variable quantal content [e.g., hippocampal region CA1 (Allen and Stevens 1994; Bekkers and Stevens 1990; Foster and McNaughton 1991; Hessler et al. 1993)], but is applicable where quantal content is high [e.g., squid giant synapse (Miledi 1967), frog neuromuscular junction (Martin 1955), and the climbing fiber synapses on cerebellar Purkinje cells (Llinas et al. 1969)]. Because we invoke a PPF-like effect for decoding, and because PPF is thought to be caused by changes in release probability, p, a fairly large number of release sites, n, would be required to detect the PPF-induced variability in np. Further, by emphasizing the importance of individual ISIs for carrying information, we assume reliable synaptic
transmission (e.g., np >> 1, and a high safety factor for spike invasion). Therefore, an explicit conclusion is that systems with low safety factors or low release probability and few release sites will not use ISI codes. Finally, we note that this study has approached neuronal information transmission differently than nature. We adjusted the stimulus bandwidth and amplitudes to match the spike generator. In nature, however, neurons would presumably co-evolve so that spike generators, interpolators, and dendritic filters matched. That is, we expect that evolution has discovered some simple filtering and interpolation schemes that avoid substantial information loss.
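The decoding scheme discussed throughout (each ISI mapped through a PPF-like curve to an amplitude estimate, and the samples joined by linear interpolation) can be sketched as follows. The decoding curve used here is a hypothetical stand-in for the scatterplot fit of Figure 2:

```python
import numpy as np

def decode_spike_train(spike_times, decoding_curve, t_eval):
    """Reconstruct a stimulus estimate from spike times.

    Each interspike interval (ISI) is mapped to an amplitude sample by
    `decoding_curve` (in the paper, a curve fit to the scatterplot of
    ISI vs. stimulus amplitude; here, any callable).  Each sample is
    placed at the spike that closes its ISI, and the samples are joined
    by linear interpolation.
    """
    spike_times = np.asarray(spike_times, dtype=float)
    isis = np.diff(spike_times)                   # nonuniform "sampling intervals"
    amps = decoding_curve(isis)                   # one amplitude per ISI
    return np.interp(t_eval, spike_times[1:], amps)

# hypothetical decoding curve: shorter ISI -> larger amplitude
curve = lambda isi: 1.0 / isi
t = np.array([0.02, 0.04, 0.06, 0.08, 0.10])
estimate = decode_spike_train([0.0, 0.01, 0.03, 0.06, 0.10], curve, t)
```

Because the decoding curve is fit to a scatterplot, any jitter smaller than the scatter about the fitted curve is absorbed by the fit, consistent with the robustness argument above.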
Acknowledgments

This research was supported in part by NIH GM07267 and MH10702 to D.A.A., NIH MH00622 and MH48161 to W.B.L., and EPRI RP803008 to P. Papantoni-Kazakos, and by the Department of Neurosurgery, University of Virginia, Dr. John A. Jane, Chairman. The authors would like to thank Steve Wilson and Chris Fall for their constructive comments.
References

Agin, D. 1964. Hodgkin-Huxley equations: Logarithmic relation between membrane current and frequency of repetitive activity. Nature (London) 201, 625-626.
Allen, C., and Stevens, C. F. 1994. An evaluation of causes for unreliability of synaptic transmission. Proc. Natl. Acad. Sci. U.S.A. 91, 10380-10383.
August, D. A., and Levy, W. B. 1994. Information maintenance by retinal ganglion cell spikes. In The Neurobiology of Computation, J. M. Bower, ed., pp. 41-46. Kluwer, Norwell, MA.
Bekkers, J. M., and Stevens, C. F. 1990. Presynaptic mechanism for long-term potentiation in the hippocampus. Nature (London) 346, 724-729.
Bialek, W., and Rieke, F. 1992. Reliability and information transmission in spiking neurons. TINS 15(11), 428-434.
Bialek, W., Rieke, F., and de Ruyter van Steveninck, R. 1991. Reading a neural code. Science 252, 1854-1857.
Bialek, W., DeWeese, M., Rieke, F., and Warland, D. 1993. Bits and brains: Information flow in the nervous system. Physica A 200, 581-593.
Black, H. S. 1953. Modulation Theory. D. Van Nostrand, New York.
Bower, J. M., and Beeman, D. 1995. The Book of GENESIS. Springer-Verlag Telos, New York.
Bullock, T. H. 1970. The reliability of neurons. J. Gen. Physiol. 55, 565-584.
Charlton, M. P., and Bittner, G. D. 1978. Facilitation of transmitter release at squid synapses. J. Gen. Physiol. 72, 471-486.
Couch, L. W. 1987. Digital and Analog Communication Systems. Macmillan, New York.
de Ruyter van Steveninck, R., and Bialek, W. 1988. Real-time performance of a movement-sensitive neuron in the blowfly visual system: Coding and information transfer in short spike sequences. Proc. R. Soc. London B 234, 379-414.
Field, D. J. 1987. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A 4(12), 2379-2394.
Fohlmeister, J. F., Coleman, P. A., and Miller, R. F. 1990. Modeling the repetitive firing of retinal ganglion cells. Brain Res. 510, 343-345.
Foster, T. C., and McNaughton, B. L. 1991. Long-term enhancement of CA1 synaptic transmission is due to increased quantal size, not quantal content. Hippocampus 1(1), 79-91.
Hessler, N. A., Shirke, A. M., and Malinow, R. 1993. The probability of transmitter release at a mammalian central synapse. Nature (London) 366, 569-572.
Hodgkin, A. L., and Huxley, A. F. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. 117, 500-544.
Jerri, A. J. 1977. The Shannon sampling theorem, its various extensions and applications: A tutorial review. Proc. IEEE 65(11), 1565-1596.
Katz, B., and Miledi, R. 1968. The role of calcium in neuromuscular facilitation. J. Physiol. 195, 481-492.
Kuno, M., and Weakly, J. N. 1972. Facilitation of monosynaptic excitatory synaptic potentials in spinal motoneurones evoked by internuncial impulses. J. Physiol. 224, 271-286.
Llinas, R., Bloedel, J. R., and Hillman, D. E. 1969. Functional characterization of neuronal circuitry of frog cerebellar cortex. J. Neurophysiol. 32(6), 847-870.
Marks, R. J. 1991. Introduction to Shannon Sampling and Interpolation Theory. Springer-Verlag, New York.
Martin, A. R. 1955. A further study of the statistical composition of the end-plate potential. J. Physiol. 130, 114-122.
Miledi, R. 1967. Spontaneous synaptic potentials and quantal release of transmitter in the stellate ganglion of the squid. J. Physiol. 192(2), 379-406.
Murakami, F., Tsukahara, N., and Fujito, Y. 1977. Properties of synaptic transmission of the newly formed cortico-rubral synapses after lesion of the nucleus interpositus of the cerebellum. Exp. Brain Res. 30, 245-258.
Nyquist, H. 1928. Certain topics in telegraph transmission theory. AIEE Trans. 47, 617-644.
Perkel, D. H., and Bullock, T. H. 1968. Neural coding. Neurosci. Res. Prog. Bull. 6(3), 227-348.
Rapoport, A., and Horvath, W. J. 1960. The theoretical channel capacity of a single neuron as determined by various coding schemes. Information Control 3, 335-350.
Rieke, F., Yamada, W., Moortgat, K., Lewis, E. R., and Bialek, W. 1992. Real time coding of complex sounds in the auditory nerve. Adv. Biosci. 83, 315-322.
Rieke, F., Warland, D., and Bialek, W. 1993. Coding efficiency and information rates in sensory neurons. Europhys. Lett. 22(2), 151-156.
Rose, J. E., Brugge, J. F., Anderson, D. J., and Hind, J. E. 1967. Phase-locked response to low-frequency tones in single auditory nerve fibers of the squirrel monkey. J. Neurophysiol. 30, 769-793.
Ruderman, D. L., and Bialek, W. 1994. Statistics of natural images: Scaling in the woods. In Advances in Neural Information Processing Systems, J. D. Cowan, G. Tesauro, and J. Alspector, eds., Vol. 6, pp. 53-38. Morgan Kaufmann, San Mateo, CA.
Sakuranaga, M., Ando, Y.-I., and Naka, K.-I. 1987. Dynamics of the ganglion cell response in the catfish and frog retinas. J. Gen. Physiol. 90, 229-259.
Salmi, T. 1983. A duration matching method for the measurement of jitter in single fibre EMG. Electroencephalogr. Clin. Neurophysiol. 56, 515-520.
Shannon, C. E. 1949. Communication in the presence of noise. Proc. IRE 37, 10-21.
Simmons, J. A. 1979. Perception of echo phase information in bat sonar. Science 204, 1336-1338.
Stalberg, E., Ekstedt, J., and Broman, A. 1971. The electromyographic jitter in normal human muscles. Electroencephalogr. Clin. Neurophysiol. 31, 429-438.
Stein, R. B. 1967. The frequency of nerve action potentials generated by applied currents. Proc. Royal Soc. London B 167, 61-86.
Theunissen, F. E. 1993. An investigation of sensory coding principles using advanced statistical techniques. Ph.D. thesis, University of California, Berkeley, Berkeley, CA.
Trontelj, J. V. 1973. A study of the H-reflex by single fibre EMG. J. Neurol. Neurosurg. Psych. 36, 951-959.
Trontelj, M. A., and Trontelj, J. V. 1978. Reflex arc of the first component of the human blink reflex: A single motoneurone study. J. Neurol. Neurosurg. Psych. 41, 338-517.
Warland, D., Landolfa, M., Miller, J. P., and Bialek, W. 1992. Reading between the spikes in the cercal filiform hair receptors of the cricket. In Analysis and Modeling of Neural Systems, F. H. Eeckman, ed., pp. 327-333. Kluwer, Norwell, MA.
Whittaker, J. M. 1915. On the functions which are represented by the expansion of interpolating theory. Proc. Roy. Soc. Edinburgh 35, 181-194.
Yen, J. L. 1956. On nonuniform sampling of bandwidth-limited signals. IRE Trans. Circ. Theory CT-3, 251-257.
Zucker, R. S. 1989. Short-term synaptic plasticity. Annu. Rev. Neurosci. 12, 13-31.
Communicated by Bruce McNaughton
A Model of Spatial Map Formation in the Hippocampus of the Rat

Kenneth I. Blum* and L. F. Abbott
Center for Complex Systems, Brandeis University, Waltham, MA 02254 USA

Using experimental facts about long-term potentiation (LTP) and hippocampal place cells, we model how a spatial map of the environment can be created in the rat hippocampus. Sequential firing of place cells during exploration induces, in the model, a pattern of LTP between place cells that shifts the location coded by their ensemble activity away from the actual location of the animal. These shifts provide a navigational map that, in a simulation of the Morris maze, can guide the animal toward its goal. The model demonstrates how behaviorally generated modifications of synaptic strengths can be read out to affect subsequent behavior. Our results also suggest a way that navigational maps can be constructed from experimental recordings of hippocampal place cells.

Blockade of long-term potentiation (LTP) and hippocampal lesions drastically impair the ability of rodents to navigate to a goal using distal cues (Morris et al. 1986, 1982; Jarrard 1993; O'Keefe and Nadel 1978). This has led to suggestions that the hippocampus plays a role in navigation (McNaughton et al. 1991; Worden 1992; Hetherington and Shapiro 1993; Burgess et al. 1994; Wan et al. 1994) by providing a cognitive map of the spatial environment (O'Keefe and Nadel 1978). It has further been suggested that this cognitive map is stored by potentiated synaptic weights representing both spatial (Muller et al. 1991; Traub et al. 1992) and temporal (Levy 1989) correlations. In recent experiments the activity of an ensemble of place cells was decoded to reveal the location represented by their collective firing (Wilson and McNaughton 1993). The ability to decode place cell ensemble output provides an opportunity to examine ideas about spatial maps quantitatively and to explore specific mechanisms by which they can be created and stored.
We examine the effect of LTP on the position encoded by place cell ensemble firing and show how it can produce a cognitive map useful for navigation. *Current address: Center for Learning and Memory and Dept. of Brain and Cognitive Sciences, Massachusetts Institute of Technology, E25-236, Cambridge, MA 02139.
Neural Computation 8, 85-93 (1996)
© 1995 Massachusetts Institute of Technology
Kenneth I. Blum and L. F. Abbott
What information is stored in the hippocampal cognitive map, how is it stored, and how is it read out to guide navigation? We propose that this information resides in shifts of the position coded by hippocampal place cell activity that arise from synaptic potentiation. The prevalence of place cells in the hippocampus might lead to the assumption that this region primarily serves to represent the spatial location of the animal. However, knowledge of spatial location is not the function normally associated with a map. Rather, a map serves to suggest directions for future movement based on knowledge of present location. We suggest that the cognitive map in the hippocampus plays this role. We assume that the spatial location of the animal is determined from sensory input outside the hippocampus, that this information is available to the animal, and is also transferred to the hippocampus. In our model, the role of the hippocampus is not merely to report this position, but to suggest directions of future motion on the basis of previous experience. We will first show how the location represented by place cell activity shifts in a direction that reflects the past experience of the animal in the environment and then discuss how this shifted location can be compared with the present location of the animal to provide a navigational cue. Our model for the storage and read-out of a navigational map makes use of three key ingredients, all supported by experimental data. First, NMDA-dependent LTP in hippocampal slices occurs only if presynaptic activity precedes postsynaptic activity by less than approximately 200 msec (Levy and Steward 1983; Gustafsson et al. 1987). Presynaptic activity following postsynaptic firing produces either no LTP or long-term depression (Debanne et al. 1994).
Second, place cells (neurons broadly tuned to location) exist in the hippocampus and make synaptic connections with each other both within the CA3 region and between CA3 and CA1 (O'Keefe and Dostrovsky 1971; Amaral 1987). Third, a spatial location can be determined by appropriately averaging or fitting the activity of an ensemble of hippocampal place cells, as has been done in other systems (Georgopoulos et al. 1986; Salinas and Abbott 1994). This coded position is near, but not necessarily identical to, the true location of the animal (Muller and Kubie 1989; Wilson and McNaughton 1993). These three observations imply that when an animal travels through its environment causing different sets of place cells to fire, information about both temporal and spatial aspects of its motion will be reflected in changes of the strengths of synapses between place cells. Since this LTP affects subsequent place cell firing, it can shift the spatial location collectively coded by place cell activity. We compute how this coded location is shifted relative to the true position of the animal and find the following: (1) If LTP occurs while an animal is sitting at a specific point, the coded location is shifted toward this point. (2) If LTP occurs while an animal traverses a specific path, the coded location is shifted toward and forward along the path. (3) If many locations and paths contribute to
Spatial Map Formation in Rat Hippocampus
LTP, the shifts reflect the entire history of spatial exploration and provide a map of the environment useful for navigation. Shifts between the coded and actual positions arise from the following mechanism. During locomotion, cells with place fields overlapping a path being traveled are sequentially activated. A moving rat covers a few centimeters in the 200-msec time window for LTP induction. Thus, synapses from presynaptic cells with place fields overlapping a path to postsynaptic cells with fields a few centimeters forward along the path will be potentiated. Subsequently, when the animal is on the learned path, activated cells will excite neurons with place fields ahead of them along the path through the potentiated synapses. This shifts the "center of gravity" of the firing pattern, and thus the coded location, forward along the path. Similarly, synapses from cells with place fields beside the path to those on it will be potentiated, and will then shift the coded position toward the path. For details of the calculation see the section on mathematical results and Abbott and Blum (1995). These shifts suggest that an animal could navigate by heading from its present location toward the position coded by place cell activity. To illustrate both how a spatial map arises and how it can be used to guide movement, we applied these ideas to navigation in the Morris maze. We should stress that we are not modeling how a rat solves the problem of navigating to a hidden platform in this task. We are studying, instead, how the experiences of the animal while it learns the task are recorded in a cognitive map, and how this recorded information can then be used to help in performance of the task. Similarly, we do not model how place fields arise but, instead, study how they are affected by LTP.
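The "center of gravity" mechanism can be made concrete with a small numerical sketch. All numbers here are hypothetical: a 1D track with gaussian place fields, and a weight matrix potentiated a few centimeters forward along a left-to-right path:

```python
import numpy as np

# 1D track with gaussian place fields (illustrative parameters)
centers = np.arange(0.0, 200.0, 1.0)   # place-field centers, 1 cm apart
sigma = 7.0                            # place-field width (cm)

def rates(x):
    # unpotentiated firing rates f_i(x) for an animal at position x
    return np.exp(-(x - centers) ** 2 / (2 * sigma ** 2))

def coded_position(r):
    # "center of gravity" of the population activity
    return float(np.sum(r * centers) / np.sum(r))

# asymmetric LTP: synapses onto cells with fields ~3 cm *forward* along
# a left-to-right path are potentiated
d = centers[:, None] - centers[None, :]        # d[i, j] = center_i - center_j
W = 0.3 * np.exp(-(d - 3.0) ** 2 / (2 * 2.0 ** 2))

x = 100.0
r0 = rates(x)                 # activity before potentiation
r1 = r0 + W @ r0              # activity after potentiation
print(coded_position(r0), coded_position(r1))  # second value is shifted forward
```

Before potentiation the decoded position equals the true position; afterward, the extra excitation of forward-lying cells pulls the decoded position ahead along the path, which is exactly the shift the model reads out as a navigational cue.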
In the model, the animal was represented by a point that moved with a velocity composed of two equal parts: a component at a random angle with respect to the current velocity, drawn from a uniform distribution of width one radian, and a component parallel to the LTP-induced shift of the coded position. The velocity was normalized to a 20 cm/sec swimming speed. LTP occurred continually. In this form the model had mixed success. If the starting position was gradually moved away from the platform on successive runs, an efficient path from any location to the platform was learned robustly. If, however, the starting location was chosen at random around the perimeter, an instability could result: a loop in the trajectory could be reinforced at a location away from the platform. There are several biologically reasonable ways to solve this problem. We chose to introduce a simple reward system. Synapses were potentiated only when the platform was reached, and recently activated synapses were potentiated more strongly than synapses activated early in the trial. The amount of LTP was weighted by an exponential factor with a time constant of 4 sec. This scheme is not supposed to be a realistic description of how the animal solves the navigational task. However, it provides a simple way for the task to be solved in the model so that we can examine how the spatial map forms and what its role might be.
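The exponential reward weighting described above can be sketched directly. The function name and the arrival-time bookkeeping are hypothetical conveniences; only the 4-sec time constant and the exp(-(T - t)/tau) weighting come from the model:

```python
import numpy as np

TAU_REWARD = 4.0  # sec; eligibility time constant used in the model

def reward_weighted_ltp(coactivation_times, arrival_time, base_ltp=1.0):
    """LTP applied to each synapse when the platform is reached.

    A synapse coactivated at time t during the run is potentiated by an
    amount weighted by exp(-(T - t) / tau), so recently active synapses
    are strengthened most, as in the model's reward scheme.
    """
    t = np.asarray(coactivation_times, dtype=float)
    return base_ltp * np.exp(-(arrival_time - t) / TAU_REWARD)

# a synapse coactive 8 sec before arrival receives e**-2 (about 13.5%)
# of the potentiation given to one coactive at arrival
w = reward_weighted_ltp([2.0, 10.0], arrival_time=10.0)
print(w)  # [exp(-2), 1.0]
```

Gating potentiation on reaching the platform suppresses the loop instability: a reinforced loop far from the goal never triggers the reward, so its synapses are not strengthened further.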
Figure 1: The path followed an underlying navigational map in a simulated Morris water maze. Starting positions were chosen at random locations along the perimeter of the 1 m "tank." Runs proceeded until the computed path intersected the 10-cm diameter platform or until 100 sec had elapsed. At the outer wall an inward radial component was added to the velocity to simulate a rebound. The arrows show the navigational map consisting of the shifts between coded and actual positions plotted on a grid of actual positions (see the section on mathematical results). Longer arrows have been compressed for visual clarity. (a) The second path and navigational map of a typical run. (b) The twentieth path and navigational map of the same run. (c) The twenty-first path and navigational map of the same run, when no platform was present.

The results of a typical simulation are illustrated in Figure 1. Early in a trial, paths meander until they get close to the target. Ultimately, the approach to the platform is quite direct and efficient. Figure 1c shows a simulation of the transfer test in which the platform is removed following training. The path goes directly to the region where the platform had been and stays in that vicinity.
Figure 2: The escape times averaged over 40 runs as a function of trial number. Each run consisted of 20 trials as in Figure 1. No decrease in latency was observed when simulated LTP or guidance by the navigational map was removed (results not shown).

The navigational map that guides the motion in these simulations evolves as a function of the trajectories followed. The map at various stages of the simulation is shown by the arrows in Figure 1. Initially, all the arrows have zero length since no LTP-induced shift has yet occurred. After a few trials, the arrows point toward the goal, but only over a limited range of the environment near the platform. As subsequent paths lead toward the platform, the range of this directed map extends outward. In Figure 1c all the arrows point roughly toward the platform, indicating that hippocampal place cell activity can provide the information needed to find the platform from any initial location. Figure 2 shows how the average time required to find the platform decreases as a function of trial number. There is a decrease in latency over the first 10 trials followed by consistently efficient platform finding behavior. These results resemble the performance of normal animals (Morris et al. 1982, 1986). We propose that the difference between ensemble coded position and actual location acts as a cognitive map capable of guiding navigation through the environment. We speculate that this could occur in the following way. Information about the actual location of the animal enters the hippocampus from the entorhinal cortex, passing to both CA3 and CA1 regions. The shifted representation that develops due to LTP between place cells is likely to arise in CA3 with its extensive recurrent collaterals. Since CA1 receives input from both CA3 and directly from entorhinal cortex, it could simultaneously represent both the true location and the shifted position. From such a distributed, dual representation it
is possible to extract the difference, and this could guide the direction of locomotion (Andersen et al. 1985; Zipser and Andersen 1988; Salinas and Abbott 1994). This idea is completely consistent with the presence of place cells in CA1, but would predict that their activity depends on both the actual location of the animal and on the location coded in CA3. Unfortunately, since these two locations cannot easily be varied independently, it will be difficult to establish whether or not a dual representation exists in CA1. However, if it does, then a downstream network can extract the vector difference needed to guide navigation (Salinas and Abbott 1995). Of course, the comparison of the coded and actual positions could just as well take place anywhere along the pathway from CA3 output to motor activity (Muller and Kubie 1989). Asymmetric synaptic weights develop in our model because of the temporal asymmetry of LTP induction and because place fields are activated sequentially during locomotion. The phase dependence of place cell firing with respect to theta oscillations reported by O'Keefe and Recce (1993) may also play a role. Within each theta cycle, activated place cells tend to fire in the order that the animal encountered their place fields. If we assume that the window for LTP induction does not extend from one theta cycle to the next, this will contribute to the asymmetric potentiation that is central to our model and to the resulting shifts of the coded location in the forward direction along a path. Additional, more complex contributions will arise if the LTP window extends across theta cycles (O'Keefe and Recce 1993). Our results show that information about trajectory history can be stored in synaptic strengths and that it can be communicated to subsequent networks through changes in the overall pattern of neuronal activity.
Our model is not meant to be a complete description of the mechanisms by which a rat navigates, but one element of a navigational system. This element is particularly interesting because we can relate it to specific neuronal and molecular mechanisms: place cell ensemble coding and properties of the NMDA receptor (Lester et al. 1990; Hestrin et al. 1990; Jahr and Stevens 1990) that give rise to temporally asymmetric LTP (Levy and Steward 1983; Gustafsson et al. 1987). The changes in place cell activity that we have computed arise inevitably if NMDA-mediated LTP occurs between place cells during locomotion. The model predicts that the position of an animal that is moving toward a goal should lag the location decoded from place cell activity. Although we are unable to predict the magnitude of this effect precisely, we expect that it is smaller than the reported 5 cm tracking uncertainty (Wilson and McNaughton 1993). Muller and Kubie (1989) report that place fields are best described as centered on future location and speculate about a navigational role for this effect. Direct experimental tests of our model are feasible. Individual place fields should be altered by learning paths through the environment, and the shifts in the location decoded from place cell ensembles may be observable. For example, place fields lying along a path that is frequently
traversed in one direction should elongate and move backward along the path. Furthermore, it has recently been shown that place cell pairs correlated during behavior have enhanced correlation during subsequent slow-wave sleep (Wilson and McNaughton 1994). The asymmetric, short-latency, pair-correlation functions measured in the hippocampus during sleep may reflect synaptic weights because the inputs to the hippocampus are less correlated during slow-wave sleep than during behavior. The correlation matrix and measured average firing rates can be used to generate navigational maps like those of Figure 1 directly from experimental data using equation 1.3 below. The existence of a bias in the shift arrows toward a goal location would provide dramatic evidence of a navigational map in the hippocampus consistent with our model.

1 Mathematical Results
Let H(t’) represent the rate of LTP induction for unit firing rates when presynaptic activity precedes postsynaptic activity by a time t’. If LTP occurs during motion along a path X(t) the strength of the synapse from place cell j to place cell i is enhanced by
nw,, =p
t ‘ H(t’)f,[X(t + t ’ ) ] f i [ X ( t ) ]
(1.1)
where f l is the average firing rate of place cell i as a function of position. This equation allows the synaptic weights to grow without bound but in the simulations we constrained the weights so that the resulting shift is less than unit magnitude. After LTP the firing rate of place cell i is Y, = fl(x) + AW,,fi(x) when the animal is at location x. The coded position is given by
x_{coded} = \frac{\sum_i R_i s_i}{\sum_i R_i}     (1.2)

where s_i is the center of the place field for cell i. From this we find, to first order in \Delta W, the shift of the coded position away from the true position x,

\Delta x(x) = x_{coded} - x \approx \frac{\sum_{i,j} \Delta W_{ij} f_j(x) (s_i - x)}{\sum_i f_i(x)}     (1.3)
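To make the decoding step concrete, the sketch below implements the population average of equation 1.2 on a one-dimensional track and adds an asymmetric weight matrix of the kind equation 1.1 produces for repeated runs in one direction. The Gaussian tuning curves, the grid of field centers, and the exponential weight profile (eps, d0) are illustrative assumptions, not values from the text.

```python
import numpy as np

sigma = 7.0                                  # place field width (cm), from the text
centers = np.linspace(0.0, 100.0, 201)       # field centers s_i on a 1-D track (assumed)

def f(x):
    """Gaussian place cell firing rates f_i(x) (unit height for simplicity)."""
    return np.exp(-(x - centers) ** 2 / (2 * sigma ** 2))

def decode(r):
    """Coded position: rate-weighted average of field centers (equation 1.2)."""
    return np.sum(r * centers) / np.sum(r)

# Asymmetric weights from rightward runs: cell j at s_j strengthens synapses
# onto cells i lying ahead of it (s_i > s_j); amplitude and range are assumed.
eps, d0 = 0.02, 5.0
D = centers[:, None] - centers[None, :]      # D[i, j] = s_i - s_j
W = eps * np.where(D > 0, np.exp(-D / d0), 0.0)

x = 50.0
r_before = f(x)
r_after = r_before + W @ r_before            # R_i = f_i + sum_j dW_ij f_j
print(decode(r_before), decode(r_after))
```

Before LTP the decoded position is unbiased; afterward it shifts forward of the true position, which is the effect the model attributes to temporally asymmetric LTP.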
To evaluate this expression we use gaussians of width \sigma and height R_{max} for the f_i, replace sums over place cells with integrals over their place field centers, and use the approximation f_i[X(t+t')] \approx f_i[X(t)] + t' \dot{X}(t) \cdot \nabla f_i[X(t)]. This gives the result

\Delta x(x) \approx h \tau \pi R_{max}^2 \rho \sigma^2 \int dt \, \dot{X}(t) \, \exp(-|x - X(t)|^2 / 4\sigma^2)     (1.4)

with h the integral of H, \rho the place field density, \tau the average LTP window time (the first moment of H), and \dot{X} the velocity of the learned path. Other decoding methods (Salinas and Abbott 1994) give similar results. For the figures, we integrated equation 1.4 numerically with h \tau \pi R_{max}^2 \rho \sigma^2 = 0.4, \sigma = 7 cm, and \tau = 200 msec.
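A minimal numerical integration of this shift along a straight learned path, in the spirit of the numerical evaluation of equation 1.4 described above. The width σ is the value quoted in the text; the running speed, the Gaussian kernel form, and the lumped constant C (echoing the quoted dimensionless combination 0.4) are our assumptions.

```python
import numpy as np

sigma = 7.0        # place field width (cm), from the text
C = 0.4            # stands in for h * tau * pi * Rmax^2 * rho * sigma^2 (quoted value)
speed = 25.0       # running speed along the learned path (cm/s), assumed

dt = 0.001
t = np.arange(-2.0, 2.0, dt)
X = np.outer(t, [speed, 0.0])                 # straight path along the x-axis
Xdot = np.tile([speed, 0.0], (len(t), 1))     # constant velocity along the path

def decoded_shift(x):
    """Shift of the decoded position at location x: C * integral dt K(x - X(t)) Xdot(t)."""
    d2 = np.sum((x - X) ** 2, axis=1)
    k = np.exp(-d2 / (4 * sigma ** 2))        # Gaussian kernel around the path
    return C * np.sum(k[:, None] * Xdot, axis=0) * dt

shift_on_path = decoded_shift(np.array([0.0, 0.0]))
print(shift_on_path)
```

On the path the shift points along the direction of motion, so the decoded location runs ahead of the animal, as the discussion above predicts.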
Kenneth I. Blum and L. F. Abbott
Acknowledgments

Supported by NSF-DMS9208206, the W. M. Keck Foundation, and the McDonnell-Pew Centre for Cognitive Neuroscience at Oxford (L.A.) and NIH-NS07292 (K.B.). We thank Marco Idiart and Eve Marder for discussions.
References
Abbott, L. F., and Blum, K. I. 1995. Functional significance of long-term potentiation for sequence learning and prediction. Cerebral Cortex (in press).
Amaral, D. 1987. Memory: Anatomical organization of candidate brain regions. In Handbook of Physiology, Sec. 1: The Nervous System, F. Plum, ed., Vol. V, pp. 211-294. Oxford University Press, New York.
Andersen, R. A., Essick, G. K., and Siegel, R. M. 1985. The encoding of spatial location by posterior parietal neurons. Science 230, 456.
Burgess, N., Recce, M., and O'Keefe, J. 1994. A model of hippocampal function. Neural Networks 7, 1065-1081.
Debanne, D., Gähwiler, B. H., and Thompson, S. M. 1994. Asynchronous pre- and postsynaptic activity induces associative long-term depression in area CA1 of the rat hippocampus in vitro. Proc. Natl. Acad. Sci. U.S.A. 91, 1148-1152.
Georgopoulos, A. P., Schwartz, A., and Kettner, R. E. 1986. Neuronal population coding of movement direction. Science 233, 1416-1419.
Gustafsson, B., Wigstrom, H., Abraham, W. C., and Huang, Y.-Y. 1987. Long-term potentiation in the hippocampus using depolarizing current pulses as the conditioning stimulus to single volley synaptic potentials. J. Neurosci. 7, 774-780.
Hestrin, S., Sah, P., and Nicoll, R. A. 1990. Mechanisms generating the time course of dual component excitatory synaptic currents recorded in hippocampal slices. Neuron 5, 247-253.
Hetherington, P. A., and Shapiro, M. L. 1993. A simple network model simulates hippocampal place fields: II. Computing goal-directed trajectories and memory fields. Behav. Neurosci. 107, 434.
Jahr, C. E., and Stevens, C. F. 1990. A quantitative description of NMDA receptor channel kinetic behavior. J. Neurosci. 10, 1830-1837.
Jarrard, L. E. 1993. On the role of the hippocampus in learning and memory in the rat. Behav. Neural Biol. 60, 9.
Lester, R. A. J., Clements, J. D., Westbrook, G. L., and Jahr, C. E. 1990. Channel kinetics determine the time course of NMDA receptor-mediated synaptic currents. Nature (London) 346, 565-567.
Levy, W. B. 1989. A computational approach to hippocampal function. In Computational Models of Learning in Simple Neural Systems, R. D. Hawkins and G. H. Bower, eds., pp. 243-305. Academic Press, San Diego, CA.
Levy, W. B., and Steward, O. 1983. Temporal contiguity requirements for long-term associative potentiation/depression in the hippocampus. Neuroscience 8, 791-797.
McNaughton, B. L., Chen, L. L., and Markus, E. J. 1991. "Dead reckoning," landmark learning, and the sense of direction: A neurophysiological and computational hypothesis. J. Cogn. Neurosci. 3, 190.
Morris, R. G. M., Schenk, F., Garrud, P., Rawlins, J. N. P., and O'Keefe, J. 1982. Place navigation impaired in rats with hippocampal lesions. Nature (London) 297, 681.
Morris, R. G. M., Anderson, E., Lynch, G. S., and Baudry, M. 1986. Selective impairment of learning and blockade of long-term potentiation by an N-methyl-D-aspartate receptor antagonist, AP5. Nature (London) 319, 774-776.
Muller, R. U., and Kubie, J. L. 1989. The firing of hippocampal place cells predicts the future position of freely moving rats. J. Neurosci. 9, 4101-4110.
Muller, R. U., Kubie, J. L., and Saypoff, R. 1991. The hippocampus as a cognitive graph. Hippocampus 1, 243-246.
O'Keefe, J., and Dostrovsky, J. 1971. The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely-moving rat. Brain Res. 34, 171-175.
O'Keefe, J., and Nadel, L. 1978. The Hippocampus as a Cognitive Map. Clarendon, London.
O'Keefe, J., and Recce, M. L. 1993. Phase relationship between hippocampal place units and the EEG theta rhythm. Hippocampus 3, 317-330.
Salinas, E., and Abbott, L. F. 1994. Vector reconstruction from firing rates. J. Computational Neurosci. 1, 89-107.
Salinas, E., and Abbott, L. F. 1995. Transfer of information between sensory and motor networks. J. Neurosci. (in press).
Traub, R. D., Miles, R., Muller, R. U., and Gulyas, A. I. 1992. Functional organization of the hippocampal CA3 region: Implications for epilepsy, brain waves and spatial behavior. Network 3, 465.
Wan, H. S., Touretzky, D. S., and Redish, A. D. 1994. Towards a computational theory of rat navigation. In Proceedings of the 1993 Connectionist Models Summer School, M. C. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman, and A. S. Weigend, eds., pp. 11-19. Lawrence Erlbaum, Hillsdale, NJ.
Wilson, M. A., and McNaughton, B. L. 1993. Dynamics of the hippocampal ensemble code for space. Science 261, 1055-1058.
Wilson, M. A., and McNaughton, B. L. 1994. Reactivation of hippocampal ensemble memories during sleep. Science 265, 676-679.
Worden, R. 1992. Navigation by fragment fitting: A theory of hippocampal function. Hippocampus 2, 165.
Zipser, D., and Andersen, R. A. 1988. A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature (London) 331, 679.
Received November 30, 1994; accepted April 3, 1995.
Communicated by Laurence Abbott
A Neural Model of Olfactory Sensory Memory in the Honeybee's Antennal Lobe
We present a neural model for olfactory sensory memory in the honeybee's antennal lobe. To investigate the neural mechanisms underlying odor discrimination and memorization, we exploit a variety of morphological, physiological, and behavioral data. The model allows us to study the computational capacities of the known neural circuitry, and to interpret under a new light experimental data on the cellular as well as on the neuronal assembly level. We propose a scheme for memorization of the neural activity pattern after stimulus offset by changing the local balance between excitation and inhibition. This modulation is achieved by changing the intrinsic parameters of local inhibitory neurons or synapses.

1 Introduction
Honeybee foraging behavior is based on discrimination among complex odors, which is the result of a memory process involving extraction and recall of "key-features" representative of the plant aroma (for a review see Masson et al. 1993). The study of the neural correlates of such mechanisms requires a determination of how the olfactory system successively analyzes odors at each layer of the network [namely, receptor cells, antennal lobe interneurons and glomeruli, mushroom bodies (Fig. 1)], and then a comparison of the successive "olfactory images." Thus far, all studies suggest the implication of both antennal lobe and mushroom bodies in these processes. The specific associative components of the olfactory memory trace would be located in the mushroom bodies, whereas the antennal lobe would be the location of noise reduction and feature extraction performed on the essentially unstable and fluctuating olfactory signal as transmitted by the receptor cells.

Present address: Department of Psychology, Rm 1446, Harvard University, 33 Kirkland Street, Cambridge, MA 02138.

Neural Computation 8, 94-114 (1996)
© 1995 Massachusetts Institute of Technology
Figure 1: Schematic representation of the organization of the bee antennal olfactory pathway. The olfactory pathway is essentially composed of three layers of processing: the receptor cell layer (RC), the antennal lobe layer (AL), and the mushroom bodies (MB). Feedforward connections exist between RC and AL layers and between AL and MB layers (see Masson et al. 1993).
To investigate how and where odor signals may be transformed to produce codes that are suitable for storage in memory, we have recently proposed a neural model that exploits a huge variety of data related to the antennal lobe, including morphological, physiological as well as modeling results (Fonta et al. 1993; Masson et al. 1993; Sun et al. 1993; Kerszberg and Masson 1995). Our model (Linster et al. 1994; Linster and Masson 1994; Masson and Linster 1995) permits us to study the computational capacities of the neural circuitry in the antennal lobe and to investigate a number of features concerning odor discrimination and feature detection in the antennal lobe layer. Associative learning experiments in honeybees using olfactory cues (Menzel 1983) have shown that CS (conditioned stimulus)-US (unconditioned stimulus) associations are established even under conditions when the CS is turned off for a few seconds before onset of the US. Menzel (1983) therefore concludes that the olfactory stimulus (CS) is kept in a temporary sensory store for later association with the US. We propose to investigate the possible neural mechanisms underlying this temporary sensory store (which we call sensory memory), based on the idea that (1) it is an active mechanism, (2) it is situated in the first layer of olfactory processing, the antennal lobe, and (3) it involves no localized synaptic changes in the antennal lobe neural network. In agreement with several more abstract proposals for short-term memory implementation in neural networks (Abbott 1990; Zipser 1991), we propose a modulation of the
Christiane Linster and Claudine Masson
neuronal dynamics that allows us to store a memory trace of the neural activities in the network even after stimulus offset.
2 Olfactory Circuitry and Odor Processing in the Model
The model is essentially built to investigate the computational capacities of the neural circuitry in the antennal lobe, to interpret the role of different classes of interneurons that have been morphologically described (Fonta et al. 1993 and Fig. 2), and to make predictions about odor decoding and feature detection at this stage of the olfactory network. In the following, we will describe the different neuron classes in the antennal lobe and their representation in the model (see Figs. 1 and 3), as well as the odor processing performed by the model. In the honeybee, as postulated in insects and in vertebrates for nonpheromonal stimuli (e.g., food odors), due to a certain overlap of receptor cell responses (each neuron responding with a different degree to a range of odor molecules), the peripheral representation of an odor stimulus is mainly represented in an across fiber code (Vareschi 1971; Akers and Getz 1993). Vareschi (1971) proposed a classification of receptor cells into 10 functional groups, based on their responses to pure odorants; the spectra of responses of these 10 groups have been shown to be less overlapping than the response spectra within groups. In the model, we introduce 5 types of receptor cells with overlapping molecule spectra, each representing a large number of receptor cells with similar molecular sensitivity. Pure and mixed odors are modeled by way of 10 different molecule classes. The synaptic contacts between sensory neurons and antennal lobe neurons, as well as the synaptic contacts between antennal lobe neurons, are localized in areas of high synaptic density (Gascuel and Masson 1991), the antennal lobe glomeruli; each glomerulus representing an identifiable morphological neuropilar subunit (165 for the worker honeybee) (Arnold et al. 1985). Mobbs (1984) has shown that receptor cell axons generally terminate in a single glomerulus.
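As a concrete illustration of such an across-fiber code, the sketch below gives 5 receptor cell types overlapping sensitivity spectra over the 10 molecule classes; only these two counts come from the text, while the Gaussian shape of the spectra and its width are our assumptions.

```python
import numpy as np

N_MOLECULES = 10     # molecule classes used to model pure and mixed odors
N_TYPES = 5          # receptor cell types with overlapping spectra

# Preferred molecule class of each receptor type, spread over the molecule
# axis; Gaussian tuning with width 2 is an illustrative assumption.
preferred = np.linspace(0, N_MOLECULES - 1, N_TYPES)
molecules = np.arange(N_MOLECULES)
spectra = np.exp(-(molecules[None, :] - preferred[:, None]) ** 2 / (2 * 2.0 ** 2))

def receptor_response(odor):
    """Across-fiber pattern: each type sums its sensitivities to the molecules present."""
    return spectra @ odor

pure = np.zeros(N_MOLECULES)
pure[3] = 1.0                      # a pure odorant: molecule class 3 only
mix = pure.copy()
mix[7] = 1.0                       # a binary mixture: classes 3 and 7
print(receptor_response(pure))
print(receptor_response(mix))
```

Even a pure odorant activates several receptor types to different degrees, so the stimulus identity is carried by the whole pattern rather than by a single labeled line.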
In the model, the number of glomeruli is reduced to 15, each representing a glomerular region that receives input from the same receptor cell type. Local interneurons (LIN) constitute the majority (90%) of antennal lobe neurons intracellularly sampled, and there is evidence that a majority of them are inhibitory (Sun et al. 1993). Generally, receptor cells are supposed to synapse mainly with LINs; however, it has been shown in the cockroach that receptor cells synapse on nonidentified neural elements that are not GABA responsive (Kirn and Boeckh 1994) and that such neural elements synapse on output neurons (Malun 1991a,b). In addition to this, the high level of excitation on the responses of output neurons (ON) to olfactory stimulation suggests that local excitation (spiking or nonspiking
Figure 2: Schematic representation of the four main types of antennal lobe neurons morphologically identified in the bee. Local interneurons: (A) Homo LIN (not considered in the simulations presented here): homogeneous neuron branching in many glomeruli; (B) Hetero LIN (localized interneuron in the model): also pluriglomerular with a high dendritic arborization in one particular glomerulus. Output neurons: (C) Uni ON (localized output neuron in the model): the axon of which conveys the processed information from only one glomerulus; (D) Pluri ON (not represented in the model): pluriglomerular. From Sun et al. (1993), with permission of Chemical Senses.

local interneurons, or modulation of the local excitability, this remains an open question) also exists. All LIN are pluriglomerular, but the majority of them, namely Hetero LIN, differ from the others, Homo LIN, by a high density of branching in one particular glomerulus (Fig. 2). Morphological studies of their dendritic arborizations (Fonta et al. 1993) suggest, in comparison with precise studies of synaptic distributions on dendritic arborizations of antennal lobe neurons in the cockroach (Malun 1991a,b), that Hetero LINs have input and output synapses in their principal glomerulus, whereas they might have mainly output synapses in all other glomeruli they invade.
In the simulations presented here, only localized interneurons (which correspond to Hetero LINs) are represented. The simulations have proven that an excitatory element local to each glomerulus is necessary to account for the neural response patterns observed in intracellular recordings: we thus introduce a class of nonspiking excitatory localized interneurons that have dendritic arborizations (input and output synapses) restricted to one glomerulus. Inhibitory localized interneurons have a dense arborization (input and output synapses) in one glomerulus and sparse arborizations in all others; as we have shown before (Linster et al. 1994), these distributed synapses should be mainly output synapses to guarantee a lateral inhibition mechanism. The synaptic coefficients of these lateral inhibitory connections decrease linearly with the distance between the point of summing of activation (the principal glomerulus of the LIN) and the glomeruli with which the interaction takes place; similarly, the transmission delays of these interactions increase with the distance of the interactions. Similarly to the LINs, a part of the ONs have dendrites invading only one glomerulus (Uni ON), whereas the others (Pluri ON) are pluriglomerular. The axons of both types of ONs project to various areas of the protocerebrum, especially onto the mushroom body interneurons (Fonta et al. 1993). In the model, only localized output neurons, which correspond to Uni ONs, are represented; they connect only to local interneurons, and do not receive direct input from receptor cells. Thus, each glomerulus integrates

1. input from one type of receptor cell,
2. local excitation provided by its local excitatory interneuron,
3. local inhibition provided by its associated inhibitory interneuron, and
4. lateral inhibition coming from neighboring glomeruli provided by inhibitory interneurons associated to the neighboring glomeruli.
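The four inputs integrated by a glomerulus can be summarized in a single linear expression; the gains below are illustrative placeholders, not the model's fitted parameters.

```python
import numpy as np

def glomerulus_input(g, receptor, exc, inh, W_lat):
    """Net input to the output neuron of glomerulus g; gains are illustrative."""
    return (receptor[g]          # 1. drive from its receptor cell type
            + 0.5 * exc[g]       # 2. local excitation (nonspiking excitatory LIN)
            - 0.4 * inh[g]       # 3. local inhibition from its associated LIN
            - W_lat[g] @ inh)    # 4. lateral inhibition from neighboring LINs

# Two-glomerulus example: glomerulus 0 strongly driven, glomerulus 1 weakly.
receptor = np.array([1.0, 0.2])
exc = np.array([0.6, 0.0])           # local excitatory interneuron activities
inh = np.array([0.5, 0.3])           # inhibitory interneuron activities
W_lat = np.array([[0.0, 0.1],
                  [0.1, 0.0]])       # lateral inhibition strengths
print(glomerulus_input(0, receptor, exc, inh, W_lat))
```

The lateral term uses the inhibitory interneuron activities of the other glomeruli, matching item 4 above.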
Stimulation of the receptor cells by a subset of the 10 molecules triggers several phenomena:

1. due to the excitatory elements (which feed back onto each other) local to each glomerulus, an activated glomerulus tends to enhance the activation it receives from the receptor cells,
2. the local inhibitory elements are activated (with a certain delay) by the receptor cell activity and by the self-activation of the local excitatory elements, and
3. due to the lateral inhibitory connections, these tend to inhibit neighboring glomeruli.
These phenomena result in a competition between active glomeruli: during a number of sampling steps, the output activity of each glomerulus (represented by the firing probabilities of its associated output neurons) oscillates from high activity to low activity. Due to the competition provided by the lateral inhibition, the spatial activity pattern in the glomerular layer changes over time, and a stable activity map is reached eventually. A number of glomeruli "win" and stay active, whereas others "lose" and are silent. Figure 4A shows the evolution of the glomerular activities after odor presentation. For each sampling step of 2 msec, the average firing probabilities of the output neurons associated with each of the glomeruli are shown. The last pattern shows the stabilized activity map resulting from the competition between close glomeruli.

Figure 3: Organization of the model olfactory circuitry. In the model, we introduce five types of receptor cells with overlapping molecule (M) spectra; each receptor cell type has its maximal spiking probability P for the presence of one molecule i. The axons of the different receptor cell types project into distinct regions of the glomerular layer. All allowed connections (as described in the text) exist with the same probability, but with different connection strengths and transmission delays. The output of each glomerulus is represented by its associated output neurons. (For simulation parameters, see the Appendix.)
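A toy rate-based version of this competition (our abstraction, not the authors' spiking simulator): 15 glomeruli with local excitatory feedback and linearly distance-decaying lateral inhibition, driven by an odor that excites three glomeruli. The gains and drive values are chosen for illustration.

```python
import numpy as np

N = 15                                   # glomeruli, as in the model
g_exc = 0.5                              # local excitatory feedback gain (assumed)
idx = np.arange(N)
dist = np.abs(idx[:, None] - idx[None, :])
W_lat = 0.04 * (1 - dist / N)            # lateral inhibition decays linearly with distance
np.fill_diagonal(W_lat, 0.0)             # the diagonal is local, not lateral, inhibition

# Receptor drive: weak background plus an odor exciting glomeruli 3, 4, and 11.
drive = np.full(N, 0.08)
drive[[3, 4, 11]] = [1.0, 0.9, 0.95]

a = np.zeros(N)                          # output firing probabilities per glomerulus
for _ in range(200):                     # 2-msec sampling steps
    a = np.clip(drive + g_exc * a - W_lat @ a, 0.0, 1.0)

print("winners:", np.flatnonzero(a > 0.9), "silent:", np.flatnonzero(a < 0.1).size)
```

The driven glomeruli saturate while the lateral inhibition silences the rest, reproducing the win/lose stabilization described in the text.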
Figure 4: Odor processing in the model. (A) Stabilization of spatial activity patterns. For a number of sampling steps (2 msec), the average firing probabilities (values between 0 and 1) of the ONs associated to each glomerulus are shown. After stimulation all glomeruli are differentially activated. Lateral inhibition silences all ONs at step 2 on the presented diagram. During the next sampling steps, the competition between glomeruli due to the lateral inhibition and to the local excitation can be observed. Around step 9, the final activity pattern begins to emerge. The last pattern shows the stabilized activity pattern which results from the stimulation. (B) Evolution of the firing probabilities of individual output neurons after stimulation. As an illustration, the temporal evolution of the average firing probabilities (ranging from zero to one) of the ONs associated to some glomeruli are traced. After stabilization of the activity map, ONs are either silent or active. After stimulus offset, the activities recover their spontaneous activation level after ca. 15 msec. (Stimulus onset and offset are indicated by arrows.)
The neural code read by the next layer (e.g., the mushroom bodies) of the olfactory network is represented by the across fiber pattern of the activities of the output neurons. The activities of individual output neurons follow the general pattern described above: oscillation of the activity during a number of sampling steps until the activity "settles" down to a stable value [Fig. 4B shows the firing probabilities of some output neurons in response to the same stimulation as that used in Fig. 4A; Fig. 5a shows the output activity (action potentials and membrane potential) of several LINs and ONs in response to different odor stimuli]. A stable activity can either be a constant firing probability or a "stable" oscillation of the firing probability. An output neuron associated to a particular glomerulus may be active in response to a particular odor quality, and silent for others. The simulations suggest two main conclusions concerning the neural circuitry in the antennal lobe: (1) local excitation should be present in the glomeruli and (2) the particular, heterogeneous arborization pattern of the localized interneurons is closely related to their functional role: to provide lateral inhibition between glomeruli. To ensure the stability of the system (i.e., the network can be activated only by external input, not by its internal noise), a basic condition has to be observed: the sum of excitation and inhibition arriving at the excitatory interneuron from other interneurons is lower than its saturation threshold. After stabilization of an activity pattern in response to stimulation, a maximal signal-to-noise ratio can be obtained if the total input to activated interneurons exceeds their saturation threshold; in this case all activated interneurons fire at their maximal frequency (see the Appendix, Section 4 for derivation). For adjustment of parameters, we use two scales of observation:

- Activities of individual neurons can be compared to intracellular recordings (spontaneous activities, response latencies, average firing frequencies) (Sun et al. 1993); this permits us to adjust intrinsic parameters such as membrane time constants, spiking thresholds, and synaptic transmission delays (Fig. 5a).
- Statistical distributions of simple neural response patterns (excitation, inhibition, or no-response) in response to stimulations can be compared to quantitative descriptions of electrophysiological data (Sun et al. 1993); this permits us to adjust "wiring" parameters such as connection strengths and the decay of connection effects with respect to distance of signal transmission (see Fig. 5b for details).
The phenomena described above arise for a large range of parameters (observing the conditions described in the Appendix, Section 4); however, the average number of neurons that are active or inhibited for each stimulation pattern is determined by the balance between local excitation and lateral inhibition in each glomerulus. These phenomena are described in more detail in Masson and Linster (1995) and Linster et al. (1994). In the simulations described here, the average values of the set of parameters
Figure 5a: Validation of parameters by comparison to experimental data. Statistical distribution of response patterns to olfactory stimulation in model neurons (A) and antennal lobe neurons (B). A: In the 10-dimensional odorant space, all combinations of binary odors (1024) were used. For each stimulus (625 msec simulation time), the membrane potentials of LINs and ONs associated to each of the 15 glomeruli were averaged. For LIN and ON populations spontaneous activity was averaged over 625 msec with no stimulation, and all response amplitudes were normalized with respect to the maximal response amplitude (excitation and inhibition considered separately) obtained for all stimulations. A significant excitatory response was detected when the increase of activation with respect to the spontaneous activity exceeded 10%; a significant inhibitory response was detected when the decrease of activation with respect to spontaneous activity exceeded 10%. The graph shows the average number of excitatory, inhibitory, and nonresponse patterns for both LINs and ONs for all stimulations. B: Percentage of nonresponse, excitation and inhibition patterns recorded from two classes of antennal lobe neurons (Hetero LIN and ON) (Sun et al. 1993). These results represent a quantitative representation of data obtained by intracellular recordings of 1-11 antennal lobe neurons, with 7 different stimulations (as described in Fig. 8) for each recording.
derived by comparison to the experimental data are constant; all real values are chosen in a random distribution (±10% around the average value). We will show in the next section how modulation of the lateral inhibition strength provides a simple and efficient scheme for sensory memory.
Model of Olfactory Sensory Memory in the Honeybee
Figure 5b: Validation of parameters by comparison to experimental data. Individual neural activities of neurons in the model (A) and antennal lobe neurons (B). A: Individual LIN (1-2) and ON (1-2) activities in response to two different odor stimulations (O1 and O2). Stimulus onset and offset are indicated by arrows. Local interneurons are mainly activated (increased spiking frequency) by stimulation; their temporal response patterns vary with the stimulation. Output neurons are either activated or inhibited by stimulation. In the model, mean background activities (in absence of stimulation) are 12.5 spikes/sec for ONs and 4.2 spikes/sec for LINs. B: Intracellular recording from a Hetero LIN responding to pure components and their binary and ternary mixtures with varying response profiles (Sun et al. 1993). Mean background activities recorded are 5.9 spikes/sec for Hetero LINs and 12 spikes/sec for ONs.
The neural activities in the model may memorize the activity pattern even after stimulus offset for a short period of time.
3 Modeling Sensory Memory: A Problem of Modulation of Inhibition in the Network In a foraging situation, a trace of the olfactory stimulus has to be maintained until either a positive (food is found) or a negative (no food is found) reinforcement stimulus is received. Here, we predict that this trace can be established in the antennal lobe by memorizing the neural activity pattern triggered by the stimulus even after stimulus offset. This memorization of the neural activity pattern is achieved, in the model, by modulation of the lateral inhibition strength between glomeruli, i.e., by perturbation of the local balance between excitation and inhibition: 1. Before stimulus offset, the neurons local to one glomerulus are either excited or inhibited.
2. After stimulus offset, due to the membrane time constant, each neuron retains its activity level for several milliseconds. 3. If during that period (or at any moment after stabilization of the activity pattern) the lateral inhibition strength is cut off or considerably decreased, the local excitatory elements "take over": in those glomeruli that have been activated by the stimulus the local excitation will enhance this activation, whereas those glomeruli that have been inhibited will stay silent. 4. The neural activity of all neurons local to one glomerulus is sustained after stimulus offset, due to the local excitatory elements, which feed back to each other. 5. When the lateral inhibition strength recovers its original value, its local effects become stronger than those of the excitatory elements and the global activity level returns to its spontaneous level. Two conditions on the choice of parameters are necessary for these phenomena to occur: (1) after stabilization of the activity map in response to an odor input, active excitatory elements should be driven into saturation (as explained above, the total input to these elements has to be higher than their saturation threshold), and (2) the coefficient of the positive feedback connection on excitatory elements has to be higher than their saturation threshold (plus the local inhibition) (see the Appendix, Section 4, for derivation). This means that in the absence of lateral inhibition, the equation governing the evolution of the excitatory interneurons becomes unstable for those excitatory neurons that fire at their maximal spiking frequency, and thus keeps them in saturation even in the absence of external input. Excitatory elements that are inhibited do not send positive feedback onto themselves; they stay silent and cannot activate themselves.
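The five steps above can be illustrated with a toy simulation (a sketch with illustrative values, not the paper's fitted parameters): one excitatory unit with quasilinear firing and positive self-feedback, plus a lateral inhibition term that is switched off after stimulus offset and later restored.

```python
# Toy illustration of sensory memory by modulation of lateral inhibition.
# One excitatory unit with self-feedback w_E; all values are illustrative.

def firing(v, theta_min=0.0, theta_max=4.0):
    """Quasilinear firing probability between the two thresholds."""
    return max(0.0, min(1.0, (v - theta_min) / (theta_max - theta_min)))

def simulate(w_E=6.0, w_lat=-3.0, tau=5.0, dt=2.0, steps=60):
    v, E = 0.0, 0.0
    trace = []
    for t in range(steps):
        stim = 5.0 if t < 10 else 0.0                  # stimulus on for 10 steps
        lat = 0.0 if 15 <= t < 35 else w_lat           # inhibition cut at t = 15..34
        e_total = stim + w_E * E + lat                 # lateral source assumed fully active
        v = v * (1.0 - dt / tau) + e_total * dt / tau  # leaky integration
        E = firing(v)
        trace.append(E)
    return trace

trace = simulate()
# saturated during the stimulus, still saturated while inhibition is off,
# back to silence once the lateral inhibition has recovered
```

The unit saturates under the stimulus, stays saturated through its own feedback while the inhibition is removed, and falls silent again once the inhibition recovers, which is exactly the bistability that steps 3-5 rely on.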
Figure 6 shows the activity of several output neurons during stimulus application (experience with stimulus A) and their evolution during sensory storage (experience with stimulus B). Right after stimulus application, all glomeruli are active, competing for a few sampling steps, until some glomeruli "win" and stay active, whereas others "lose." After stimulus offset, in normal conditions, the activity returns to its spontaneous level after 4-8 msec (experience A). In experience B, the value of the lateral inhibition strength is set to zero shortly after stimulus offset (2 msec), and begins to increase toward its original value after 24 msec. Due to the local excitatory elements, each glomerulus tends to enhance its activity: active glomeruli stay active, whereas glomeruli that have been inhibited by the glomerular competition stay silent. Figure 7 shows the evolution of all glomerular activities starting 4 msec (two sampling steps) before stimulus offset. In Figure 7A, no modulation of the lateral inhibition occurs, whereas in Figure 7B, the lateral inhibition strength is set to zero 2 msec after stimulus offset and begins to slowly increase toward its original value after 24 msec.
[Figure 6: average firing activity and ON membrane potential traces (ON1-ON4); see caption below.]
Figure 6: Memorization of the neural activity pattern due to modulation of the lateral inhibition strength. The temporal evolution of the average firing activities (upper trace) and the membrane potentials of ONs (ON1-ON4) associated to four glomeruli are traced. In experience A, no modulation of the lateral inhibition is performed; after stimulus offset, the neural activity goes back to its spontaneous activity level. In experience B (the qualitative evolution of the lateral inhibition is shown below), the lateral inhibition strength is set to zero 2 msec after stimulus offset, and slowly increases toward its original value after 24 msec. The ON activity pattern is memorized while the lateral inhibition is low, and tends to disappear when the lateral inhibition increases. The activities return to the spontaneous activity level when the lateral inhibition recovers its original value. (Stimulus onset and offset are indicated by arrows.) In the model, the decrease of the lateral inhibition strength can be achieved by decreasing the synaptic efficacy of the localized inhibitory interneurons or by increasing their spiking threshold.
4 Discussion The model that has been described makes a number of predictions concerning odor processing and sensory memory in the bee antennal lobe neural network. With respect to odor processing, intracellularly recorded responses to odor mixtures are in general very complex and difficult to interpret from the responses to single odor components (Sun et al. 1993, and Fig. 8). A
Christiane Liiister and Claudine Masson
106
[Figure 7: glomerular activity diagrams (glomeruli 1-15) with panel labels "Stimulus offset," "Lateral inhibition = 0," and "Begin of recovery of lateral inhibition"; see caption below.]
Figure 7: Evolution of the glomerular activity pattern after stimulus offset with and without modulation of the lateral inhibition strength. The evolution of the average firing probabilities (each diagram represents 2 msec) of the ONs associated to each glomerulus is shown, starting 4 msec before stimulus offset. (A) The activity pattern slowly disappears after stimulus offset. (B) The lateral inhibition strength is set to zero 2 msec after stimulus offset and starts to increase toward its original value after 24 msec. The neural activity pattern of the glomerular ONs is memorized after stimulus offset.
tendency to select particular odor-related information is expressed in the category of localized antennal lobe neurons, both LINs and ONs. In contrast, both global LINs and ONs are often more responsive to mixtures than to single components. This might indicate that the related localized glomeruli represent functional subunits that are particularly involved in the discrimination of some key features (Masson et al. 1993). In addition to single cell recordings, the study of the spatial distribution of odor-related activity evidenced by 2DG suggests that odor qualities with different biological meaning might be decoded according to separate spatial maps sharing a number of common processing areas (Masson et al. 1993; Nicolas et al. 1993). Our model suggests the decoding of the olfactory stimuli in spatial maps of activity spanning the whole glomerular layer; it allows us to understand the spatial activity distribution as a function of single cell responses. Recent data in the locust antennal lobe (Laurent and Davidowitz
Figure 8: Spike rate histograms before, during, and after stimulation of two antennal lobe output neurons. Antennal lobe neurons have been recorded intracellularly during olfactory stimulation of the antenna with three pure odors (HEP, 2-heptanone; GER, geraniol; ISO, isoamyl acetate) and their binary and ternary mixtures. Spike rates before (1 sec), during (1 sec), and after stimulation (1 sec) are expressed as a function of the spontaneous rate recorded before stimulation (which corresponds to 100%). (A) Pluriglomerular ON responding with increased spiking frequency to the ternary mixture of the three odorants but almost not to stimulation with single odorants and their binary mixtures; this neuron keeps the high spiking frequency due to the stimulation (G + H + I) for 1 sec after stimulus offset. (B) Uniglomerular ON responding with increased spiking frequency to GER, with decreased spiking frequency to HEP, and with various degrees of excitation to the binary and ternary mixtures. The response to GER exhibits a long-lasting excitation after stimulus offset, whereas the response to HEP exhibits a long-lasting inhibition. From Sun (1991).
1994) strongly suggest the representation of an odor by an assembly of coherently firing antennal lobe neurons. Because antennal lobe neurons are generally activated by several odors (in these experiments, complex food odors were used as olfactory stimuli), the assemblies that encode different odors can overlap. Interestingly, models of the vertebrate olfactory bulb (Li and Hopfield 1989; Li 1990; Erdi et al. 1993), which compares
to the antennal lobe, while implementing a different neural circuitry, predict the same type of odor processing in the glomerular layer. With respect to sensory memory, in the honeybee, cooling experiments combined with single trial learning have shown that cooling of the antennal lobe later than 2 min, the α-lobes of the mushroom bodies later than 3 min, and the calyces later than 5 min did not impair memory formation (see Menzel 1983, 1984 for review). This indicates that the early memory traces may be located in the antennal lobe, as proposed by the model. Furthermore, these results suggest a hierarchical transmission of the olfactory images, where the memory traces are established at different layers of processing at different times. Our model of sensory memory will allow us to explore a number of features concerning the transfer of the olfactory images from the antennal lobe and its sensory store to a more permanent associative memory device, presumably located in the mushroom bodies. The model suggests a uniform modulation of inhibition strength in the antennal lobe as a basis of sensory memory. This implies the presence of neuromodulator circuits, which would be controlled by higher order brain centers. In our experiments, intracellular recordings have evidenced the existence of long-lasting excitation after stimulus offset in some antennal lobe neurons (Fonta et al. 1989 and Fig. 8). The precise conditions of occurrence of these phenomena, as well as their dependence on the presence of specific neuromodulators, will be investigated in a new set of experiments. The presence of feedback circuits between the mushroom bodies and the antennal lobe interneurons has been suggested before (Masson 1977; Erber 1981), as well as their importance for memory formation.
In addition, the localization of several neurotransmitters in the bee brain has been evidenced (for a review see Bicker 1993) by use of neurochemical tools; the functional study of these neurotransmitters is being undertaken (unpublished data). The modeling approach combined with new experiments will help us to elucidate the role of these neurotransmitters, allowing us in a unique way to integrate elements of knowledge coming from converging experimental and theoretical approaches. Appendix: Implementation and Simulation Parameters
1. Neurons. The different neuron populations associated with each glomerulus are represented in the simulations by one unit (each unit is governed by one difference equation). All connection weights and transmission delays are chosen randomly around a mean value. In discrete time, the fluctuation of the membrane potential around the resting potential, due to input e_i(t) at its postsynaptic sites, is expressed as
v_i(t + Δt) = v_i(t)(1 − Δt/τ_i) + e_i(t) Δt/τ_i
where τ_i is the membrane time constant, Δt is the sampling interval, and e_i(t) is the total input to neuron i at time t. The firing probability P[x_i(t) = 1] that the state x_i(t) of neuron i at time t is 1 is given by a quasilinear function of the neuron membrane potential v_i(t) at time t, where the lower threshold Θ_min determines the amount of noise, and the upper threshold Θ_max determines the value of the membrane potential for which the maximal firing probability is reached:
P[x_i(t) = 1] = 0 if v_i(t) < Θ_min; (v_i(t) − Θ_min)/(Θ_max − Θ_min) if Θ_min ≤ v_i(t) ≤ Θ_max; 1 if v_i(t) > Θ_max.
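A sketch of the membrane update and the quasilinear firing rule in code; the leaky-integrator discretization and the exact clipping at the thresholds are our reading of the equations above:

```python
import random

THETA_MIN, THETA_MAX = -0.1, 4.0   # threshold values from Section 5

def membrane_update(v, e_total, tau=5.0, dt=2.0):
    """One discrete-time step of the leaky membrane integrator."""
    return v * (1.0 - dt / tau) + e_total * dt / tau

def firing_probability(v, theta_min=THETA_MIN, theta_max=THETA_MAX):
    """Quasilinear probability that the state x_i(t) is 1."""
    if v < theta_min:
        return 0.0
    if v > theta_max:
        return 1.0
    return (v - theta_min) / (theta_max - theta_min)

def sample_state(v, rng=random.random):
    """Draw the binary state x_i(t) from the firing probability."""
    return 1 if rng() < firing_probability(v) else 0
```

Below Θ_min the unit is silent, above Θ_max it fires at every step, and in between the firing probability grows linearly with the membrane potential.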
The value of the transmission delay associated with each synapse is chosen randomly around a given average value; it is meant to model all sources of delay, transduction, and deformation of the transmitted signal from the cell body or dendrodendritic terminal of neuron j to the receptor site of neuron i. The mean value of the delay distribution is longer for inhibition than for excitation: we thereby take into account approximately the fact that IPSCs usually have slower decay than EPSCs, and may accumulate to act later than actually applied. 2. Molecule Arrays and Receptor Cells. Odorants are represented in a ten-dimensional, discrete odorant space; a stimulation corresponds to a particular point in this space. R receptor cells are differentially sensitive to all M molecules: each receptor cell has a maximal (1.0) sensitivity to one molecule (center of the gaussian sensitivity curve); its sensitivity to surrounding molecules is given by a gaussian function with width 1. Each receptor cell projects onto a subset of N glomeruli with an afferent weight w_R. 3. Interneurons. Local excitatory elements are local to one glomerulus; they send excitatory input to all localized inhibitory interneurons and projection neurons associated to that glomerulus, and they receive input from receptor cells and from all inhibitory interneurons sending input (local and lateral) to their glomerulus:
e_j(t) = w_R R_j(t) + w_I I_j(t − τ_I) + Σ_{g≠j} w_{Ig} I_g(t − τ_{Ig}) + w_E E_j(t − τ_E)
where e_j is the input to the excitatory neuron associated to glomerulus j; R_j is the output of the receptor cell projecting to glomerulus j; I_j is the output of the localized inhibitory interneuron whose high branching pattern is in glomerulus j; w_I is its connection strength and τ_I is its transmission delay;
I_g are the outputs of the neighboring localized inhibitory interneurons associated to glomeruli g; w_{Ig} are their connection strengths and τ_{Ig} their transmission delays; E_j is the recurrent input of the excitatory element onto itself, w_E is its connection strength and τ_E its transmission delay. Localized inhibitory interneurons are associated with one glomerulus: they send lateral inhibitory input to inhibitory and excitatory interneurons in neighboring glomeruli as well as local inhibition to the neurons in their principal glomerulus; they receive input from the receptor cells projecting onto their principal glomerulus and from the local excitatory elements in that glomerulus, as well as lateral inhibition input from surrounding glomeruli:

i_j(t) = w_R R_j(t) + w_E E_j(t − τ_E) + Σ_{g≠j} w_{Ig} I_g(t − τ_{Ig})
where i_j is the input to the localized inhibitory interneuron associated with glomerulus j. Localized output neurons integrate synaptic activity from all interneurons with principal synapses in their associated glomerulus:

o_j(t) = w_I I_j(t − τ_I) + w_E E_j(t − τ_E)
where o_j(t) is the input to the projection neuron associated with glomerulus j. All membrane potentials and outputs are computed according to the equations given above; the values of the time constants and thresholds are given below. 4. Conditions on Parameters. The behavior of the model presented here is largely dominated by the excitatory interneurons and their positive feedback connections. The system is stable (it can be activated up to saturation only by exterior input from receptor cells and not by intrinsic noise) if, in the absence of external input, the total input to the excitatory elements is smaller than their saturation threshold (thus, their firing probability stays smaller than 1):
In the worst case, all presynaptic outputs are at their maximum, I_j(t) = I_g(t) = E_j(t) = 1.0 (the maximal spiking frequency); then

w_I + Σ_{g≠j} w_{Ig} + w_E < Θ_max
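The input equations for the three cell types defined in Section 3 can be collected into a short sketch (argument names are ours; outputs delayed by τ are assumed to be looked up from stored histories at t minus the corresponding delay):

```python
def excitatory_input(R_j, I_j_del, I_g_del, E_j_del, w_R, w_I, w_Ig, w_E):
    """e_j(t): receptor drive, local and lateral inhibition, self-feedback.
    *_del arguments are outputs read at t minus the corresponding delay."""
    lateral = sum(w * I for w, I in zip(w_Ig, I_g_del))
    return w_R * R_j + w_I * I_j_del + lateral + w_E * E_j_del

def inhibitory_input(R_j, E_j_del, I_g_del, w_R, w_E, w_Ig):
    """i_j(t): receptor drive, local excitation, lateral inhibition."""
    lateral = sum(w * I for w, I in zip(w_Ig, I_g_del))
    return w_R * R_j + w_E * E_j_del + lateral

def output_input(I_j_del, E_j_del, w_I, w_E):
    """o_j(t): local inhibitory and excitatory drive onto the projection neuron."""
    return w_I * I_j_del + w_E * E_j_del
```

Since the inhibitory weights w_I and w_{Ig} are negative, inhibition enters the sums with a negative sign without any special casing.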
The signal-to-noise ratio after stabilization of an activity pattern in response to odor stimulation is large if all activated excitatory interneurons reach their saturation threshold; thus

e_j(t) = w_R R_j(t) + w_I I_j(t − τ_I) + Σ_{g≠j} w_{Ig} I_g(t − τ_{Ig}) + w_E E_j(t − τ_E) > Θ_max, with E_j(t) = 1

[where E_j(t) is the output (firing probability) of neuron j at time t], and for inactive excitatory interneurons after stabilization:

e_j(t) = w_R R_j(t) + w_I I_j(t − τ_I) + Σ_{g≠j} w_{Ig} I_g(t − τ_{Ig}) + w_E E_j(t − τ_E) < Θ_min, with E_j(t) = 0
for neurons that will be inhibited after stabilization. If, after stabilization of the activity pattern, lateral inhibition is set to zero, active excitatory interneurons will stay in saturation even after stimulus offset if

w_I I_j(t − τ_I) + w_E E_j(t − τ_E) = w_I + w_E > Θ_max
Inactive excitatory interneurons will stay inactive because E_j(t − τ_E) = 0, and the positive feedback loop is interrupted for these neurons. Thus, the removal of the lateral inhibition creates a positive feedback loop driving the system into saturation for those glomeruli that are activated by the exterior input. Recovery of the lateral inhibition reduces the positive feedback loop and drives the system back into its original, stable state. 5. Simulation Parameters. In all simulations described in the text, the following parameters have been used:
R = 5 receptor cells sensitive to M = 10 different types of molecules. Matrix of receptor cell sensitivities (receptors × odorants):

1.000 0.018 0.000 0.000 0.018 0.368 0.018 0.001 0.368 1.000
0.368 0.001 0.018 0.368 0.000 0.000 0.001 0.001 0.000 0.000
0.000 0.000 0.000 0.018 0.001 0.000 1.000 0.368 0.018 0.018
0.368 1.000 0.000 0.001 0.018 0.001 0.018 0.368 0.000 0.000
0.001 0.001 0.000 0.000 0.368 0.018 0.001 0.368 1.000 0.368
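A sketch of how such a gaussian sensitivity matrix can be generated; the placement of the receptor centers here is an assumption for illustration, not the authors' configuration:

```python
import math

def sensitivity_matrix(centers, n_molecules=10, width=1.0):
    """Rows: receptor cells; columns: molecules. Gaussian tuning of given width."""
    return [[math.exp(-((m - c) / width) ** 2) for m in range(n_molecules)]
            for c in centers]

# five receptors with illustrative centers in the 10-molecule space
S = sensitivity_matrix(centers=[0, 2, 4, 6, 8])
# each receptor responds with 1.0 at its center, exp(-1) = 0.368 one step away,
# and exp(-4) = 0.018 two steps away -- the values that recur in the matrix above
```

The discrete values 1.000, 0.368, and 0.018 in the printed matrix are exactly the width-1 gaussian evaluated at distances 0, 1, and 2 from a receptor's center molecule.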
G = 15 glomeruli; thus each receptor cell projects onto N = 3 neighboring glomeruli with afferent connection strength w_R = 7.5 (±10%); w_E = 6.0 (±10%) is the connection strength of local excitatory elements, τ_E = 2 msec (±10%) their transmission delay; w_I = −1.0 (±10%) is the local inhibitory connection strength, τ_I = 2 msec (±10%) its transmission delay; w_{Ig} = −2.5 (±10%) is the maximal value of the lateral inhibition strength (to the two nearest neighboring glomeruli); this value decays linearly toward a minimal value of w_{Ig} = −0.5 as the distance between glomeruli g and j increases; τ_{Ig} = 6 msec is the shortest transmission delay (between two nearest neighboring glomeruli); the delay increases toward a maximal value of 15 msec as the distance between glomeruli g and j increases.
For all neurons, the values of the thresholds are Θ_min = −0.1 and Θ_max = 4.0; for excitatory local interneurons, Θ_min = 0.01. The membrane time constant is τ = 5 msec for inhibitory LINs and ONs and τ = 8 msec for excitatory LINs. Updating of all neurons is synchronous, with Δt = 2 msec.
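The two parameter conditions of Section 4 can be checked directly against these values (a sketch; the ring arrangement of the 15 glomeruli and the exact linear interpolation of the lateral weights between −2.5 and −0.5 are our assumptions):

```python
THETA_MAX = 4.0
w_E, w_I = 6.0, -1.0   # values from Section 5

def lateral_weights(n_glomeruli=15, w_near=-2.5, w_far=-0.5):
    """Lateral inhibition onto the other 14 glomeruli of an assumed ring,
    decaying linearly from w_near (nearest) to w_far (farthest)."""
    max_dist = n_glomeruli // 2
    weights = []
    for d in range(1, n_glomeruli):
        dist = min(d, n_glomeruli - d)          # ring distance
        frac = (dist - 1) / (max_dist - 1)      # 0 at nearest, 1 at farthest
        weights.append(w_near + frac * (w_far - w_near))
    return weights

w_lat = lateral_weights()

# (1) stability: with every presynaptic output at its maximum 1.0, the total
#     input to an excitatory element stays below the saturation threshold
assert w_E + w_I + sum(w_lat) < THETA_MAX

# (2) memory: with lateral inhibition removed, self-feedback plus local
#     inhibition keeps a saturated excitatory element above threshold
assert w_E + w_I > THETA_MAX
```

Both conditions hold: w_E + w_I = 5.0 exceeds Θ_max = 4.0, while the summed lateral inhibition makes the worst-case total input strongly negative.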
Acknowledgments The authors are thankful to David Marsan for his inspiring ideas, which helped to start this project. They thank Jean-Pierre Nadal, Stefan Knerr, and Brigitte Quenet for constructive criticisms on the manuscript and G. Arnold, G. Dreyfus, J. Gascuel, and M. Kerszberg for valuable discussions.
References Abbott, L. F. 1990. Modulation of function and gated learning in a network memory. Proc. Natl. Acad. Sci. U.S.A. 87, 9241-9245. Akers, R. P., and Getz, W. M. 1993. Response of olfactory receptor neurons in honeybees to odorants and their binary mixtures. J. Comp. Physiol. A 173, 169-185. Arnold, G., Masson, C., and Budharugsa, S. 1985. Comparative study of the antennal afferent pathway of the workerbee and the drone (Apis mellifera L.). Cell Tissue Res. 242, 593-605. Bicker, G. 1993. Chemical architecture of antennal pathways mediating proboscis extension learning in the honeybee. Apidologie 24, 235-248. Erber, J. 1981. Neural correlates of learning in the honeybee. TINS 4, 270-273.
Erber, J., Masuhr, Th., and Menzel, R. 1980. Localization of short-term memory in the brain of the bee, Apis mellifera. Physiol. Entomol. 5, 343-358. Erdi, P., Grobler, T., Barna, G., and Kaski, K. 1993. Dynamics of the olfactory bulb: Bifurcations, learning and memory. Biol. Cybern. 69, 57-66. Esslen, J., and Kaissling, K. E. 1976. Zahl und Verteilung antennaler Sensillen bei der Honigbiene. Zoomorphologie 83, 227-251. Fonta, C., Sun, X. J., and Masson, C. 1991. Cellular analysis of odour integration in the honeybee antennal lobe. In Behavior and Physiology of Bees, L. J. Goodman and R. C. Fisher, eds., pp. 227-241. CAB International, Wallingford, UK. Fonta, C., Sun, X., and Masson, C. 1993. Morphology and spatial distribution of bee antennal lobe interneurons responsive to odours. Chem. Senses 18(2), 101-119. Gascuel, J., and Masson, C. 1991. Quantitative electron microscopic study of the antennal lobe in the honeybee. Tissue Cell 23, 341-355. Hasselmo, M. E. 1993. Acetylcholine and learning in a cortical associative memory. Neural Comp. 5, 32-44. Hasselmo, M. E., Anderson, B. P., and Bower, J. M. 1992. Cholinergic modulation of cortical associative memory function. J. Neurophysiol. 67(5), 1230-1246. Kerszberg, M., and Masson, C. 1995. Signal induced selection among spontaneous activity patterns of bee's olfactory glomeruli. Biol. Cybern. 72, 487-495. Laurent, G., and Davidowitz, H. 1994. Encoding of olfactory information with oscillating neural assemblies. Science 265, 1872-1875. Li, Z. 1990. A model of olfactory adaptation and sensitivity in the olfactory bulb. Biol. Cybern. 62, 349-361. Li, Z., and Hopfield, J. J. 1989. Modeling the olfactory bulb and its neural oscillatory processings. Biol. Cybern. 61, 379-392. Linster, C., and Masson, C. 1994. Odor processing in the honeybee's antennal lobe glomeruli: Modelling sensory memory, accepted for publication. In Computational Neural Systems, F. H. Eeckman, ed., Kluwer Academic Publishers, Boston.
Linster, C., Marsan, D., Masson, C., and Kerszberg, M. 1994. Odor processing in the bee: A preliminary study of the role of central input to the antennal lobe. In Advances in Neural lnformatian Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 527-534. Morgan Kaufmann, San Mateo, CA. Malun, D. 1991a. Inventory and distribution of synapses of identified uniglomerular projection neurons in the antennal lobe of Periplaneta americana. J. Comp. Neurol. 305, 348-360. Malun, D. 1991b. Synaptic relationships between GABA-immunoreactive neurons and an identified uniglomerular projection neuron in the antennal lobe of Periplaneta americana: A double labeling electron microscopic study. Histochemistry 96, 197-207. Masson, C. 1977. Central olfactory pathways and plasticity of responses to odor stimuli in insects. In Olfaction and Taste Vl, J. Le Magnen and P. MacLeod, eds., pp. 305-314. IRL, London.
Masson, C., and Linster, C. 1995. Towards a cognitive understanding of odor discrimination. Behavioral Processes, Vol. 35 (in press). Masson, C., and Linster, C. 1994. Towards a cognitive understanding of odor discrimination. Behav. Processes, Special issue: Cognition and Evolution (in press). Masson, C., Pham-Delegue, M. H., Fonta, C., Gascuel, J., Arnold, G., Nicolas, G., and Kerszberg, M. 1993. Recent advances in the concept of adaptation to natural odour signals in the honeybee Apis mellifera L. Apidologie 24, 169-194. Menzel, R. 1983. Neurobiology of learning and memory: The honeybee as a model system. Naturwissenschaften 70, 504-511. Menzel, R. 1984. Short-term memory in bees. In Primary Neural Substrates of Learning and Behavioral Change, D. L. Alkon and J. Farley, eds., pp. 259-274. Cambridge University Press, Cambridge. Menzel, R., Michelsen, B., Rüffer, P., and Sugawa, M. 1988. Neuropharmacology of learning and memory in honey bees. In Modulation of Synaptic Transmission and Plasticity in Nervous Systems, G. Hertting and H. C. Spatz, eds., pp. 333-350. NATO ASI Series H19. Mercer, A., and Menzel, R. 1982. The effects of biogenic amines on conditioned and unconditioned responses to olfactory stimuli in the honeybee Apis mellifera. J. Comp. Physiol. 145, 363-368. Michelsen, B. D. 1988. Catecholamines affect storage and retrieval of conditioned odour stimuli in honeybees. Comp. Biochem. Physiol. 91C, 479-482. Mobbs, P. G. 1984. Neural networks in the mushroom bodies of the honeybee. J. Insect Physiol. 30(1), 43-58. Nicolas, G., Arnold, G., Patte, F., and Masson, C. 1993. Regional distribution of [3H]2-deoxyglucose uptake in the worker honeybee antennal lobe. C. R. Acad. Sci. Paris 316, 1245-1249. Sun, X. J. 1991. Caractérisation électrophysiologique et morphologique des neurones olfactifs du lobe antennaire de l'abeille, Apis mellifera. Thèse de Doctorat de l'Université
de Paris-Sud, Centre d'Orsay, France. Sun, X., Fonta, C., and Masson, C. 1993. Odour quality processing by bee antennal lobe neurons. Chem. Senses 18(4), 355-377. Vareschi, E. 1971. Duftunterscheidung bei der Honigbiene. Einzelzell-Ableitungen und Verhaltensreaktionen. Z. Vergl. Physiol. 75, 143-173. Zipser, D. 1991. Recurrent network model of the neural mechanism of short-term memory activity. Neural Comp. 3, 179-193.
Received June 22, 1991, accepted March 15, 1995
Communicated by Richard Lippmann
A Spherical Basis Function Neural Network for Modeling Auditory Space Rick L. Jenison Kate Fissell Department of Psychology, University of Wisconsin, Madison, WI 53706 USA This paper describes a neural network for approximation problems on the sphere. The von Mises basis function is introduced, whose activation depends on polar rather than Cartesian input coordinates. The architecture of the von Mises Basis Function (VMBF) neural network is presented along with the corresponding gradient-descent learning rules. The VMBF neural network is used to solve a particular spherical problem of approximating acoustic parameters used to model perceptual auditory space. This model ultimately serves as a signal processing engine to synthesize a virtual auditory environment under headphone listening conditions. Advantages of the VMBF over standard planar Radial Basis Functions (RBFs) are discussed. 1 Introduction
Artificial neural networks and approximation techniques typically have been applied to problems conforming to an orthogonal Cartesian input space. In this paper we present a neural network operating on a problem in acoustics whose input space is best represented in spherical (or polar), rather than Cartesian, coordinates. The neural network employs a novel basis function, the von Mises function, which is well adapted to spherical input, within the standard Radial Basis Function (RBF) architecture. The primary advantage of a basis on a sphere, rather than on a plane, is the natural constraint of periodicity and singularity at the poles. The RBF network architecture using a single layer of locally tuned units (basis functions) covering a multidimensional space is now well known (e.g., Broomhead and Lowe 1988; Moody and Darken 1988, 1989; Poggio and Girosi 1990). Gradient-descent learning rules akin to backpropagation that move and shape the basis functions to minimize the output error have also been proposed (Poggio and Girosi 1990; Hartman and Keeler 1991). Our network was constructed to synthesize a continuous map of acoustic parameters used to simulate the virtual experience of free-field spatial hearing under headphones. The actual signal processing details Neural Computation 8, 115-128 (1996) © 1995 Massachusetts Institute of Technology
used to synthesize auditory space will for now be deferred, with the main focus being on the description of the spherical neural network and corresponding learning rules. It is anticipated that this neural network will have general application to functional approximation problems on the sphere, such as inverse kinematics of spherical mechanisms (Chiang 1988) and global weather prediction (Wahba and Wendelberger 1980; Ghil et al. 1981). 2 von Mises Basis Function (VMBF) Network
The network basis function is based on a spherical probability density function (p.d.f.) that has been used to model line directions distributed unimodally with rotational symmetry. The function is well known in the spherical inferential statistics literature and commonly referred to as either the von Mises-Arnold-Fisher or Fisher distribution [see Mardia (1972) or Fisher et al. (1987)]. The kernel form of the p.d.f. was first introduced by Arnold (1941) in his unpublished doctoral dissertation. Historically, the function has served as an assumed parametric distribution from which spherical data are sampled; but we are not aware of its use as a basis function in the approximation theory literature. The expression for the von Mises basis function, dropping the constant of proportionality and elevational weighting factor from the p.d.f., is

VM(θ, φ, α, β, κ) = e^{κ[sin φ sin β cos(θ − α) + cos φ cos β]}    (2.1)

where the input parameters correspond to a sample location in azimuth and elevation (θ, φ), a centroid in azimuth and elevation (α, β), and the concentration parameter κ. Application of the von Mises function requires an azimuthal range in radians from 0 to 2π and an elevational range from 0 to π. Any sample (θ, φ) on the sphere will induce an output from each VMBF proportional to the solid angle between the sample and the centroid of the VMBF (α, β). The azimuthal periodicity of the basis function is driven by the cos(θ − α) term, which will be maximal when θ = α. The (sin φ sin β) term modulates the azimuthal term in the elevational plane, hence the requirement that φ range from 0 to π. As the sample elevation or the centroid elevation approaches either pole (0 or π), the multiplicative effect of (sin φ sin β) progressively eliminates the contribution of azimuthal variation and the (cos φ cos β) term dominates. κ is a shape parameter called the concentration parameter: the larger the value, the narrower the function width after transformation by the expansive exponential. While other spherical functions have been proposed for approximation on the sphere [e.g., thin-plate pseudosplines (Wahba 1981)], the VMBF serves as a convenient spherical analogue of the well-known multidimensional gaussian on a plane (see Fig. 1). It behaves in a similar fashion to the planar gaussian with the centroid corresponding
Spherical Basis Function Neural Network
Figure 1: Three-dimensional rendering of two von Mises basis functions positioned on the unit sphere. The basis functions are free to move about the sphere, changing adaptively in position, width, and magnitude.
to the mean and 1/κ corresponding to the standard deviation. It differs from the thin-plate spline in that it has a parameter for controlling the width or concentration of the basis function, which allows the VMBF to focus resolution optimally. The von Mises basis function serves as the activation function of the hidden layer units, conforming to the RBF architecture as shown in Figure 2. The output of the ith output node, f_i(θ, φ), when spherical coordinates are presented as input, is given by
f_i(θ, φ) = Σ_{j=1}^{J} w_ij VM(θ, φ, α_j, β_j, κ_j)   (2.2)

where VM(θ, φ, α_j, β_j, κ_j) is the output of the jth von Mises basis function and w_ij is the weight connecting the jth basis function with the ith output node.
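As an illustration, equations 2.1 and 2.2 can be sketched in a few lines of NumPy. This is a hypothetical reimplementation, not the authors' code; the bias unit shown in Figure 2 is omitted for brevity.

```python
import numpy as np

def vmbf(theta, phi, alpha, beta, kappa):
    """von Mises basis function of equation 2.1."""
    return np.exp(kappa * (np.sin(phi) * np.sin(beta) * np.cos(theta - alpha)
                           + np.cos(phi) * np.cos(beta)))

def forward(theta, phi, centers, kappas, W):
    """Network outputs f_i(theta, phi) of equation 2.2: a weighted sum
    of the J hidden-unit activations (bias unit omitted)."""
    h = vmbf(theta, phi, centers[:, 0], centers[:, 1], kappas)  # shape (J,)
    return W @ h                                                # shape (M,)

# Two basis functions on the equator, one output unit.
centers = np.array([[0.0, np.pi / 2], [np.pi, np.pi / 2]])  # rows (alpha_j, beta_j)
kappas = np.array([4.0, 4.0])
W = np.array([[1.0, -1.0]])
print(forward(0.0, np.pi / 2, centers, kappas, W))  # peak of unit 1 minus antipodal tail of unit 2
```

A sample placed exactly at a centroid drives that unit to its maximum e^κ, and the response decays with the angle between sample and centroid, mirroring the planar gaussian behavior described in the text.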
Rick L. Jenison and Kate Fissell
Figure 2: Architecture of the von Mises Basis Function (VMBF) neural network. 3 Parameter Learning
To optimize the approximation function given a fixed number of basis functions we apply a gradient-descent method on an error function to update the parameters of the network. In our case we require the sum-of-squared error to be minimized. This technique has been applied successfully to gaussian RBF neural networks (Moody and Darken 1989; Poggio and Girosi 1990; Hartman and Keeler 1991) and we derive the analogous equations for the von Mises basis here. The error function for the pth M-dimensional training pattern is

E_p = (1/2) Σ_{i=1}^{M} [t_pi − f_i(θ, φ; Ω_p)]²   (3.1)

where t_pi is the ith element of the pth training pattern. The notation for the network output f_i(θ, φ; Ω_p) includes Ω_p, which is a vector of changing network parameters

Ω = [w_11, …, w_MJ, α_1, …, α_J, β_1, …, β_J, κ_1, …, κ_J]^T   (3.2)
Parameter values are learned through successive presentation of inputoutput pairs using the well-known update rule
,:I
= sy’d
+ ,I
ASI,
(3.3)
where rl corresponds to the learning rate, which can also be updated as learning proceeds. Obtaining AR, involves computing the error gradient by differentiating the error function with respect to each free parameter in the network. These derivations are relatively straightforward using
the chain rule and algebra; however, they are rather involved with respect to the bookkeeping of indices. For convenience Ω_p is omitted from the following derivations. The update expression for w_ij is just the Widrow-Hoff learning rule

Δw_ij = [t_i − f_i(θ, φ)] VM(θ, φ, α_j, β_j, κ_j)   (3.4)

The update expression for the azimuthal movement of the basis center can be derived in a similar fashion

Δα_j = κ_j sin φ sin β_j sin(θ − α_j) Σ_{i=1}^{M} {[t_i − f_i(θ, φ)] w_ij} VM(θ, φ, α_j, β_j, κ_j)   (3.5)

as can the update expression for elevational movement Δβ_j

Δβ_j = κ_j [sin φ cos β_j cos(θ − α_j) − cos φ sin β_j] Σ_{i=1}^{M} {[t_i − f_i(θ, φ)] w_ij} VM(θ, φ, α_j, β_j, κ_j)   (3.6)

and finally the concentration parameter

Δκ_j = [sin φ sin β_j cos(θ − α_j) + cos φ cos β_j] Σ_{i=1}^{M} {[t_i − f_i(θ, φ)] w_ij} VM(θ, φ, α_j, β_j, κ_j)   (3.7)
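The per-pattern update rules can be sketched compactly, with the chain-rule gradients written out in vectorized form. This is hypothetical illustrative code, not the authors' implementation; the derivative expressions follow from differentiating the squared error through equation 2.1.

```python
import numpy as np

def vmbf(theta, phi, alpha, beta, kappa):
    """von Mises basis function of equation 2.1."""
    return np.exp(kappa * (np.sin(phi) * np.sin(beta) * np.cos(theta - alpha)
                           + np.cos(phi) * np.cos(beta)))

def deltas(theta, phi, t, W, alpha, beta, kappa):
    """Per-pattern updates Delta(w, alpha, beta, kappa) for the squared
    error E = 0.5 * sum_i (t_i - f_i)^2, before scaling by the rate eta."""
    h = vmbf(theta, phi, alpha, beta, kappa)   # J basis outputs
    err = t - W @ h                            # residuals t_i - f_i
    back = err @ W                             # sum_i (t_i - f_i) w_ij, shape (J,)
    dW = np.outer(err, h)                      # Widrow-Hoff rule
    dalpha = back * h * kappa * np.sin(phi) * np.sin(beta) * np.sin(theta - alpha)
    dbeta = back * h * kappa * (np.sin(phi) * np.cos(beta) * np.cos(theta - alpha)
                                - np.cos(phi) * np.sin(beta))
    dkappa = back * h * (np.sin(phi) * np.sin(beta) * np.cos(theta - alpha)
                         + np.cos(phi) * np.cos(beta))
    return dW, dalpha, dbeta, dkappa

# One gradient-descent step then applies Omega_new = Omega_old + eta * Delta_Omega.
```

Each returned array is the negative error gradient for the corresponding parameter group, so adding η times it performs the descent step of equation 3.3.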
The degree of improvement realized by the gradient-descent optimization of the parameter vector Ω over a direct matrix pseudoinverse solution of the w_ij alone depends on the particular problem as well as the number of basis functions available. We have typically observed about a 2-fold improvement in the final root-mean-square (RMS) error for networks with a small number of basis functions, which will be discussed for our specific application in the following section. 4 Approximating Auditory Space
The VMBF neural network is ideally suited to the problem of learning a continuous mapping from spherical coordinates to acoustic parameters that specify sound source direction. The spatial position of a sound source in a listener's environment is specified by a number of factors related to the interaction of sound pressure waves with the pinna (external ear), head, and upper body. These interactions can be described mathematically in the form of a linear transformation, commonly referred to
as a "head-related transfer function," or HRTF. HRTFs are measured by sampling the air pressure variation in time near the eardrum as a function of known sound source location. Generally, HRTFs are visualized in the frequency (spectral) domain rather than in the Fourier-equivalent time domain, because it is a more meaningful way of characterizing acoustic processing by the auditory system.² Henceforth, an n-dimensional vector will represent the HRTF, where each element of the vector corresponds to a discrete sample in frequency. Figure 3 illustrates a typical spherical grid of speaker placements used to position sound sources, preferably within an anechoic environment. Studies of human sound localization most often use a coordinate system in degrees with the origin (0°, 0°) located at the intersection of the horizon and the medial sagittal plane directly in front of the listener. We adhere to this convention for the remaining discussion, aware that coordinates must be converted from degrees to radians under the range constraints imposed by the VMBF (equation 2.1). Enforcement of these constraints is accomplished by mapping −180° ≤ θ < +180° to 0 ≤ θ < 2π and −90° ≤ φ ≤ +90° to 0 ≤ φ ≤ π. HRTFs change in complicated, but systematic, ways as a function of sound direction relative to each ear. The HRTFs completely specify the sound source location because the measurements are made near the point where sound is transduced by the auditory system. The measurement and analysis techniques are well developed and have been detailed in the psychoacoustic literature (Wightman and Kistler 1989; Middlebrooks et al. 1989; Pralong and Carlile 1994). One practical use of these measurements is to create the sense of a sound coming from any "virtual" location under a headphone listening condition.
This is accomplished by convolving an HRTF with any digitally recorded sound, effectively simulating the actual spectral filtering of the individual's external ear, and inducing an apparent location of the sound. Using the VMBF neural network to create the parameters for a virtual environment affords the ability to synthesize a set of HRTFs (for each ear) for any location in space, not just the spherical locations where measurements were obtained. Furthermore, the continuous modeled environment affords smooth auditory motion (sound trajectory). Others have recently applied standard regularization techniques to the problem of HRTF approximation (Chen et al. 1993), using a basis of planar thin-plate splines rather than spherical basis functions, and without benefit of gradient-descent learning. Their approximation performance is consistent with that of the standard class of planar RBF networks, hence subject to the distortions addressed by this paper.

²Vision, on the other hand, is more naturally considered in spatial terms due to the retinotopic projection.

Figure 3: Coordinate system for sampling head-related transfer functions (HRTFs). Measurements are recorded near the eardrum of a subject placed in the center of a spherical speaker array. Azimuth is denoted by θ (ranging periodically from −180° to +180° or 0 to 2π) and elevation by φ (ranging from −90° to +90° or 0 to π). The origin (0°, 0°) of the coordinate system is directly in front of the listener.

From both a practical as well as a theoretical standpoint, the measured HRTFs are of much higher dimensionality than necessary for modeling purposes. The redundancy inherent in the measurement can be statistically removed via the technique of principal component analysis (see Kistler and Wightman 1992). Each HRTF magnitude spectrum consists of 256 real-valued frequency components for each ear and each sound source location in space, hence, a 256-dimensional vector. From the set of all measurements of sound source locations, we are able to reduce the dimensionality of the frequency components to six principal components (for each ear and location) that account for 98% of the spectral variance. This linear operation allows the use of six output units in the network rather than 256 frequency components while losing only a negligible
amount of information. Reconstruction to the original dimensionality of the HRTFs is a straightforward linear inverse transformation. 5 Learning Auditory Space
450 HRTF measurements from a single human ear served as the database for training a VMBF neural network.³ These measurements, corresponding to discrete locations on the sphere, ranged from −170° to 180° in azimuth and −50° to 90° in elevation in 10° steps. At higher elevations the solid angle spanned is less, thus requiring fewer samples. The six principal components described above are the M elements of the training pattern t_p. The total number of training patterns was 400, with 50 random patterns reserved for test data (i.e., data fed forward through the network without parameter updating). The network parameters were initialized by positioning the basis functions uniformly on the input space with a small degree of relative overlap and solving the output weights w_ij with a single pseudoinverse. Iterative training then proceeded with the successive presentation of training patterns, evaluation of gradient-descent equations, and update of free parameters. Figure 4 shows the training and intermediate testing history for 2500 epochs for a VMBF network with 9 basis functions. The initial point of the learning curves corresponds to the RMS error immediately following the initialization of the VMBF network parameters. Improvement due to learned basis function positions and shapes is clearly evident in Figure 4. The RMS testing error is only slightly worse than that of the training data, which demonstrates reasonable generalization to novel data. Good generalization is desirable since the training data themselves are somewhat noisy. Figure 5 illustrates two views of the final positions and widths of the nine von Mises basis functions following gradient-descent learning. This particular network was trained on measurements taken from the right ear. Due to the acoustical interaction of the head with the radiating sound pressure wave, most of the variability, hence available information, in the HRTFs occurs when the sound source is on the same side of the head.
This asymmetry is reflected in the final positioning of the basis functions as shown in Figure 5, and demonstrates the directness with which we can interpret the learned hidden unit weights (positions and widths). Planar gaussian RBF networks applied to this spherical problem generally perform as well as the VMBF network in regions near the center of the planar projected input space, but perform less well near the edges. As learning proceeds, the centers of the gaussian basis functions move inward due to the artificially absent data beyond the edges. The periodic VMBF network is immune to this bias due to its intrinsic lack of edges.

³HRTFs were obtained from Drs. Wightman and Kistler, Hearing Development Research Laboratory, University of Wisconsin-Madison.
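The initialization strategy described in this section, uniform placement of the basis functions followed by a single pseudoinverse solve of the output weights, can be sketched as follows. The code is a hypothetical illustration (the function names and example centers are ours, not the authors' implementation).

```python
import numpy as np

def vmbf(theta, phi, alpha, beta, kappa):
    """von Mises basis function of equation 2.1."""
    return np.exp(kappa * (np.sin(phi) * np.sin(beta) * np.cos(theta - alpha)
                           + np.cos(phi) * np.cos(beta)))

def init_weights(samples, targets, centers, kappas):
    """Solve the output weights in one shot with the Moore-Penrose
    pseudoinverse, given fixed basis positions and concentrations."""
    H = np.stack([vmbf(th, ph, centers[:, 0], centers[:, 1], kappas)
                  for th, ph in samples])    # N x J design matrix
    return (np.linalg.pinv(H) @ targets).T   # M x J least-squares weights

# Example: four fixed basis functions with identical concentration.
centers = np.array([[0.5, 1.0], [2.0, 1.5], [4.0, 2.0], [5.5, 0.8]])
kappas = np.full(4, 2.0)
```

With the basis parameters frozen, the least-squares solution gives the starting point from which the gradient-descent equations of Section 3 then move the centers, widths, and weights jointly.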
Figure 4: Root-mean-squared error for training and testing data as a function of training epochs.
To demonstrate this immunity, 40 input-output pairs were systematically split off from the original database of 450 in the region near an arbitrarily defined edge (±180° azimuth) for use as novel probe data (in contrast with a distribution of randomly selected test data). The average magnitudes of the probe data set and the training data set were equalized. The top panel of Figure 6 shows the progressive decline in RMS error as training progresses for a gaussian RBF network and a VMBF network with the remaining 410 input-output pairs. Both networks have nine basis functions and the appropriate gradient-descent update rules. The bottom panel shows the initial decline in RMS error of the testing data. As the gaussian RBF network learns the training set, the RMS error of the probe data rises dramatically. In contrast, the VMBF network training generalizes well to the probe data due to its intrinsic periodicity. For this particular set of novel probe data, the gaussian RBF network must extrapolate beyond the artificial edge of the training data, while the VMBF network performs a spherically constrained interpolation. Figure 7 illustrates the principal component surfaces approximated by
Figure 5: Positions and relative widths of the von Mises basis functions from a 9-basis function VMBF network shown from two viewpoints: (a) 45° to the right of the median line and (b) directly overhead. The displayed width of each basis function is determined by a cross-sectional cut 25% down from the peak of each basis function, which by definition is located at the centroid.
Figure 6: Comparison of gaussian RBF network and VMBF network generalization to novel probe data selected in the region near an arbitrary edge (±180° azimuth). Gaussian RBF error is denoted by the fine line and VMBF error is denoted by the bold line.
Figure 7: Final approximation surfaces for a 25-basis function VMBF. (A, B) The database of HRTF principal components I (A) and II (B) as a function of direction (azimuth and elevation) in steps of 10°. The actual network contained 6 output units (i.e., four more in addition to the 2 shown) corresponding to the 6 principal components derived from the total 450 256-dimensional HRTFs. (C, D) The results of VMBF network training. The predicted principal components I (C) and II (D) are shown in increments of 5°.

a 25-basis function VMBF network. Figures 7A and 7B show the known first and second principal components (I and II), respectively, as a function of azimuth and elevation location, which together account for 93% of the total variance of the measured 450 HRTFs. Note that measurements were not taken below −50° elevation due to technical constraints in obtaining those samples. Figures 7C and 7D show the principal components predicted by the VMBF network as a function of spherical input for the entire range of positions. The data are plotted on a two-dimensional Cartesian grid with edges, rather than on the more difficult to visualize spherical grid. Therefore, the graphic representation of samples near the poles will naturally distort in the fashion well known to cartographers. For example, all of the samples at −90° (south pole) are the same measurement regardless of azimuth, hence an isomorphic line of output from
the network should, and does, occur at this elevation. This singularity can be observed in Figures 7C and 7D. Due to the spherical topology of the basis function, this constraint is fundamental to the VMBF network; in contrast, the gaussian RBF network operating in Cartesian space would not enforce this constraint. Smoothing of the measured data as a consequence of the well-trained approximation function is also evident in Figures 7C and 7D. 6 Conclusions
The model of auditory space represents a good example of tailoring a particular neural network architecture (or basis function) to the appropriate input representation, in this case a spherical representation. Because of the periodic nature of the spherical basis, there are no edge effects that arise when using the multidimensional gaussian for approximation in Cartesian space. Well-behaved networks are obtained as a result of this spherical constraint. Work is ongoing to better characterize the hidden-layer subspace and learning dynamics. These algorithms currently serve as the foundation for ongoing research into implementing auditory virtual environments. We are particularly interested in how a human listener could be integrated into the network learning loop for individual tuning of an auditory space model trained on another person’s set of HRTFs, thereby eliminating the need for technically demanding acoustic measurements.
Acknowledgments This work was supported in part by the Wisconsin Alumni Research Foundation and the Office of Naval Research. We greatly appreciate the database of HRTFs provided by Fred Wightman and Doris Kistler.
References

Arnold, K. J. 1941. On spherical probability distributions. Unpublished Ph.D. thesis, Massachusetts Institute of Technology.
Broomhead, D. S., and Lowe, D. 1988. Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321-355.
Chen, J., Van Veen, B. D., and Hecox, K. E. 1993. Synthesis of 3D virtual auditory space via a spatial feature extraction and regularization model. In IEEE Virtual Reality Annu. Int. Symp., 188-193.
Chiang, C. H. 1988. Kinematics of Spherical Mechanisms. Cambridge University Press, Cambridge.
Fisher, N. I., Lewis, T., and Embleton, B. J. J. 1987. Statistical Analysis of Spherical Data. Cambridge University Press, Cambridge.
Ghil, M., Cohn, S., Tavantzis, J., Bube, K., and Isaacson, E. 1981. Applications of estimation theory to numerical weather prediction. In Dynamic Meteorology: Data Assimilation Methods, L. Bengtsson, M. Ghil, and E. Kallen, eds., pp. 139-284. Springer-Verlag, New York.
Hartman, E. J., and Keeler, J. D. 1991. Predicting the future: Advantages of semilocal units. Neural Comp. 3, 566-578.
Kistler, D. J., and Wightman, F. L. 1992. A model of head-related transfer functions based on principal component analysis and minimum phase reconstruction. J. Acoust. Soc. Am. 91, 1637-1647.
Mardia, K. V. 1972. Statistics of Directional Data. Academic Press, London.
Middlebrooks, J. C., Makous, J. C., and Green, D. M. 1989. Directional sensitivity of sound-pressure levels in the human ear canal. J. Acoust. Soc. Am. 86, 89-108.
Moody, J., and Darken, C. 1988. Learning with localized receptive fields. In Connectionist Models Summer School, 1-11.
Moody, J., and Darken, C. J. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Poggio, T., and Girosi, F. 1990. Networks for approximation and learning. Proc. IEEE 78, 1481-1496.
Powell, M. J. D. 1987. Radial basis functions for multivariable interpolation: A review. In Algorithms for Approximation, J. C. Mason and M. G. Cox, eds., pp. 143-166. Clarendon Press, Oxford.
Pralong, D., and Carlile, S. 1994. Measuring the human head-related transfer functions: A novel method for the construction and calibration of a miniature "in-ear" recording system. J. Acoust. Soc. Am. 95, 3435-3444.
Wahba, G., and Wendelberger, J. 1980. Some new mathematical methods for variational objective analysis using splines and cross-validation. Monthly Weather Rev. 108, 1122-1145.
Wahba, G. 1981. Spline interpolation and smoothing on the sphere. SIAM J. Sci. Stat. Comput. 2, 5-16.
Wightman, F. L., and Kistler, D. J. 1989. Headphone simulation of free-field listening I: Stimulus synthesis. J. Acoust. Soc. Am. 85, 858-867.
Received July 29, 1994; accepted March 20, 1995.
Communicated by Steve Nowlan
On Convergence Properties of the EM Algorithm for Gaussian Mixtures

Lei Xu
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA, and Department of Computer Science, The Chinese University of Hong Kong, Hong Kong

Michael I. Jordan
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA

We build up the mathematical connection between the "Expectation-Maximization" (EM) algorithm and gradient-based approaches for maximum likelihood learning of finite gaussian mixtures. We show that the EM step in parameter space is obtained from the gradient via a projection matrix P, and we provide an explicit expression for the matrix. We then analyze the convergence of EM in terms of special properties of P and provide new results analyzing the effect that P has on the likelihood surface. Based on these mathematical results, we present a comparative discussion of the advantages and disadvantages of EM and other algorithms for the learning of gaussian mixture models. 1 Introduction
Neural Computation 8, 129-151 (1996)
© 1995 Massachusetts Institute of Technology

The "Expectation-Maximization" (EM) algorithm is a general technique for maximum likelihood (ML) or maximum a posteriori (MAP) estimation. The recent emphasis in the neural network literature on probabilistic models has led to increased interest in EM as a possible alternative to gradient-based methods for optimization. EM has been used for variations on the traditional theme of gaussian mixture modeling (Ghahramani and Jordan 1994; Nowlan 1991; Xu and Jordan 1993a,b; Tresp et al. 1994; Xu et al. 1994) and has also been used for novel chain-structured and tree-structured architectures (Bengio and Frasconi 1995; Jordan and Jacobs 1994). The empirical results reported in these papers suggest that EM has considerable promise as an optimization method for such architectures. Moreover, new theoretical results have been obtained that link EM to other topics in learning theory (Amari 1994; Jordan and Xu 1995; Neal and Hinton 1993; Xu and Jordan 1993c; Yuille et al. 1994). Despite these developments, there are grounds for caution about the promise of the EM algorithm. One reason for caution comes from consideration of theoretical convergence rates, which show that EM is a first-order algorithm.¹ More precisely, there are two key results available in the statistical literature on the convergence of EM. First, it has been established that under mild conditions EM is guaranteed to converge toward a local maximum of the log likelihood l (Boyles 1983; Dempster et al. 1977; Redner and Walker 1984; Wu 1983). (Indeed the convergence is monotonic: l(Θ^(k+1)) ≥ l(Θ^(k)), where Θ^(k) is the value of the parameter vector Θ at iteration k.) Second, considering EM as a mapping Θ^(k+1) = M(Θ^(k)) with fixed point Θ* = M(Θ*), we have Θ^(k+1) − Θ* ≈ [∂M(Θ*)/∂Θ*](Θ^(k) − Θ*) when Θ^(k+1) is near Θ*, and thus
‖Θ^(k+1) − Θ*‖ ≤ ‖∂M(Θ*)/∂Θ*‖ · ‖Θ^(k) − Θ*‖

with

‖∂M(Θ*)/∂Θ*‖ ≠ 0

almost surely. That is, EM is a first-order algorithm. The first-order convergence of EM has been cited in the statistical literature as a major drawback. Redner and Walker (1984), in a widely cited article, argued that superlinear (quasi-Newton, method of scoring) and second-order (Newton) methods should generally be preferred to EM. They reported empirical results demonstrating the slow convergence of EM on a gaussian mixture model problem for which the mixture components were not well separated. These results did not include tests of competing algorithms, however. Moreover, even though the convergence toward the "optimal" parameter values was slow in these experiments, the convergence in likelihood was rapid. Indeed, Redner and Walker acknowledge that their results show that "even when the component populations in a mixture are poorly separated, the EM algorithm can be expected to produce in a very small number of iterations parameter values such that the mixture density determined by them reflects the sample data very well." In the context of the current literature on learning, in which the predictive aspect of data modeling is emphasized at the expense of the traditional Fisherian statistician's concern over the "true" values of parameters, such rapid convergence in likelihood is a major desideratum of a learning algorithm and undercuts the critique of EM as a "slow" algorithm.

¹For an iterative algorithm that converges to a solution Θ*, if there is a real number γ₀ and a constant integer k₀ such that for all k > k₀ we have

‖Θ^(k+1) − Θ*‖ ≤ q ‖Θ^(k) − Θ*‖^γ₀

with q being a positive constant independent of k, then we say that the algorithm has a convergence rate of order γ₀. In particular, an algorithm has first-order or linear convergence if γ₀ = 1, superlinear convergence if 1 < γ₀ < 2, and second-order or quadratic convergence if γ₀ = 2.
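The footnote's definition can be illustrated numerically: given three successive errors e_{k-1}, e_k, e_{k+1}, the order γ₀ is approximately log(e_{k+1}/e_k) / log(e_k/e_{k-1}). The following toy sketch (our own illustration, not from the paper) applies this estimate to synthetic linearly and quadratically converging sequences.

```python
import math

def estimated_order(errors):
    """Estimate the convergence order gamma_0 from the last three errors."""
    e0, e1, e2 = errors[-3:]
    return math.log(e2 / e1) / math.log(e1 / e0)

# First-order (linear) sequence: e_{k+1} = 0.5 * e_k.
linear = [0.5 ** k for k in range(1, 10)]
# Second-order (quadratic) sequence: e_{k+1} = e_k ** 2, Newton-like.
quadratic = [0.5 ** (2 ** k) for k in range(1, 6)]

print(round(estimated_order(linear), 2))     # 1.0
print(round(estimated_order(quadratic), 2))  # 2.0
```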
In the current paper, we provide a comparative analysis of EM and other optimization methods. We emphasize the comparison between EM and other first-order methods (gradient ascent, conjugate gradient methods), because these have tended to be the methods of choice in the neural network literature. However, we also compare EM to superlinear and second-order methods. We argue that EM has a number of advantages, including its naturalness at handling the probabilistic constraints of mixture problems and its guarantees of convergence. We also provide new results suggesting that under appropriate conditions EM may in fact approximate a superlinear method; this would explain some of the promising empirical results that have been obtained (Jordan and Jacobs 1994), and would further temper the critique of EM offered by Redner and Walker. The analysis in the current paper focuses on unsupervised learning; for related results in the supervised learning domain see Jordan and Xu (1995). The remainder of the paper is organized as follows. We first briefly review the EM algorithm for gaussian mixtures. The second section establishes a connection between EM and the gradient of the log likelihood. We then present a comparative discussion of the advantages and disadvantages of various optimization algorithms in the gaussian mixture setting. We then present empirical results suggesting that EM regularizes the condition number of the effective Hessian. The fourth section presents a theoretical analysis of this empirical finding. The final section presents our conclusions. 2 The EM Algorithm for Gaussian Mixtures
We study the following probabilistic model:

P(x | Θ) = Σ_{j=1}^{K} α_j P(x | m_j, Σ_j)   (2.1)

and

P(x | m_j, Σ_j) = (1 / ((2π)^{d/2} |Σ_j|^{1/2})) exp[−(1/2)(x − m_j)^T Σ_j^{−1} (x − m_j)]

where α_j ≥ 0 and Σ_{j=1}^{K} α_j = 1, and d is the dimension of x. The parameter vector Θ consists of the mixing proportions α_j, the mean vectors m_j, and the covariance matrices Σ_j. Given K and given N independent, identically distributed samples {x^(t)}_{t=1}^{N}, we obtain the following log likelihood:²

l(Θ) = log Π_{t=1}^{N} P(x^(t) | Θ) = Σ_{t=1}^{N} log Σ_{j=1}^{K} α_j P(x^(t) | m_j, Σ_j)   (2.2)

²Although we focus on maximum likelihood (ML) estimation in this paper, it is straightforward to apply our results to maximum a posteriori (MAP) estimation by multiplying the likelihood by a prior.
which can be optimized via the following iterative algorithm (see, e.g., Dempster et al. 1977):

α_j^(k+1) = (1/N) Σ_{t=1}^{N} h_j^(k)(t)

m_j^(k+1) = Σ_{t=1}^{N} h_j^(k)(t) x^(t) / Σ_{t=1}^{N} h_j^(k)(t)

Σ_j^(k+1) = Σ_{t=1}^{N} h_j^(k)(t) [x^(t) − m_j^(k+1)][x^(t) − m_j^(k+1)]^T / Σ_{t=1}^{N} h_j^(k)(t)   (2.3)

where the posterior probabilities h_j^(k)(t) are defined as follows:

h_j^(k)(t) = α_j^(k) P(x^(t) | m_j^(k), Σ_j^(k)) / Σ_{i=1}^{K} α_i^(k) P(x^(t) | m_i^(k), Σ_i^(k))   (2.4)
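The E-step (posterior computation) and M-step (parameter updates) above are straightforward to sketch in NumPy. This is a hypothetical illustration, not the authors' code; it computes the log likelihood at the incoming parameters as a by-product so that monotonic improvement can be observed.

```python
import numpy as np

def em_step(X, alpha, means, covs):
    """One EM iteration for a gaussian mixture: the E-step computes the
    posteriors h_j(t), the M-step applies the mixture updates."""
    N, d = X.shape
    K = len(alpha)
    H = np.empty((N, K))
    for j in range(K):  # E-step: weighted component densities
        diff = X - means[j]
        quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(covs[j]), diff)
        H[:, j] = alpha[j] * np.exp(-0.5 * quad) / np.sqrt(
            (2 * np.pi) ** d * np.linalg.det(covs[j]))
    loglik = np.log(H.sum(axis=1)).sum()  # log likelihood at the old parameters
    H /= H.sum(axis=1, keepdims=True)     # normalize rows to posteriors h_j(t)
    Nj = H.sum(axis=0)                    # M-step
    alpha_new = Nj / N
    means_new = (H.T @ X) / Nj[:, None]
    covs_new = [(H[:, j, None] * (X - means_new[j])).T @ (X - means_new[j]) / Nj[j]
                for j in range(K)]
    return alpha_new, means_new, covs_new, loglik
```

Iterating `em_step` produces a nondecreasing sequence of log-likelihood values, as guaranteed by the monotonicity result cited in the introduction.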
3 Connection between EM and Gradient Ascent
In the following theorem we establish a relationship between the gradient of the log likelihood and the step in parameter space taken by the EM algorithm. In particular we show that the EM step can be obtained by premultiplying the gradient by a positive definite matrix. We provide an explicit expression for the matrix.

Theorem 1. At each iteration of the EM algorithm equation 2.3, we have
m_j^(k+1) = m_j^(k) + P_{m_j}^(k) (∂l/∂m_j)|_{Θ^(k)}   (3.1)

vec[Σ_j^(k+1)] = vec[Σ_j^(k)] + P_{Σ_j}^(k) (∂l/∂vec[Σ_j])|_{Θ^(k)}   (3.2)

A^(k+1) = A^(k) + P_A^(k) (∂l/∂A)|_{Θ^(k)}   (3.3)

where

P_{m_j}^(k) = Σ_j^(k) / Σ_{t=1}^{N} h_j^(k)(t)   (3.4)

P_{Σ_j}^(k) = (2 / Σ_{t=1}^{N} h_j^(k)(t)) Σ_j^(k) ⊗ Σ_j^(k)   (3.5)

P_A^(k) = (1/N) {diag[α_1^(k), …, α_K^(k)] − A^(k) (A^(k))^T}   (3.6)
where A denotes the vector of mixing proportions [α_1, …, α_K]^T, j indexes the mixture components (j = 1, …, K), k denotes the iteration number, "vec[B]" denotes the vector obtained by stacking the column vectors of the matrix B, and "⊗" denotes the Kronecker product. Moreover, given the constraints Σ_{j=1}^{K} α_j^(k) = 1 and α_j^(k) ≥ 0, P_A^(k) is a positive definite matrix, and the matrices P_{m_j}^(k) and P_{Σ_j}^(k) are positive definite with probability one for N sufficiently large.
The proof of this theorem can be found in the Appendix. Using the notation Θ = [m_1^T, …, m_K^T, vec[Σ_1]^T, …, vec[Σ_K]^T, A^T]^T and P(Θ) = diag[P_{m_1}, …, P_{m_K}, P_{Σ_1}, …, P_{Σ_K}, P_A], we can combine the three updates in Theorem 1 into a single equation:

Θ^(k+1) = Θ^(k) + P(Θ^(k)) (∂l/∂Θ)|_{Θ^(k)}   (3.7)

Under the conditions of Theorem 1, P(Θ^(k)) is a positive definite matrix with probability one. Recalling that for a positive definite matrix B we have (∂l/∂Θ)^T B (∂l/∂Θ) > 0, we have the following corollary:
Corollary 1. For each iteration of the EM algorithm given by equation 2.3, the search direction Θ^(k+1) − Θ^(k) has a positive projection on the gradient of l.

That is, the EM algorithm can be viewed as a variable metric gradient ascent algorithm for which the projection matrix P(Θ^(k)) changes at each iteration as a function of the current parameter value Θ^(k). Our results extend earlier results due to Baum and Sell (1968), who studied recursive equations of the following form:

x^(k+1) = T(x^(k)),  T(x^(k)) = [T(x^(k))_1, …, T(x^(k))_K],  T(x^(k))_j = x_j^(k) (∂J/∂x_j^(k)) / Σ_{i=1}^{K} x_i^(k) (∂J/∂x_i^(k))

where x_j^(k) ≥ 0, Σ_{j=1}^{K} x_j^(k) = 1, and J is a polynomial in the x_j^(k) having positive coefficients. They showed that the search direction of this recursive formula, i.e., T(x^(k)) − x^(k), has a positive projection on the gradient of J with respect to x^(k) (see also Levinson et al. 1983). It can be shown that Baum and Sell's recursive formula implies the EM update formula for A in a gaussian mixture. Thus, the first statement in Theorem 1 is a special case of Baum and Sell's earlier work. However, Baum and Sell's theorem is an existence theorem and does not provide an explicit expression for the matrix P_A that transforms the gradient direction into the EM direction. Our theorem provides such an explicit form for P_A. Moreover, we generalize Baum and Sell's results to handle the updates for m_j and Σ_j, and we provide explicit expressions for the positive definite transformation matrices P_{m_j} and P_{Σ_j} as well.
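The explicit form of P_A is easy to verify numerically for the mixing proportions: the EM step A^(k+1) − A^(k) coincides with P_A (∂l/∂A). The following toy check is our own illustration (not from the paper); the component densities are arbitrary fixed positive numbers standing in for P(x^(t) | m_j, Σ_j).

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 3
dens = rng.uniform(0.1, 1.0, size=(N, K))  # stand-ins for P(x_t | m_j, Sigma_j)
alpha = rng.dirichlet(np.ones(K))          # current mixing proportions A

p = dens @ alpha                           # mixture density at each sample
grad = (dens / p[:, None]).sum(axis=0)     # dl/d alpha_j = sum_t P_tj / p_t
em_step_A = (dens * alpha / p[:, None]).mean(axis=0) - alpha  # EM step A_new - A
P_A = (np.diag(alpha) - np.outer(alpha, alpha)) / N           # projection matrix
print(np.allclose(em_step_A, P_A @ grad))  # True
```

The identity holds exactly: multiplying the gradient by P_A rescales it into the posterior average that the M-step computes, which is the content of the first statement of Theorem 1.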
It is also worthwhile to compare the EM algorithm to other gradient-based optimization methods. Newton's method is obtained by premultiplying the gradient by the inverse of the Hessian of the log likelihood:

Θ^(k+1) = Θ^(k) − H(Θ^(k))^{−1} (∂l/∂Θ)|_{Θ^(k)}

Newton's method is the method of choice when it can be applied, but the algorithm is often difficult to use in practice. In particular, the algorithm can diverge when the Hessian becomes nearly singular; moreover, the computational costs of computing the inverse Hessian at each step can be considerable. An alternative is to approximate the inverse by a recursively updated matrix B^(k+1) = B^(k) + η ΔB^(k). Such a modification is called a quasi-Newton method. Conventional quasi-Newton methods are unconstrained optimization methods, however, and must be modified to be used in the mixture setting (where there are probabilistic constraints on the parameters). In addition, quasi-Newton methods generally require that a one-dimensional search be performed at each iteration to guarantee convergence. The EM algorithm can be viewed as a special form of quasi-Newton method in which the projection matrix P(Θ^(k)) in equation 3.7 plays the role of B^(k). As we discuss in the remainder of the paper, this particular matrix has a number of favorable properties that make EM particularly attractive for optimization in the mixture setting.
4 Constrained Optimization and General Convergence
An important property of the matrix P is that the EM step in parameter space automatically satisfies the probabilistic constraints of the mixture model in equation 2.1. The domain of 0 contains two regions that embody the probabilistic constraints: VI = ( 0 : C,"=, ojk) = 1) and V 2 = {(+ : ~ 1 ; ~ 2 ) 0, S, is positive definite}. For the EM algorithm the update for the mixing proportions 0, can be rewritten as follows:
It is obvious that the iteration stays within D1. Similarly, the update for Σ_j can be rewritten as:

Σ_j^{(k+1)} = Σ_{t=1}^N h_j^{(k)}(t) (x^{(t)} − m_j^{(k+1)}) (x^{(t)} − m_j^{(k+1)})^T / Σ_{t=1}^N h_j^{(k)}(t)
which stays within D2 for N sufficiently large. Whereas EM automatically satisfies the probabilistic constraints of a mixture model, other optimization techniques generally require modification to satisfy the constraints. One approach is to modify each iterative
EM Algorithm for Gaussian Mixtures
135
step to keep the parameters within the constrained domain. A number of such techniques have been developed, including feasible-direction methods, active sets, gradient projection, reduced-gradient methods, and linearly constrained quasi-Newton methods. These constrained methods all incur extra computational costs to check and maintain the constraints, and, moreover, the theoretical convergence rates for such constrained algorithms need not be the same as those for the corresponding unconstrained algorithms. A second approach is to transform the constrained optimization problem into an unconstrained problem before using the unconstrained method. This can be accomplished via penalty and barrier functions, Lagrangian terms, or reparameterization. Once again, the extra algorithmic machinery renders simple comparisons based on unconstrained convergence rates problematic. Moreover, it is not easy to meet the constraints on the covariance matrices in the mixture using such techniques.

A second appealing property of P(Θ^{(k)}) is that each iteration of EM is guaranteed to increase the likelihood (i.e., l(Θ^{(k+1)}) ≥ l(Θ^{(k)})). This monotonic convergence of the likelihood is achieved without step-size parameters or line searches. Other gradient-based optimization techniques, including gradient descent, quasi-Newton, and Newton's method, do not provide such a simple theoretical guarantee, even assuming that the constrained problem has been transformed into an unconstrained one. For gradient ascent, the step size η must be chosen to ensure that

‖Θ^{(k+1)} − Θ^{(k)}‖ / ‖Θ^{(k)} − Θ^{(k−1)}‖ ≤ ‖I + ηH(Θ^{(k−1)})‖ < 1

This requires a one-dimensional line search or an optimization of η at each iteration, and the extra computation can slow down the convergence. An alternative is to fix η at a very small value, which generally makes ‖I + ηH(Θ^{(k−1)})‖ close to one and results in slow convergence.
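Both properties, automatic constraint satisfaction and monotone likelihood, can be checked numerically. The following sketch runs EM on synthetic univariate two-component data (all numerical values are illustrative, not the paper's data) and asserts both properties:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from a two-component mixture (illustrative values).
x = np.concatenate([rng.normal(-2, 1, 700), rng.normal(2, 1, 300)])

def loglik(x, alpha, m, s2):
    # Per-point weighted component densities and total log likelihood.
    comp = alpha * np.exp(-(x[:, None] - m)**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    return np.log(comp.sum(axis=1)).sum(), comp

alpha, m, s2 = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
lls = []
for _ in range(20):
    ll, comp = loglik(x, alpha, m, s2)
    lls.append(ll)
    h = comp / comp.sum(axis=1, keepdims=True)           # posteriors h_j(t)
    alpha = h.mean(axis=0)                               # stays on the simplex (D1)
    m = (h * x[:, None]).sum(axis=0) / h.sum(axis=0)
    s2 = (h * (x[:, None] - m)**2).sum(axis=0) / h.sum(axis=0)  # stays positive (D2)

assert np.isclose(alpha.sum(), 1.0) and (alpha >= 0).all()   # constraints hold
assert all(b >= a - 1e-8 for a, b in zip(lls, lls[1:]))      # monotone likelihood
```

No step size or line search appears anywhere: every EM iterate is feasible by construction, and the likelihood sequence is nondecreasing (up to floating-point tolerance).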
For Newton's method, the iterative process is usually required to start near a solution; otherwise the Hessian may be indefinite and the iteration may not converge. Levenberg-Marquardt methods handle the indefinite-Hessian problem; however, a one-dimensional optimization or other form of search is required to find a suitable scalar to add to the diagonal elements of the Hessian. Fisher scoring methods can also handle the indefinite-Hessian problem, but for nonquadratic nonlinear optimization Fisher scoring requires a step size η that obeys ‖I + ηBH(Θ^{(k−1)})‖ < 1, where B is the Fisher information matrix. Thus, problems similar to those of gradient ascent arise here as well. Finally, for quasi-Newton methods or conjugate gradient methods, a one-dimensional line search is required at each iteration. In summary, all of these gradient-based methods incur extra computational costs at each iteration, rendering simple comparisons based on local convergence rates unreliable.

For large-scale problems, algorithms that change the parameters immediately after each data point ("on-line algorithms") are often significantly faster in practice than batch algorithms. The popularity of gradient descent algorithms for neural networks is due in part to the ease of obtaining on-line variants of gradient descent. It is worth noting that on-line
variants of the EM algorithm can be derived (Neal and Hinton 1993; Titterington 1984), and this is a further factor that weighs in favor of EM as compared to conjugate gradient and Newton methods.

5 Convergence Rate Comparisons
In this section, we provide a comparative theoretical discussion of the local convergence rates of constrained gradient ascent and EM. For gradient ascent a local convergence result can be obtained by Taylor expanding the log likelihood around the maximum likelihood estimate Θ*. For sufficiently large k we have

Θ^{(k+1)} − Θ* ≈ [I + ηH(Θ*)] (Θ^{(k)} − Θ*)   (5.1)

and

‖Θ^{(k+1)} − Θ*‖ ≤ r ‖Θ^{(k)} − Θ*‖   (5.2)

where H is the Hessian of l, η is the step size, and r = max{|1 − ηλ_m[−H(Θ*)]|, |1 − ηλ_M[−H(Θ*)]|}, where λ_M[A] and λ_m[A] denote the largest and smallest eigenvalues of A, respectively. Smaller values of r correspond to faster convergence rates. To guarantee convergence, we require r < 1, or 0 < η < 2/λ_M[−H(Θ*)]. The minimum possible value of r is obtained when η = 1/λ_M[−H(Θ*)], with

r_m = 1 − λ_m[−H(Θ*)]/λ_M[−H(Θ*)] = 1 − 1/κ[−H(Θ*)]

where κ[H] = λ_M[H]/λ_m[H] is the condition number of H. Larger values of the condition number correspond to slower convergence. When κ[H] = 1 we have r_m = 0, which corresponds to a superlinear rate of convergence. Indeed, Newton's method can be viewed as a method for obtaining a more desirable condition number: the inverse Hessian H^{-1} balances the Hessian H such that the resulting condition number is one. Effectively, Newton's method can be regarded as gradient ascent on a new function with an effective Hessian that is the identity matrix: H_eff = H^{-1}H = I. In practice, however, κ[H] is usually quite large. The larger κ[H] is, the more difficult it is to compute H^{-1} accurately. Hence it is difficult to balance the Hessian as desired. In addition, as we mentioned in the previous section, the Hessian varies from point to point in the parameter space, and at each iteration we need to recompute the inverse Hessian. Quasi-Newton methods approximate H(Θ^{(k)})^{-1} by a positive definite matrix B^{(k)} that is easy to compute.

The discussion thus far has treated unconstrained optimization. To compare gradient ascent with the EM algorithm on the constrained mixture estimation problem, we consider a gradient projection method:

Θ^{(k+1)} = Θ^{(k)} + η Π_k ∂l/∂Θ^{(k)}   (5.3)

where Π_k is the projection matrix that projects the gradient ∂l/∂Θ^{(k)} into D1. This gradient projection iteration will remain in D1 as long as the initial parameter vector is in D1. To keep the iteration within D2, we choose an initial Θ^{(0)} ∈ D2 and keep η sufficiently small at each iteration. Suppose that E = [e_1, ..., e_m] is a set of independent unit basis vectors that spans the space D1. In this basis, Θ^{(k)} and Π_k(∂l/∂Θ^{(k)}) become Θ_E^{(k)} = E^T Θ^{(k)} and ∂l/∂Θ_E^{(k)} = E^T(∂l/∂Θ^{(k)}), respectively, with ‖Θ_E^{(k)} − Θ_E*‖ = ‖Θ^{(k)} − Θ*‖. In this representation the gradient projection algorithm (equation 5.3) becomes simple gradient ascent: Θ_E^{(k+1)} = Θ_E^{(k)} + η(∂l/∂Θ_E^{(k)}). Moreover, equation 5.1 becomes ‖Θ^{(k+1)} − Θ*‖ ≤ ‖E^T[I + ηH(Θ*)]‖ ‖Θ^{(k)} − Θ*‖. As a result, the convergence rate is bounded by
r_1 = sqrt( λ_M[ E^T (I + 2ηH(Θ*) + η²H²(Θ*)) E ] )

Since H(Θ*) is negative definite, we obtain

r_1 ≤ sqrt(1 + η²λ_M²[−H_s] − 2ηλ_m[−H_s])   (5.4)

In this equation H_s = E^T H(Θ*) E is the Hessian of l restricted to D1. We see from this derivation that the convergence speed depends on κ[H_s] = λ_M[−H_s]/λ_m[−H_s]. When κ[H_s] = 1, we have

r_1 ≤ sqrt(1 − 2ηλ_M[−H_s] + η²λ_M²[−H_s]) = |1 − ηλ_M[−H_s]|
which in principle can be made to equal zero if η is selected appropriately. In this case, a superlinear rate is obtained. Generally, however, κ[H_s] ≠ 1, with smaller values of κ[H_s] corresponding to faster convergence. We now turn to an analysis of the EM algorithm. As we have seen, EM keeps the parameter vector within D1 automatically. Thus, in the new basis the connection between EM and gradient ascent (cf. equation 3.7) becomes

Θ_E^{(k+1)} = Θ_E^{(k)} + E^T P(Θ^{(k)}) (∂l/∂Θ^{(k)})
The latter equation can be further manipulated to yield

r_E ≤ sqrt(1 + λ_M²[−E^T P H E] − 2λ_m[−E^T P H E])   (5.5)

Thus we see that the convergence speed of EM depends on

κ[E^T P H E] = λ_M[−E^T P H E]/λ_m[−E^T P H E]

When κ[E^T P H E] = 1 and λ_M[−E^T P H E] = 1, we have

sqrt(1 + λ_M²[−E^T P H E] − 2λ_m[−E^T P H E]) = 1 − λ_M[−E^T P H E] = 0
In this case, a superlinear rate is obtained. We discuss the possibility of obtaining superlinear convergence with EM in more detail below.

These results show that the convergence of gradient ascent and EM both depend on the shape of the log likelihood as measured by the condition number. When κ[H] is near one, the configuration is quite regular, and the update direction points directly to the solution, yielding fast convergence. When κ[H] is very large, the l surface has an elongated shape, and the search along the update direction is a zigzag path, making convergence very slow. The key idea of Newton and quasi-Newton methods is to reshape the surface. The nearer it is to a ball shape (Newton's method achieves this shape in the ideal case), the better the convergence. Quasi-Newton methods aim to achieve an effective Hessian whose condition number is as close as possible to one. Interestingly, the results that we now present suggest that the projection matrix P for the EM algorithm also serves to effectively reshape the likelihood, yielding an effective condition number that tends to one. We first present empirical results that support this suggestion and then present a theoretical analysis.

We sampled 1000 points from a simple finite mixture model given by

p(x) = α_1 p_1(x) + α_2 p_2(x)

where p_j(x) is the univariate gaussian density with mean m_j and variance σ_j²:

p_j(x) = (1/sqrt(2πσ_j²)) exp(−(x − m_j)²/(2σ_j²))

The parameter values were as follows: α_1 = 0.7170, α_2 = 0.2830, m_1 = −2, m_2 = 2, σ_1² = 1, σ_2² = 1. We ran both the EM algorithm and gradient ascent on the data. The initialization for each experiment is set randomly, but is the same for both the EM algorithm and the gradient algorithm. At each step of the simulation, we calculated the condition number of the Hessian (κ[H(Θ^{(k)})]), the condition number determining the rate of convergence of the gradient algorithm (κ[E^T H(Θ^{(k)})E]), and the condition number determining the rate of convergence of EM (κ[E^T P(Θ^{(k)})H(Θ^{(k)})E]). We also calculated the largest eigenvalues of the matrices H(Θ^{(k)}), E^T H(Θ^{(k)})E, and E^T P(Θ^{(k)})H(Θ^{(k)})E. The results are shown in Figure 1.

Figure 1a: Experimental results for the estimation of the parameters of a two-component gaussian mixture. (a) The condition numbers as a function of the iteration number.

As can be seen in Figure 1a, the condition numbers change rapidly in the vicinity of the 25th iteration. This is because the corresponding Hessian matrix is indefinite before the iteration enters the neighborhood of a solution. Afterward, the Hessians quickly become definite and the condition numbers converge.³ As shown in Figure 1b, the condition numbers converge toward the values κ[H(Θ^{(k)})] = 47.5, κ[E^T H(Θ^{(k)})E] = 33.5, and κ[E^T P(Θ^{(k)})H(Θ^{(k)})E] = 3.6. That is, the matrix P has greatly reduced the condition number, by factors of 13.2 and 9.3. This significantly improves the shape of l and speeds up the convergence.

³Interestingly, the EM algorithm converges soon afterward as well, showing that for this problem EM spends little time in the region of parameter space in which a local analysis is valid.
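A scaled-down version of this experiment can be sketched as follows (synthetic data and a hand-tuned step size for gradient ascent are assumptions of the sketch; it estimates only the means, with the mixing proportions and variances held fixed, so it does not reproduce the exact condition numbers reported above):

```python
import numpy as np

rng = np.random.default_rng(1)
# Data resembling the experiment: alpha and sigma fixed; only the means estimated.
alpha = np.array([0.717, 0.283])
x = np.concatenate([rng.normal(-2, 1, 717), rng.normal(2, 1, 283)])

def posteriors(x, m):
    comp = alpha * np.exp(-(x[:, None] - m)**2 / 2) / np.sqrt(2 * np.pi)
    return comp / comp.sum(axis=1, keepdims=True)

def run(update, m, iters=200):
    for _ in range(iters):
        m = update(m)
    return m

# EM update for the means: posterior-weighted averages, no step size needed.
em = lambda m: (posteriors(x, m) * x[:, None]).sum(0) / posteriors(x, m).sum(0)
# Gradient ascent on the means: dl/dm_j = sum_t h_j(t)(x_t - m_j) / sigma_j^2.
eta = 1e-3                                   # hand-tuned step size (assumption)
ga = lambda m: m + eta * (posteriors(x, m) * (x[:, None] - m)).sum(0)

m0 = np.array([-1.0, 1.0])
m_em, m_ga = run(em, m0), run(ga, m0)

assert np.allclose(m_em, m_ga, atol=0.05)    # same stationary point
assert abs(m_em[0] + 2) < 0.3 and abs(m_em[1] - 2) < 0.3
```

Both iterations reach the same stationary point near the generating means; the practical difference is that gradient ascent only works here because η was tuned against λ_M of the Hessian, whereas EM needs no tuning at all.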
Figure 1b: (b) A zoomed version of (a) after discarding the first 25 iterations. The terminology "original, constrained, and EM-equivalent Hessians" refers to the matrices H, E^T H E, and E^T P H E, respectively.
We ran a second experiment in which the means of the component gaussians were m_1 = −1 and m_2 = 1. The results are similar to those shown in Figure 1. Since the distance between the two distributions is reduced by half, the shape of l becomes more irregular (Fig. 2). The condition number κ[H(Θ^{(k)})] increases to 352, κ[E^T H(Θ^{(k)})E] increases to 216, and κ[E^T P(Θ^{(k)})H(Θ^{(k)})E] increases to 61. We see once again a significant improvement in the case of EM, by factors of 5.8 and 3.5. Figure 3 shows that the matrix P has also reduced the largest eigenvalues of the Hessian from between 2000 and 3000 to around 1. This demonstrates clearly the stable convergence that is obtained via EM, without a line search or the need for external selection of a learning step size. In the remainder of the paper we provide some theoretical analyses that attempt to shed some light on these empirical results. To illustrate the issues involved, consider a degenerate mixture problem in which
[Figure 2 plots the condition numbers against the learning steps; solid line: the original Hessian; dashed: the constrained Hessian; dash-dot: the EM-equivalent Hessian.]
Figure 2: Experimental results for the estimation of the parameters of a two-component gaussian mixture (cf. Fig. 1). The separation of the gaussians is half the separation in Figure 1.
the mixture has a single component. (In this case α_1 = 1.) Let us furthermore assume that the covariance matrix is fixed, i.e., only the mean vector m is to be estimated. The Hessian with respect to the mean m is H = −NΣ^{-1}, and the EM projection matrix is P = Σ/N. For gradient ascent, we have κ[E^T H E] = κ[Σ^{-1}], which is larger than one whenever Σ ≠ cI. EM, on the other hand, achieves a condition number of one exactly (κ[E^T P H E] = κ[PH] = κ[I] = 1 and λ_M[−E^T P H E] = 1). Thus, EM and Newton's method are the same for this simple quadratic problem. For general nonquadratic optimization problems, Newton's method retains the quadratic assumption, yielding fast convergence but possible divergence. EM is a more conservative algorithm that retains the convergence guarantee while also displaying quasi-Newton behavior. We now analyze this behavior in more detail. We consider the special case of estimating the means in a gaussian mixture when the gaussians are well separated.
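This degenerate case is easy to verify numerically; the sketch below uses a randomly generated covariance matrix (illustrative) and checks that PH = −I, so that EM's effective condition number is exactly one while the raw Hessian's generally is not:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 500, 3
# Random positive definite covariance (illustrative values).
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)

H = -N * np.linalg.inv(Sigma)   # Hessian with respect to the mean m
P = Sigma / N                   # EM projection matrix for the mean

PH = P @ H
assert np.allclose(PH, -np.eye(d))              # PH = -I: condition number one
assert np.isclose(np.linalg.cond(-PH), 1.0)     # kappa[PH] = kappa[I] = 1
assert np.linalg.cond(Sigma) > 1.0              # kappa[H] > 1 since Sigma != c*I
```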
Figure 2: Continued.
Theorem 2. Consider the EM algorithm in equation 2.3, where the parameters α_j and Σ_j are assumed to be known. Assume that the K gaussian distributions are well separated, such that for sufficiently large k the posterior probabilities h_j^{(k)}(t) are approximately zero or one. For such k, the condition number associated with EM is approximately one, which is smaller than the condition number associated with gradient ascent. That is,

κ[E^T P(Θ^{(k)}) H(Θ^{(k)}) E] ≈ 1   (5.6)

κ[E^T P(Θ^{(k)}) H(Θ^{(k)}) E] < κ[E^T H(Θ^{(k)}) E]   (5.7)

Furthermore, we also have

λ_M[−E^T P(Θ^{(k)}) H(Θ^{(k)}) E] ≈ 1   (5.8)
Figure 3: The largest eigenvalues of the matrices H, E^T H E, and E^T P H E plotted as a function of the number of iterations. The plot in (a) is for the experiment in Figure 1; (b) is for the experiment reported in Figure 2.
Proof. The (i, j)th block of the Hessian of l with respect to the means m_1, ..., m_K is

H_ij = −δ_ij (Σ_i^{(k)})^{-1} Σ_{t=1}^N h_i^{(k)}(t) + Σ_{t=1}^N (Σ_i^{(k)})^{-1} (x^{(t)} − m_i^{(k)}) (x^{(t)} − m_j^{(k)})^T (Σ_j^{(k)})^{-1} γ_{ij}(x^{(t)})   (5.9)
Figure 3: Continued.

with γ_{ij}(x^{(t)}) = [δ_{ij} − h_j^{(k)}(t)] h_i^{(k)}(t). The projection matrix P is

P^{(k)} = diag[P_{11}^{(k)}, ..., P_{KK}^{(k)}]
where

P_ii^{(k)} = Σ_i^{(k)} / Σ_{t=1}^N h_i^{(k)}(t)   (5.10)
Given that h_i^{(k)}(t)[1 − h_i^{(k)}(t)] is negligible for sufficiently large k [since the h_i^{(k)}(t) are approximately zero or one], the second term in equation 5.9 can be neglected, yielding H_ii ≈ −(Σ_i^{(k)})^{-1} Σ_{t=1}^N h_i^{(k)}(t) and H ≈ diag[H_11, ..., H_KK]. This implies that PH ≈ −I and E^T P H E ≈ −I, thus κ[E^T P H E] ≈ 1 and λ_M[−E^T P H E] ≈ 1, whereas usually κ[E^T H E] > 1. □

This theorem, although restrictive in its assumptions, gives some indication as to why the projection matrix in the EM algorithm appears to
condition the Hessian, yielding improved convergence. In fact, we conjecture that equations 5.7 and 5.8 can be extended to apply more widely, in particular to the case of the full EM update in which the mixing proportions and covariances are estimated, and also, within limits, to cases in which the means are not well separated. To obtain an initial indication as to possible conditions that can be usefully imposed on the separation of the mixture components, we have studied the case in which the second term in equation 5.9 is neglected only for H_ii and is retained for the H_ij components, where j ≠ i. Consider, for example, the case of a univariate mixture having two mixture components. For fixed mixing proportions and fixed covariances, the Hessian matrix (equation 5.9) becomes

H = [ h_11  h_12 ]
    [ h_21  h_22 ]

and the projection matrix (equation 5.10) becomes

P = [ −h_11^{-1}   0          ]
    [ 0            −h_22^{-1} ]

where the diagonal entries are given by (neglecting the second term of equation 5.9)

h_jj = −(1/σ_j²) Σ_{t=1}^N h_j^{(k)}(t),  for j = 1, 2

and the off-diagonal entries h_12 = h_21 are given by the retained second term of equation 5.9. If H is negative definite (i.e., h_11 < 0, h_22 < 0, and h_11 h_22 − h_12 h_21 > 0), then we can show that the conclusions of equation 5.7 remain true, even for gaussians that are not necessarily well separated. The proof is achieved via the following lemma:

Lemma 1. Consider the positive definite matrix

Σ = [ σ_11  σ_12 ]
    [ σ_21  σ_22 ]

For the diagonal matrix B = diag[σ_11^{-1}, σ_22^{-1}], we have κ[BΣ] < κ[Σ].
Proof. The eigenvalues of Σ are the roots of (σ_11 − λ)(σ_22 − λ) − σ_21σ_12 = 0, which gives

λ_M = (σ_11 + σ_22 + γ)/2
λ_m = (σ_11 + σ_22 − γ)/2

where γ = sqrt((σ_11 + σ_22)² − 4(σ_11σ_22 − σ_21σ_12)), and

κ[Σ] = (σ_11 + σ_22 + γ) / (σ_11 + σ_22 − γ)

The condition number κ[Σ] can be written as κ[Σ] = (1 + s)/(1 − s) = f(s), where s is defined as follows:

s = sqrt(1 − 4(σ_11σ_22 − σ_21σ_12)/(σ_11 + σ_22)²)

Furthermore, the eigenvalues of BΣ are the roots of (1 − λ)² − (σ_21σ_12)/(σ_11σ_22) = 0, which gives λ_M = 1 + r and λ_m = 1 − r, where r = sqrt((σ_21σ_12)/(σ_11σ_22)). Thus we have

κ[BΣ] = (1 + r)/(1 − r) = f(r)

We now examine the quotient s/r:

s/r = (1/r) sqrt(1 − 4(1 − r²) / ((σ_11 + σ_22)²/(σ_11σ_22)))

Given that (σ_11 + σ_22)²/(σ_11σ_22) ≥ 4, we have s/r ≥ (1/r) sqrt(1 − (1 − r²)) = 1, with strict inequality unless σ_11 = σ_22. That is, s > r. Since f(x) = (1 + x)/(1 − x) is a monotonically increasing function for 0 ≤ x < 1, we have f(s) > f(r). Therefore, κ[BΣ] < κ[Σ]. □
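The lemma can be spot-checked numerically on random positive definite 2×2 matrices (a sanity check under randomly generated matrices, not a substitute for the proof):

```python
import numpy as np

rng = np.random.default_rng(3)
# Check kappa[B Sigma] <= kappa[Sigma] for random 2x2 positive definite Sigma
# with B = diag(1/sigma_11, 1/sigma_22), as in Lemma 1.
for _ in range(1000):
    A = rng.normal(size=(2, 2))
    Sigma = A @ A.T + 0.1 * np.eye(2)            # random positive definite matrix
    B = np.diag(1.0 / np.diag(Sigma))
    kS = np.linalg.cond(Sigma)
    # B @ Sigma is similar to the symmetric matrix B^{1/2} Sigma B^{1/2},
    # so its eigenvalues are real and positive; use them directly.
    eig = np.linalg.eigvals(B @ Sigma).real
    kBS = eig.max() / eig.min()
    assert kBS <= kS * (1 + 1e-9)                # lemma (equality only if s_11 = s_22)
```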
We think that it should be possible to generalize this lemma beyond the univariate, two-component case, thereby weakening the conditions on separability in Theorem 2 in a more general setting.

6 Conclusions
In this paper we have provided a comparative analysis of algorithms for the learning of gaussian mixtures. We have focused on the EM algorithm and have forged a link between EM and gradient methods via the projection matrix P. We have also analyzed the convergence of EM in terms of properties of the matrix P and the effect that P has on the likelihood surface. EM has a number of properties that make it a particularly attractive algorithm for mixture models. It enjoys automatic satisfaction of probabilistic constraints, monotonic convergence without the need to set a learning rate, and low computational overhead. Although EM has the reputation of being a slow algorithm, we feel that in the mixture setting the slowness of EM has been overstated. Although EM can indeed converge slowly for problems in which the mixture components are not well separated, the Hessian is poorly conditioned for such problems and thus other gradient-based algorithms (including Newton's method) are also likely to perform poorly. Moreover, if one's concern is convergence in likelihood, then EM generally performs well even for these ill-conditioned problems. Indeed the algorithm provides a certain amount
of safety in such cases, despite the poor conditioning. It is also important to emphasize that the case of poorly separated mixture components can be viewed as a problem in model selection (too many mixture components are being included in the model), and should be handled by regularization techniques. The fact that EM is a first-order algorithm certainly implies that EM is no panacea, but does not imply that EM has no advantages over gradient ascent or superlinear methods. First, it is important to appreciate that convergence rate results are generally obtained for unconstrained optimization, and are not necessarily indicative of performance on constrained optimization problems. Also, as we have demonstrated, there are conditions under which the condition number of the effective Hessian of the EM algorithm tends toward one, showing that EM can approximate a superlinear method. Finally, in cases of a poorly conditioned Hessian, superlinear convergence is not necessarily a virtue. In such cases many optimization schemes, including EM, essentially revert to gradient ascent. We feel that EM will continue to play an important role in the development of learning systems that emphasize the predictive aspect of data modeling. EM has indeed played a critical role in the development of hidden Markov models (HMMs), an important example of predictive data modeling.⁴ EM generally converges rapidly in this setting. Similarly, in the case of hierarchical mixtures of experts the empirical results on convergence in likelihood have been quite promising (Jordan and Jacobs 1994; Waterhouse and Robinson 1994). Finally, EM can play an important conceptual role as an organizing principle in the design of learning algorithms. Its role in this case is to focus attention on the "missing variables" in the problem. This clarifies the structure of the algorithm and invites comparisons with statistical physics, where missing variables often provide a powerful analytic tool (Yuille et al. 1994).
Appendix: Proof of Theorem 1

1. We begin by considering the EM update for the mixing proportions α. From equations 2.1 and 2.2, we have

∂l/∂α^{(k)} = Σ_{t=1}^N [h_1^{(k)}(t)/α_1^{(k)}, ..., h_K^{(k)}(t)/α_K^{(k)}]^T
⁴In most applications of HMMs, the "parameter estimation" process is employed solely to yield models with high likelihood; the parameters are not generally endowed with a particular meaning.
148
Lei Xu and Michael Jordan
Premultiplying by P_α^{(k)}, we obtain

P_α^{(k)} ∂l/∂α^{(k)} = (1/N) Σ_{t=1}^N [h_1^{(k)}(t), ..., h_K^{(k)}(t)]^T − α^{(k)}
The update formula for α in equation 2.3 can be rewritten as

α^{(k+1)} = (1/N) Σ_{t=1}^N [h_1^{(k)}(t), ..., h_K^{(k)}(t)]^T
Combining the last two equations establishes the update rule for α (equation 2.4). Furthermore, for an arbitrary vector u, we have

N u^T P_α^{(k)} u = u^T diag[α_1^{(k)}, ..., α_K^{(k)}] u − (u^T α^{(k)})²

By Jensen's inequality we have

u^T diag[α_1^{(k)}, ..., α_K^{(k)}] u = Σ_j α_j^{(k)} u_j² > (Σ_j α_j^{(k)} u_j)² = (u^T α^{(k)})²

Thus, u^T P_α^{(k)} u > 0 and P_α^{(k)} is positive definite given the constraints Σ_{j=1}^K α_j^{(k)} = 1 and α_j^{(k)} ≥ 0 for all j.

2. We now consider the EM update for the means m_j. It follows from equations 2.1 and 2.2 that

∂l/∂m_j^{(k)} = (Σ_j^{(k)})^{-1} Σ_{t=1}^N h_j^{(k)}(t) (x^{(t)} − m_j^{(k)})
Premultiplying by P_{m_j}^{(k)} yields

P_{m_j}^{(k)} ∂l/∂m_j^{(k)} = (Σ_{t=1}^N h_j^{(k)}(t))^{-1} Σ_{t=1}^N h_j^{(k)}(t) x^{(t)} − m_j^{(k)} = m_j^{(k+1)} − m_j^{(k)}

From equation 2.3, we have Σ_{t=1}^N h_j^{(k)}(t) > 0; moreover, Σ_j^{(k)} is positive definite with probability one assuming that N is large enough such that
the matrix is of full rank. Thus, it follows from equation 3.5 that P_{m_j}^{(k)} is positive definite with probability one.

3. Finally, we prove the third part of the theorem. It follows from equations 2.1 and 2.2 that

∂l/∂Σ_j^{(k)} = (1/2) Σ_{t=1}^N h_j^{(k)}(t) [ (Σ_j^{(k)})^{-1} (x^{(t)} − m_j^{(k)}) (x^{(t)} − m_j^{(k)})^T (Σ_j^{(k)})^{-1} − (Σ_j^{(k)})^{-1} ]
With this in mind, we rewrite the EM update formula for Σ_j^{(k+1)} as

Σ_j^{(k+1)} = Σ_j^{(k)} + (2 / Σ_{t=1}^N h_j^{(k)}(t)) Σ_j^{(k)} (∂l/∂Σ_j^{(k)}) Σ_j^{(k)}

Utilizing the identity vec[ABC] = (C^T ⊗ A) vec[B], we obtain

vec[Σ_j^{(k+1)}] = vec[Σ_j^{(k)}] + (2 / Σ_{t=1}^N h_j^{(k)}(t)) (Σ_j^{(k)} ⊗ Σ_j^{(k)}) vec[∂l/∂Σ_j^{(k)}]

Thus

P_{Σ_j}^{(k)} = (2 / Σ_{t=1}^N h_j^{(k)}(t)) (Σ_j^{(k)} ⊗ Σ_j^{(k)})

Moreover, for an arbitrary nonzero matrix U, we have

vec[U]^T (Σ_j^{(k)} ⊗ Σ_j^{(k)}) vec[U] = tr(U^T Σ_j^{(k)} U Σ_j^{(k)}) = vec[(Σ_j^{(k)})^{1/2} U (Σ_j^{(k)})^{1/2}]^T vec[(Σ_j^{(k)})^{1/2} U (Σ_j^{(k)})^{1/2}] > 0

since (Σ_j^{(k)})^{1/2} U (Σ_j^{(k)})^{1/2} = 0 only if U = 0, and Σ_j^{(k)} is positive definite with probability one when N is sufficiently large. Thus it follows from equation 3.6 and Σ_{t=1}^N h_j^{(k)}(t) > 0 that P_{Σ_j}^{(k)} is positive definite with probability one. □
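The Jensen's-inequality step in part 1 can be checked numerically. Note that the quadratic form N u^T P_α u = Σ_j α_j u_j² − (Σ_j α_j u_j)² vanishes only in the direction of a constant vector u, which is not a feasible direction under the constraint Σ_j α_j = 1; the sketch below (illustrative random draws) verifies both facts:

```python
import numpy as np

rng = np.random.default_rng(4)
# Check: N u^T P_alpha u = sum_j alpha_j u_j^2 - (sum_j alpha_j u_j)^2 >= 0,
# with equality exactly when all components of u are equal.
for _ in range(1000):
    alpha = rng.dirichlet(np.ones(5))                  # mixing proportions on simplex
    P = np.diag(alpha) - np.outer(alpha, alpha)        # equals N * P_alpha
    u = rng.normal(size=5)
    q = u @ P @ u
    assert q >= -1e-12                                 # positive semidefinite
    if not np.allclose(u, u.mean()):
        assert q > 0                                   # strict for non-constant u

# The quadratic form vanishes on a constant vector (Jensen equality case).
alpha = rng.dirichlet(np.ones(5))
P = np.diag(alpha) - np.outer(alpha, alpha)
u_const = np.full(5, 3.0)
assert abs(u_const @ P @ u_const) < 1e-12
```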
Acknowledgments

This project was supported in part by a Ho Sin-Hang Education Endowment Foundation Grant and by the HK RGC Earmarked Grant CUHK250/94E, by a grant from the McDonnell-Pew Foundation, by a grant from ATR Human Information Processing Research Laboratories, by a grant from Siemens Corporation, by Grant IRI-9013991 from the National Science Foundation, and by Grant N00014-90-J-1942 from the Office of Naval Research. Michael I. Jordan is an NSF Presidential Young Investigator.

References

Amari, S. 1995. Information geometry of the EM and em algorithms for neural networks. Neural Networks 8(5) (in press).
Baum, L. E., and Sell, G. R. 1968. Growth transformations for functions on manifolds. Pac. J. Math. 27, 211-227.
Bengio, Y., and Frasconi, P. 1995. An input-output HMM architecture. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and J. Alspector, eds. MIT Press, Cambridge, MA.
Boyles, R. A. 1983. On the convergence of the EM algorithm. J. Royal Stat. Soc. B45(1), 47-50.
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. B39, 1-38.
Ghahramani, Z., and Jordan, M. I. 1994. Function approximation via density estimation using the EM approach. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 120-127. Morgan Kaufmann, San Mateo, CA.
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Comp. 6, 181-214.
Jordan, M. I., and Xu, L. 1995. Convergence results for the EM approach to mixtures-of-experts architectures. Neural Networks (in press).
Levinson, S. E., Rabiner, L. R., and Sondhi, M. M. 1983. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell Syst. Tech. J. 62, 1035-1072.
Neal, R. N., and Hinton, G. E. 1993. A New View of the EM Algorithm that Justifies Incremental and Other Variants. University of Toronto, Department of Computer Science preprint.
Nowlan, S. J. 1991. Soft Competitive Adaptation: Neural Network Learning Algorithms Based on Fitting Statistical Mixtures. Tech. Rep. CMU-CS-91-126, CMU, Pittsburgh, PA.
Redner, R. A., and Walker, H. F. 1984. Mixture densities, maximum likelihood, and the EM algorithm. SIAM Rev. 26, 195-239.
Titterington, D. M. 1984. Recursive parameter estimation using incomplete data. J. Royal Stat. Soc. B46, 257-267.
Tresp, V., Ahmad, S., and Neuneier, R. 1994. Training neural networks with deficient data. In Advances in Neural Information Processing Systems 6, J. D.
EM Algorithm for Gaussian Mixtures
151
Cowan, G. Tesauro, and J. Alspector, eds. Morgan Kaufmann, San Mateo, CA.
Waterhouse, S. R., and Robinson, A. J. 1994. Classification using hierarchical mixtures of experts. Proc. IEEE Workshop on Neural Networks for Signal Processing, pp. 177-186.
Wu, C. F. J. 1983. On the convergence properties of the EM algorithm. Ann. Stat. 11, 95-103.
Xu, L., and Jordan, M. I. 1993a. Unsupervised learning by EM algorithm based on finite mixture of Gaussians. Proc. WCNN'93, Portland, OR, II, 431-434.
Xu, L., and Jordan, M. I. 1993b. EM learning on a generalized finite mixture model for combining multiple classifiers. Proc. WCNN'93, Portland, OR, IV, 227-230.
Xu, L., and Jordan, M. I. 1993c. Theoretical and Experimental Studies of the EM Algorithm for Unsupervised Learning Based on Finite Gaussian Mixtures. MIT Computational Cognitive Science Tech. Rep. 9302, Dept. of Brain and Cognitive Science, MIT, Cambridge, MA.
Xu, L., Jordan, M. I., and Hinton, G. E. 1994. A modified gating network for the mixtures of experts architecture. Proc. WCNN'94, San Diego, 2, 405-410.
Yuille, A. L., Stolorz, P., and Utans, J. 1994. Statistical physics, mixtures of distributions and the EM algorithm. Neural Comp. 6, 334-340.
Received November 17, 1994; accepted March 28, 1995.
Asymptotic Convergence Rate of the EM Algorithm for Gaussian MixturesAsymptotic Convergence Rate of the EM Algorithm for Gaussian Mixtures. Neural Computation 12:12, 2881-2907. [Abstract] [PDF] [PDF Plus] 40. Dirk Husmeier . 2000. The Bayesian Evidence Scheme for Regularizing Probability-Density Estimating Neural NetworksThe Bayesian Evidence Scheme for Regularizing Probability-Density Estimating Neural Networks. Neural Computation 12:11, 2685-2717. [Abstract] [PDF] [PDF Plus] 41. Ashish Singhal, Dale E. Seborg. 2000. Dynamic data rectification using the expectation maximization algorithm. AIChE Journal 46:8, 1556-1565. [CrossRef] 42. P. Hedelin, J. Skoglund. 2000. Vector quantization based on Gaussian mixture models. IEEE Transactions on Speech and Audio Processing 8:4, 385-401. [CrossRef]
43. Man-Wai Mak, Sun-Yuan Kung. 2000. Estimation of elliptical basis function parameters by the EM algorithm with application to speaker verification. IEEE Transactions on Neural Networks 11:4, 961-969. [CrossRef] 44. M. Zwolinski, Z.R. Yang, T.J. Kazmierski. 2000. Using robust adaptive mixing for statistical fault macromodelling. IEE Proceedings - Circuits, Devices and Systems 147:5, 265. [CrossRef] 45. Athanasios Kehagias , Vassilios Petridis . 1997. Time-Series Segmentation Using Predictive Modular Neural NetworksTime-Series Segmentation Using Predictive Modular Neural Networks. Neural Computation 9:8, 1691-1709. [Abstract] [PDF] [PDF Plus] 46. A.V. Rao, D. Miller, K. Rose, A. Gersho. 1997. Mixture of experts regression modeling by deterministic annealing. IEEE Transactions on Signal Processing 45:11, 2811-2820. [CrossRef] 47. James R. Williamson . 1997. A Constructive, Incremental-Learning Network for Mixture Modeling and ClassificationA Constructive, Incremental-Learning Network for Mixture Modeling and Classification. Neural Computation 9:7, 1517-1543. [Abstract] [PDF] [PDF Plus] 48. John J. Kasianowicz, Sarah E. Henrickson, Jeffery C. Lerman, Martin Misakian, Rekha G. Panchal, Tam Nguyen, Rick Gussio, Kelly M. Halverson, Sina Bavari, Devanand K. Shenoy, Vincent M. StanfordThe Detection and Characterization of Ions, DNA, and Proteins Using Nanometer-Scale Pores . [CrossRef]
Communicated by Steve Nowlan and Richard Lippmann
A Comparison of Some Error Estimates for Neural Network Models

Robert Tibshirani
Department of Preventive Medicine and Biostatistics and Department of Statistics, University of Toronto, Toronto, Ontario, Canada
We discuss a number of methods for estimating the standard error of predicted values from a multilayer perceptron. These methods include the delta method based on the Hessian, bootstrap estimators, and the "sandwich" estimator. The methods are described and compared in a number of examples. We find that the bootstrap methods perform best, partly because they capture variability due to the choice of starting weights.

1 Introduction

We consider a multilayer perceptron with one hidden layer and a linear output layer. See Lippmann (1989), Hinton (1989), and Hertz et al. (1991) for details and references. A perceptron is a nonlinear model for predicting a response y based on p measurements of predictors (or input patterns or features) x_1, x_2, ..., x_p. For convenience we assume that x_1 = 1. The model with H hidden units has the form

    y = \phi_0\Big( w_0 + \sum_{h=1}^{H} w_h \, \phi(x^T \beta_h) \Big) + \epsilon    (1.1)

where the errors \epsilon have mean zero, variance \sigma^2, and are independent across training cases. Since y is a continuous response variable, we take the output function \phi_0 to be the identity. The standard choice for the hidden layer output function \phi is the sigmoid

    \phi(x) = \frac{1}{1 + \exp(-x)}    (1.2)

Our training sample has n observations (x_1, y_1), ..., (x_n, y_n). Denote the ensemble of parameters (weights) by \theta = (w_0, w_1, ..., w_H, \beta_1, ..., \beta_H) and let y(x_i; \theta) be the predicted value for input x_i and parameter \theta. The total number of parameters is p \cdot H + 1 + H.
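As a concrete illustration (a minimal sketch, not code from the paper), the model of equation 1.1 with identity output function can be written as:

```python
import numpy as np

def mlp_predict(x, w, B):
    """One-hidden-layer perceptron of equation 1.1 with identity output:
    y(x; theta) = w[0] + sum_h w[h] * sigmoid(B[h] . x).
    w holds the H+1 output weights w_0..w_H; B is the H x p matrix of
    hidden-unit weight vectors beta_1..beta_H (x[0] = 1 by convention)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))   # equation 1.2
    return w[0] + w[1:] @ sigmoid(B @ x)

# with all hidden weights zero, sigmoid(0) = 0.5, so y = w_0 + w_1 * 0.5
x = np.array([1.0, 2.0])                 # p = 2, x[0] = 1 carries the intercept
print(mlp_predict(x, np.array([0.0, 1.0]), np.zeros((1, 2))))  # -> 0.5
```

The parameter count, H + 1 output weights plus H·p hidden weights, matches the p·H + 1 + H of the text.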
Neural Computation 8, 152-163 (1996) © 1995 Massachusetts Institute of Technology
Estimation of \theta is usually carried out by minimization of \sum_i [y_i - y(x_i; \theta)]^2, with either early stopping or some form of regularization to prevent overfitting. Commonly used optimization techniques include backpropagation (gradient descent), conjugate gradients, and quasi-Newton (variable metric) methods. Since the dimension of \theta is usually quite large, search techniques requiring computation of the Hessian are usually impractical.

In this paper we focus on the problem of estimation of the standard error of the predicted values y(x_i; \hat\theta). A reference for these techniques is Efron and Tibshirani (1993), especially Chapter 21. One approach is through likelihood theory. If we assume that the errors in model 1.1 are distributed as N(0, \sigma^2), then the log-likelihood is

    \ell(\theta) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} [y_i - y(x_i; \theta)]^2 - \frac{n}{2} \log \sigma^2    (1.3)

We eliminate \sigma^2 by replacing it by \hat\sigma^2 = \sum_{i=1}^{n} [y_i - y(x_i; \hat\theta)]^2 / n in \ell(\theta). The first and second derivatives have the form

    \frac{\partial \ell}{\partial \theta_k} = \frac{1}{\hat\sigma^2} \sum_{i=1}^{n} [y_i - y(x_i; \theta)] \frac{\partial y(x_i; \theta)}{\partial \theta_k}, \qquad
    \frac{\partial^2 \ell}{\partial \theta_k \partial \theta_l} = \frac{1}{\hat\sigma^2} \sum_{i=1}^{n} \Big\{ [y_i - y(x_i; \theta)] \frac{\partial^2 y}{\partial \theta_k \partial \theta_l} - \frac{\partial y}{\partial \theta_k} \frac{\partial y}{\partial \theta_l} \Big\}    (1.4)

The exact form of these derivatives is simple to derive for a neural network, and we do not give them here. Because of the structure of the network, the only nonzero second derivative terms are those of the form \partial^2 y / \partial\beta_{hj}\partial\beta_{hj'} and \partial^2 y / \partial w_h \partial\beta_{hj}, and there are a total of H \cdot p^2 + H \cdot p such terms. Buntine and Weigend (1994) describe efficient methods for computing the Hessian. Let I equal -\partial^2 \ell / \partial\theta_k \partial\theta_l evaluated at \theta = \hat\theta (the negative Hessian or "observed information" matrix), and let g_i = \partial y(x_i; \theta) / \partial\theta evaluated at \hat\theta. Then using a Taylor series approximation we obtain

    \widehat{se}[y(x_i; \hat\theta)] \approx [g_i^T \, I^{-1} \, g_i]^{1/2}    (1.5)

This is often called the delta method estimate of standard error (see Efron and Tibshirani 1993, Chapter 21). For computational simplicity, we can leave out the terms in 1.4 involving second derivatives. These are often small because the multipliers y_i - y(x_i; \hat\theta) tend to be small. We will denote the resulting approximate information matrix by \tilde{I}. With weight decay induced by a penalty term \lambda \sum_k \theta_k^2, it might be preferable to use the Hessian of the regularized log-likelihood \ell(\theta) - \lambda \sum_k \theta_k^2. This simply replaces I by I + 2\lambda (added to the diagonal) in formula 1.5, and will tend to reduce the delta method standard error estimates. This is the approach taken in MacKay (1992).
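A numerical sketch of the delta-method computation of formula 1.5 (illustrative code, not the paper's implementation), together with the sandwich variant introduced in Section 2, which substitutes a different middle matrix. Both use the approximate information that drops second-derivative terms:

```python
import numpy as np

def delta_se(G, sigma2):
    """Delta-method SEs of predicted values (equation 1.5), using the
    approximate information I~ = G^T G / sigma^2 (second derivatives dropped).
    G is the n x q matrix whose ith row is g_i = dy(x_i; theta)/dtheta."""
    I_inv = np.linalg.inv(G.T @ G / sigma2)
    return np.sqrt(np.einsum("iq,qr,ir->i", G, I_inv, G))

def sandwich_se(G, resid, sigma2):
    """Sandwich SEs (Section 2): substitute
    V_sand = I^{-1} (sum_i s_i s_i^T) I^{-1} for I^{-1} in equation 1.5,
    with per-observation scores s_i = resid_i * g_i / sigma^2."""
    I_inv = np.linalg.inv(G.T @ G / sigma2)
    S = G * (resid / sigma2)[:, None]      # rows are the scores s_i
    V_sand = I_inv @ (S.T @ S) @ I_inv
    return np.sqrt(np.einsum("iq,qr,ir->i", G, V_sand, G))
```

For a linear model the gradient matrix G is just the design matrix X, so delta_se reduces to the classical formula [x_i^T \sigma^2 (X^T X)^{-1} x_i]^{1/2} quoted in Section 3, and sandwich_se reduces to the familiar heteroskedasticity-consistent estimator.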
2 The Sandwich Estimator
Like the information-based approach, the sandwich estimator has a closed form. Unlike the information, however, its derivation does not rely on model correctness, and hence it can potentially perform well under model misspecification. Let s_i = (s_{i1}, s_{i2}, ...) be the gradient vector of \ell for the ith observation:

    s_{ik} = \frac{\partial \ell_i}{\partial \theta_k} \ \text{evaluated at } \hat\theta    (2.1)

Then the sandwich estimator of the variance of \hat\theta is defined by

    V_{sand} = I^{-1} \Big( \sum_{i=1}^{n} s_i s_i^T \Big) I^{-1}    (2.2)

To estimate the standard error of y(x_i; \hat\theta), we substitute V_{sand} for I^{-1} in equation 1.5:

    \widehat{se}_{sand}[y(x_i; \hat\theta)] = [g_i^T \, V_{sand} \, g_i]^{1/2}    (2.3)

Note that \sum_i s_i s_i^T estimates E[s s^T]. The idea behind the sandwich estimator is the following. If the model is specified correctly,

    E[s s^T] = E(I)    (2.4)

Therefore V_{sand} \approx I^{-1} E(I) I^{-1} \approx I^{-1} if the model is correct. Suppose, however, that the expected value of y is modeled correctly but the errors have different variances. Then the sandwich estimator still provides a consistent estimate of variance, but equation 2.4 does not hold and hence the inverse information is not consistent. Details may be found in Kent (1982) and Efron and Tibshirani (1993, Chapter 21).

3 Bootstrap Methods
A different approach to error estimation is based on the bootstrap. It works by creating many pseudoreplicates ("bootstrap samples") of the training set and then reestimating \theta on each bootstrap sample. There are two different ways of bootstrapping in regression settings. One can consider each training case as a sampling unit, and sample with replacement from the training set cases to create a bootstrap sample. This is often called the "bootstrap pairs" approach. On the other hand, one can consider the predictors as fixed, treat the model residuals y_i - \hat{y}_i as the sampling units, and create a bootstrap sample by adding resampled residuals to the model fits \hat{y}_i. This is called the "bootstrap residual" approach. The details are given below:
Bootstrap pairs sampling algorithm

1. Generate B samples, each one of size n drawn with replacement from the n training observations {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}. Denote the bth sample by {(x_1^{*b}, y_1^{*b}), (x_2^{*b}, y_2^{*b}), ..., (x_n^{*b}, y_n^{*b})}.
2. For each bootstrap sample b = 1, ..., B, minimize \sum_{i=1}^{n} [y_i^{*b} - y(x_i^{*b}; \theta)]^2, giving \hat\theta^{*b}.
3. Estimate the standard error of the ith predicted value by

    \widehat{se}_B[y(x_i)] = \Big\{ \frac{1}{B-1} \sum_{b=1}^{B} [y(x_i; \hat\theta^{*b}) - y(x_i; \cdot)]^2 \Big\}^{1/2}

where y(x_i; \cdot) = \sum_{b=1}^{B} y(x_i; \hat\theta^{*b}) / B.

Bootstrap residual sampling algorithm

1. Estimate \hat\theta from the training sample and let r_i = y_i - y(x_i; \hat\theta), i = 1, 2, ..., n.
2. Generate B samples, each one of size n drawn with replacement from r_1, r_2, ..., r_n. Denote the bth sample by r_1^{*b}, r_2^{*b}, ..., r_n^{*b}, and let y_i^{*b} = y(x_i; \hat\theta) + r_i^{*b}.
3. For each bootstrap sample b = 1, ..., B, minimize \sum_{i=1}^{n} [y_i^{*b} - y(x_i; \theta)]^2, giving \hat\theta^{*b}.
4. Estimate the standard error of the ith predicted value by

    \widehat{se}_B[y(x_i)] = \Big\{ \frac{1}{B-1} \sum_{b=1}^{B} [y(x_i; \hat\theta^{*b}) - y(x_i; \cdot)]^2 \Big\}^{1/2}

where y(x_i; \cdot) = \sum_{b=1}^{B} y(x_i; \hat\theta^{*b}) / B.
Note that each method requires refitting of the model (retraining the network) B times. Typically B is in the range 20 \le B \le 200. In simple linear least squares regression, it can be shown that the information-based estimate (equation 1.5) and the bootstrap residual sampling estimate (as B \to \infty) both agree with the standard least squares formula [x_i^T (X^T X)^{-1} x_i]^{1/2} \hat\sigma, with X denoting the design matrix having rows x_i. How do the two bootstrap approaches compare? The bootstrap residual procedure is model-based, and relies on the fact that the residuals y_i - \hat{y}_i are representative of the true model errors. If the model is either misspecified or overfit, the bootstrap pairs approach is more robust. On the other hand, the bootstrap pairs approach results in a different set of predictor values in each bootstrap sample, and in some settings this may be inappropriate. In some situations the set of predictor values is chosen by design, and we wish to condition on those values in our inference procedure. Such situations are fairly common in statistics (design of experiments) but probably less common in applications of neural networks.
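The two resampling schemes above can be sketched generically (illustrative code, not the paper's S implementation; the `fit` and `predict` callables stand in for training and evaluating the network):

```python
import numpy as np

def bootstrap_se(X, y, fit, predict, B=20, scheme="pairs", seed=0):
    """Bootstrap standard errors of the n predicted values.
    scheme="pairs": resample training cases (x_i, y_i) with replacement.
    scheme="residuals": keep X fixed, resample r_i = y_i - yhat_i and add
    them back to the fitted values before refitting."""
    rng = np.random.default_rng(seed)
    n = len(y)
    if scheme == "residuals":
        yhat = predict(X, fit(X, y))
        r = y - yhat
    preds = []
    for _ in range(B):                      # the model is refit B times
        idx = rng.integers(0, n, size=n)    # resampling indices
        if scheme == "residuals":
            Xb, yb = X, yhat + r[idx]
        else:
            Xb, yb = X[idx], y[idx]
        preds.append(predict(X, fit(Xb, yb)))
    # sample standard deviation over the B refits (step 3 / step 4)
    return np.array(preds).std(axis=0, ddof=1)

# usage with a linear least-squares model standing in for the network
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda X, theta: X @ theta
```

Note how `scheme="residuals"` conditions on the observed design X, while `scheme="pairs"` varies it, mirroring the fixed-by-design versus resampled-predictors distinction discussed above.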
4 Examples

In the following examples we compare a number of different estimates of the standard error of predicted values. The methods are as follows:

1. Delta: the delta method (equation 1.5).
2. Delta1: the approximate delta method, using the approximate information matrix \tilde{I} that ignores second derivatives.
3. Delta2: the delta method, adding the term 2\lambda (from the regularization penalty) to the diagonal of the Hessian.
4. Sand: the sandwich estimator (equation 2.3).
5. Sand1: the approximate sandwich estimator that uses \tilde{I} in place of I in equations 2.2 and 2.3.
6. Bootp: bootstrapping pairs.
7. Bootr: bootstrapping residuals.
We used Brian Ripley's "nnet" S-language function for the fitting, which uses the BFGS variable metric optimizer, with the weight decay parameter set at 0.01. The optimizer is based on the Pascal code given in Nash (1979). Only B = 20 bootstrap replications were used. This is a lower limit on the number required in most bootstrap applications, but perhaps a reasonable number when fitting a complicated model like a neural network. In the simulation studies (Examples 2-5), we carried out 25 simulations of each experiment.

4.1 Example 1: Air Pollution Data. In this first example we illustrate the preceding techniques on 111 observations on air pollution, taken from Chambers and Hastie (1991). The goal is to predict ozone concentration from radiation, temperature, and wind speed. We fit a multilayer perceptron with one hidden layer of 3 hidden units, and a linear output layer. The various estimates of standard error, at five randomly chosen feature vectors, are shown in Table 1. Notice that the larger standard errors are given by the bootstrap methods in four of the five cases. As we will see in the simulations below, this is partly because the bootstrap captures the variability due to the choice of random starting weights. In this example, repeated training of the neural network with different starting weights resulted in an average standard error of 0.07 for the predicted values.

One potential source of bias in the delta method estimate is our use of the maximum likelihood estimate for \sigma^2, namely \hat\sigma^2 = \sum_{i=1}^{n} [y_i - y(x_i; \hat\theta)]^2 / n. We could instead use an unbiased estimate of the form \tilde\sigma^2 = \sum_{i=1}^{n} [y_i - y(x_i; \hat\theta)]^2 / (n - k), where k is an estimate of the number of effective parameters used by the network. However, in this example an upper bound for k is 4 x 3 + 3 + 1 = 16, and hence \hat\sigma increases only by a factor of (111/95)^{1/2} \approx 1.08.

There is more information in the bootstrap process besides the estimated standard errors. Figure 1 shows boxplots of the predicted values
Table 1: Results for Example 1 (Air Pollution Data): Standard Error Estimates at Five Randomly Chosen Feature Vectors.

                       Point
Method     1      2      3      4      5
Delta    0.15   0.13   0.24   0.38   0.20
Delta1   0.13   0.12   0.16   0.12   0.17
Delta2   0.13   0.12   0.17   0.24   0.17
Sand     0.14   0.12   0.32   0.29   0.25
Sand1    0.10   0.11   0.21   0.12   0.20
Bootp    0.28   0.26   0.56   0.23   0.25
Bootr    0.19   0.24   0.24   0.23   0.24
at each of the five feature vectors. Each boxplot contains values from 50 bootstrap simulations. Notice for example point 3 in the bottom plot. Its predicted values are skewed upward, and so we are less sure about the upper range of the prediction than the lower range.

4.2 Example 2: Fixed-X Sampling. In this example we define x_1, x_2, x_3, x_4 to be multivariate gaussian with mean zero, variance 1, and pairwise correlation 0.5. This predictor set was generated once and then fixed for all 25 of the simulations. We generated y as a sigmoidal function of these predictors plus an error \epsilon that is gaussian with mean zero and standard deviation 0.7. This gave a signal-to-noise ratio of roughly 1.2. There were 100 observations in each training set. Note that this function could be modeled exactly by a sigmoid net with two hidden nodes and a linear output node.

The results are shown in Table 2. In the left half of the table, a perceptron with one hidden layer of 2 hidden units and a linear output was fit. In the right half, the perceptron had only one hidden unit in the hidden layer. Let \hat{s}_{ik} be the estimated standard deviation of y(x_i; \hat\theta) for the kth simulated sample. Then we define se_k = median_i(\hat{s}_{ik}), the median over the training cases of the estimated standard deviation of \hat{y}_i, for the kth simulated sample. Let s_i be the actual standard deviation of y(x_i; \hat\theta). The actual value of the median standard deviation med(s_i) is 0.86, as estimated over the 25 simulations. To measure the absolute error of the estimate over each of the training cases, we define e_k = median_i |s_i - \hat{s}_{ik}|.

In the left half of the table the two bootstrap methods are clearly superior to the other methods. The "Random weight se" of 0.39 is the standard error due solely to the choice of starting weights, estimated by fixing the data and retraining with different initial weights. This component of variance is missed by the first four methods. In the right half
[Figure 1: Boxplots of bootstrap replications for each of five randomly chosen feature vectors, from Example 1: bootstrap residuals (top) and bootstrap pairs (bottom). The bold dot in each box indicates the median, while the lower and upper edges are the 25th and 75th percentiles. The broken lines are the hinges, beyond which points are considered to be outliers.]
of the table, all of the estimates have broadly similar average values of e_k. Surprisingly, the bootstrap residual method is closest on the average to the actual se, closer than the bootstrap pairs approach. This may be because the bootstrap pairs method varies the X values and hence inflates the variance compared to the fixed-X sampling variance.

4.3 Example 3: Random-X Sampling. The setup in this example is the same as in the last one, except that a new set of predictor values was generated for each simulation. The predictions were done at a fixed
Table 2: Results for Example 2 (Fixed-X Sampling).*

                     Correct model              Incorrect model
                     (2 hidden units)           (1 hidden unit)
Method             Mean (SD)    Mean (SD)     Mean (SD)    Mean (SD)
                   of se_k      of e_k        of se_k      of e_k
Delta              0.39(.08)    0.39(.06)     0.41(.09)    0.15(.07)
Delta1             0.35(.07)    0.42(.07)     0.43(.09)    0.15(.08)
Delta2             0.36(.09)    0.41(.05)     0.41(.09)    0.15(.08)
Sand               0.40(.06)    0.41(.04)     0.41(.08)    0.17(.05)
Sand1              0.39(.09)    0.42(.08)     0.47(.14)    0.18(.06)
Bootp              0.93(.09)    0.17(.05)     0.72(.14)    0.22(.09)
Bootr              0.74(.07)    0.15(.04)     0.58(.06)    0.16(.04)
Actual SE          0.86(-)      -             0.56(-)      -
Random weight se   0.39(-)      -             0.38(-)      -

*See text for details.
Table 3: Results for Example 3 (Random-X Sampling).*

                     Correct model              Incorrect model
                     (2 hidden units)           (1 hidden unit)
Method             Mean (SD)    Mean (SD)     Mean (SD)    Mean (SD)
                   of se_k      of e_k        of se_k      of e_k
Delta              0.38(.08)    0.45(.05)     0.38(.07)    0.36(.09)
Delta1             0.35(.11)    0.48(.09)     0.34(.14)    0.38(.15)
Delta2             0.34(.08)    0.47(.09)     0.33(.14)    0.38(.15)
Sand               0.39(.08)    0.47(.06)     0.39(.08)    0.34(.07)
Sand1              0.46(.17)    0.45(.08)     0.49(.24)    0.40(.13)
Bootp              1.05(.11)    0.26(.06)     0.73(.12)    0.21(.06)
Bootr              0.81(.09)    0.22(.05)     0.53(.14)    0.24(.04)
Actual SE          0.87(-)      -             0.76(-)      -
Random weight se   0.39(-)      -             0.38(-)      -

*See text for details.
set of predictor values, however, to allow pooling across simulations. The bootstrap methods perform the best again: surprisingly, the bootstrap pairs method only does best in the ”incorrect model” case. This
Table 4: Results for Example 4 (Overfitting).*

Method             Mean (SD) of se_k    Mean (SD) of e_k
Bootp              2.2(.13)             0.61(.09)
Bootr              1.23(.06)            0.62(.07)
Actual SE          1.84(-)              -
Random weight SE   0.52(-)              -

*See text for details.
is surprising because its resampling of the predictors matches the actual simulation sampling used in the example.
4.4 Example 4: Overfitting. In this example, the setup is the same as in the left hand side of Table 3, except that the neural net was trained with 7 hidden units and no weight decay. Thus the model has 5 more units than is necessary, and with no weight decay, should overfit the training data. The results of the simulation experiment are shown in Table 4. We had difficulty in computing the inverse information matrix due to near singularities in the models, and hence report only the bootstrap results. As expected, the bootstrap residual method underestimates the true standard error because the overfitting has biased the residuals toward zero. In the extreme case, if we were to completely saturate the model, the residuals would all be zero and the resulting standard error estimate would also be zero. The bootstrap pairs method seems to capture the variation better, but suffers from excess variability across simulations.

4.5 Example 5: Averaging over Runs. The setup here is the same as in the left hand side of Table 2, except that the training is done by averaging the predicted values over three runs with different random starting weights. The results are shown in Table 5. The bootstrap methods still perform the best, but by a lesser amount than before. The reason is that the variation due to the choice of random starting weights has been reduced by the averaging. Presumably, if we were to average over a larger number of runs, this variation would be further reduced.

5 Discussion
In the simulation experiments of this paper, we found that:

- The bootstrap methods provided the most accurate estimates of the standard errors of predicted values.
Table 5: Results for Example 5 (Averaging over Runs).*

Method             Mean (SD) of se_k    Mean (SD) of e_k
Delta              0.37(.10)            0.24(.07)
Delta1             0.30(.10)            0.30(.07)
Delta2             0.35(.09)            0.25(.06)
Sand               0.38(.06)            0.38(.05)
Sand1              0.52(.29)            0.52(.08)
Bootp              0.68(.06)            0.16(.03)
Bootr              0.57(.02)            0.13(.01)
Actual SE          0.61(-)              -
Random weight SE   0.25(-)              -
*See text for details.

- The nonsimulation methods (delta method, sandwich estimator) missed the substantial variability due to the random choice of starting values.
Of course the results found here may not generalize to all other applications of neural networks. For example, the nonsimulation approaches may work better with fitting methods that are less sensitive to the choice of starting weights. Larger training sets, and the use of gradient descent methods, will probably lead to fewer local minima and hence less dependence on the random starting weights than seen here. In addition, in very large problems the bootstrap approaches may require too much computing time to be useful. Note that the bootstrap methods illustrated here do not suffer from matrix inversion problems in overfit networks, and do not require the existence of derivatives.

It is important to note that an interval formed by taking say plus and minus 1.96 times a standard error estimate from this paper would be an approximate confidence interval for the mean of a predicted value. This differs from a prediction interval, which is an interval for a future realization of the process. A prediction interval is typically wider than a confidence interval, because it must account for the variance of the future realization. Such an interval can be produced by increasing the width of the confidence interval by an appropriate function of the noise variance \hat\sigma^2.
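Under the constant-variance assumption of model 1.1, the widening step can be made explicit (a minimal sketch, not from the paper):

```python
import math

def prediction_se(se_mean, sigma2_hat):
    """Combine the SE of the estimated mean prediction with the estimated
    noise variance to obtain the SE appropriate for a prediction interval
    on a future observation: sqrt(se_mean^2 + sigma2_hat)."""
    return math.sqrt(se_mean ** 2 + sigma2_hat)

# an approximate 95% prediction interval is then
# yhat +/- 1.96 * prediction_se(se_mean, sigma2_hat)
print(prediction_se(0.3, 0.16))   # sqrt(0.09 + 0.16), approximately 0.5
```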
We have considered only regression problems here, but the methods generalize easily to classification problems. With k classes, one usually specifies k output units, each with a sigmoidal output function, and minimizes either squared error or the multinomial log-likelihood (cross-entropy). The only nontrivial change occurs for the bootstrap residual method. There are no natural residuals for classification problems, and instead we proceed as follows. Suppose for simplicity that we have
two classes 0 and 1, and let \hat{p}(x_i) be the estimated probability that y equals one for feature vector x_i. We fix each x_i and generate Bernoulli random variables y_i^{*b} according to Prob(y_i^{*b} = 1) = \hat{p}(x_i), for i = 1, ..., n and b = 1, ..., B. Then we proceed as in steps 3 and 4 of the bootstrap residual sampling algorithm, using either squared error or cross-entropy in step 3. An application of this procedure is described in Baxt and White (1994).

A Bayesian approach to error estimation in neural networks may be found in Buntine and Weigend (1991) and MacKay (1992). Nix and Weigend (1994) propose a method for estimating the variance of the target, allowing it to vary as a function of the input features. LeBaron and Weigend (1994) propose a method similar to the bootstrap pairs approach that uses a test set to generate the predicted values. Leonard et al. (1992) describe an alternative approach to confidence interval estimation that can be applied to radial basis networks.
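The Bernoulli resampling step of the classification bootstrap described above can be sketched as follows (illustrative code; `p_hat` is a hypothetical array of fitted class-one probabilities):

```python
import numpy as np

def bernoulli_bootstrap_targets(p_hat, B, seed=0):
    """Classification analogue of residual resampling: with no natural
    residuals, draw y_i^{*b} ~ Bernoulli(p_hat(x_i)) for each bootstrap
    sample b = 1..B; the network is then retrained on each row."""
    rng = np.random.default_rng(seed)
    return (rng.random((B, len(p_hat))) < p_hat).astype(int)

# B x n matrix of resampled 0/1 targets for the B retrainings
p_hat = np.array([0.1, 0.5, 0.9])
Y_star = bernoulli_bootstrap_targets(p_hat, B=200)
```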
Acknowledgments The author thanks Richard Lippmann, Andreas Weigend, and two referees for their valuable comments, and acknowledges the Natural Sciences and Engineering Research Council of Canada for its support.
References

Baxt, W., and White, H. 1994. Bootstrapping Confidence Intervals for Clinical Input Variable Effects in a Network Trained to Identify the Presence of Acute Myocardial Infarction. Tech. Rep., University of California, San Diego.
Buntine, W., and Weigend, A. 1994. Computing second derivatives in feed-forward neural networks: A review. IEEE Trans. Neural Networks 5, 480-488.
Chambers, J., and Hastie, T. 1991. Statistical Models in S. Wadsworth/Brooks Cole, Pacific Grove, CA.
Efron, B., and Tibshirani, R. 1993. An Introduction to the Bootstrap. Chapman and Hall, London.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Hinton, G. 1989. Connectionist learning procedures. Artificial Intelligence 40, 185-234.
Kent, J. T. 1982. Robust properties of likelihood ratio tests. Biometrika 69, 19-27.
LeBaron, B., and Weigend, A. 1994. Evaluating neural network predictors by bootstrapping. In Proceedings of the International Conference on Neural Information Processing (ICONIP'94), pp. 1207-1212. Seoul, Korea.
Leonard, J., Kramer, M., and Ungar, L. 1992. A neural network architecture that computes its own reliability. Comp. Chem. Eng. 16, 819-835.
Lippmann, R. 1989. Pattern classification using neural networks. IEEE Commun. Mag. 27, 47-64.
MacKay, D. 1992. A practical Bayesian framework for backpropagation networks. Neural Comp. 4, 448-472.
Nash, J. 1979. Compact Numerical Methods for Computers. Halsted, New York.
Nix, D., and Weigend, A. 1994. Estimating the mean and variance of a target probability distribution. In Proceedings of the IJCNN, Orlando.
Received April 30, 1994; accepted March 15, 1995.
Communicated by Federico Girosi
Neural Networks for Optimal Approximation of Smooth and Analytic Functions
We prove that neural networks with a single hidden layer are capable of providing an optimal order of approximation for functions assumed to possess a given number of derivatives, if the activation function evaluated by each principal element satisfies certain technical conditions. Under these conditions, it is also possible to construct networks that provide a geometric order of approximation for analytic target functions. The permissible activation functions include the squashing function (1 + e^{-x})^{-1} as well as a variety of radial basis functions. Our proofs are constructive. The weights and thresholds of our networks are chosen independently of the target function; we give explicit formulas for the coefficients as simple, continuous, linear functionals of the target function.

1 Introduction
In recent years, there has been a great deal of research in the theory of approximation of real valued functions using artificial neural networks with one or more hidden layers, with each principal element (neuron) evaluating a sigmoidal or radial basis function (Barron 1993; Barron and Barron 1988; Broomhead and Lowe 1988; Cybenko 1989; Girosi et al. 1995; Hornik et al. 1989; Leshno et al. 1993; Moody and Darken 1989; Poggio and Girosi 1990; Poggio et al. 1993). A typical density result shows that a network can approximate an arbitrary function in a given function class to any degree of accuracy. Such theorems are proved, for instance, in Cybenko (1989) and Hornik et al. (1989) in the case of sigmoidal activation functions, and in Park and Sandberg (1991) and Powell (1992) for radial basis functions. Very general theorems of this nature can be found in Leshno et al. (1993) and Mhaskar and Micchelli (1992). A related important problem is the complexity problem, i.e., to determine the number of neurons required to guarantee that all functions, assumed to belong to a certain function class, can be approximated within a prescribed accuracy, ε. For example, the now classical result of Barron (1993) shows that if the function is assumed to satisfy certain conditions
expressed in terms of its Fourier transform, and each of the neurons evaluates a sigmoidal activation function, then at most O(ε^{-2}) neurons are needed to achieve the order of approximation ε. An interesting aspect of this result is that the order of magnitude of the number of neurons is independent of the number of variables on which the function depends. Other bounds of this nature are obtained in Mhaskar and Micchelli (1994) when the activation function is not necessarily sigmoidal. A very common assumption about the function class is defined in terms of the number of derivatives that a function possesses. For example, one is interested in approximating all functions of s real variables having a continuous gradient. By a suitable normalization, one may assume that the gradient is bounded by 1. It is known (e.g., DeVore et al. 1989) that any reasonable approximation scheme to provide an approximation order ε for all functions in this class must depend upon at least Ω(ε^{-s}) parameters. In Mhaskar (1993), we showed how to construct networks with two hidden layers, each neuron evaluating a bounded sigmoidal function, to accomplish such an approximation order with O(ε^{-s}) neurons. Mhaskar and Micchelli (1995) have studied this problem in much greater detail. The best result known so far for networks with a single hidden layer is that O[ε^{-s-1} log(1/ε)] neurons are enough if the activation function is the squashing function 1/(1 + e^{-x}). In our work (Chui et al. 1995), we have shown that if s > 1 and the approximation is required to be "localized," then at least Ω[ε^{-s} log(1/ε)] neurons are necessary, even if different neurons may evaluate different activation functions. A detailed discussion of the notion of localized approximation is not relevant within the context of this paper; we refer the reader to Chui et al. (1995).
We made a conjecture in Mhaskar (1994) that, with a sigmoidal activation function, the number of neurons necessary to provide the approximation order ε to all functions in this class, with or without localization, cannot be O(ε^{-s}). In this paper, we disprove this conjecture. We prove that if the activation function satisfies certain technical conditions, then the optimal order of approximation for this class (and other similar classes) can be achieved with a neural network with a single hidden layer. Our results will be formulated for neural networks more general than the traditional networks evaluating a univariate activation function. In particular, our results will include estimates on the order of approximation by generalized regularization networks introduced in Girosi et al. (1995), Poggio and Girosi (1990), and Poggio et al. (1993). The precise definitions and results will be given in the next section. The proofs of all the new results in Section 2 will be given in Section 3.
H. N. Mhaskar
166
2 Main Results
Let 1 ≤ d ≤ s, n ≥ 1 be integers, f : R^s → R, and φ : R^d → R. A generalized translation network with n neurons evaluates a function of the form Σ_{k=1}^{n} a_k φ(A_k(·) + b_k), where the weights A_k are d × s real matrices, the thresholds b_k ∈ R^d, and the coefficients a_k ∈ R (1 ≤ k ≤ n). The set of all such functions (with a fixed n) will be denoted by Π_{φ;n,s}. We are interested in approximating the target function f by elements of Π_{φ;n,s} on [-1,1]^s. In the case when d = 1, the class Π_{φ;n,s} denotes the outputs of the classical neural networks with one hidden layer consisting of n neurons, each evaluating the univariate activation function φ. In Girosi et al. (1995), Poggio and Girosi (1990), and Poggio et al. (1993), the authors have pointed out the importance of the study of the more general case considered here. They have demonstrated how such general networks arise naturally in applications such as image processing and graphics as solutions of certain extremal problems. Our approximations will not be constructed as in Girosi et al. (1995), Poggio and Girosi (1990), and Poggio et al. (1993) as solutions of extremal problems, but rather will be given explicitly. They will not provide the best approximation, but will nevertheless provide the optimal order of approximation. An additional advantage of our networks is that the weights A_k and the thresholds b_k will be determined independently of the target function f. We observe in this connection that the determination of these quantities is typically a major problem in most traditional training algorithms such as backpropagation. In fact, the only "training" required for our networks consists of evaluating the coefficients a_k. We give explicit formulas for these coefficients as linear combinations of the Fourier-Chebyshev coefficients of the target function.
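The defining sum above can be written down directly. The following minimal numpy sketch (all variable names and the random test data are illustrative, not from the paper) evaluates a generalized translation network with d × s weight matrices; the Gaussian activation is one of the examples admitted in Section 2:

```python
import numpy as np

def translation_network(x, A, b, a, phi):
    """Evaluate sum_k a_k * phi(A_k @ x + b_k): the generalized translation
    network of the text, with d x s weight matrices A_k, thresholds b_k in
    R^d, and real coefficients a_k."""
    return sum(a_k * phi(A_k @ x + b_k) for a_k, A_k, b_k in zip(a, A, b))

# Illustrative instance: s = 3 inputs, d = 2, n = 4 neurons, and the
# Gaussian activation phi(y) = exp(-||y||^2).
rng = np.random.default_rng(0)
s, d, n = 3, 2, 4
A = rng.standard_normal((n, d, s))   # n weight matrices, each d x s
b = rng.standard_normal((n, d))      # thresholds in R^d
a = rng.standard_normal(n)           # coefficients
phi = lambda y: np.exp(-np.dot(y, y))

x = np.array([0.1, -0.5, 0.3])
print(translation_network(x, A, b, a, phi))
```

When d = 1 the matrices A_k collapse to row vectors and this is exactly a classical one-hidden-layer network with a univariate activation.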
Alternative formulas based on the values of the target function can also be given, but we do not present these alternative constructions here, since a good discussion of this issue would require us to elaborate upon some very technical background material. From a practical perspective, we observe that we are assuming that the target function can be sampled without noise at prescribed points. Our constructions are extremely simple, use no optimization, and avoid all the problems, for example, local minima, stability, etc., associated with the classical, optimization-based training paradigms such as backpropagation. We fully expect the constructions to be robust under noise, but have not developed any theory to deal with this question. First, we introduce some notations. If A ⊆ R^s is Lebesgue measurable, and f : A → R is a measurable function, we define the L^p(A) norms of f as follows:

||f||_{p,A} := ( ∫_A |f(x)|^p dx )^{1/p},  1 ≤ p < ∞,   ||f||_{∞,A} := ess sup_{x ∈ A} |f(x)|.   (2.1)

The class of all functions f for which ||f||_{p,A} < ∞ is denoted by L^p(A). It is
customary (and in fact essential from a theoretical point of view) to adopt the convention that if two functions are equal almost everywhere in the measure-theoretic sense, then they should be considered as equal elements of L^p(A). We make two notational simplifications. First, the symbol L^∞(A) will denote the class of continuous functions on A. In this paper, we have no occasion to consider discontinuous functions in what is normally denoted by L^∞(A), and using this symbol for the class of continuous functions will simplify the statements of our theorems. Second, when the set A = [-1,1]^s, we will not mention the set in the notation. Thus, ||f||_p will mean ||f||_{p,[-1,1]^s}, etc. We measure the degree of approximation of f by the expression

E_{φ;n,p}(f) := inf{ ||f - g||_p : g ∈ Π_{φ;n,s} }.   (2.2)
The quantity E_{φ;n,p}(f) denotes the theoretically minimal error that can be achieved in approximating the function f in the L^p norm by generalized translation networks with n neurons, each evaluating the activation function φ. The complexity problem is clearly equivalent to obtaining sharp estimates on E_{φ;n,p}(f). In theoretical investigations of the degree of approximation, one typically makes an a priori assumption that the target function f, although itself unknown, belongs to some known class of functions. In this paper, we are interested in the Sobolev classes, which we define as follows. Let r ≥ 1 be an integer and Q be a cube in R^s. The class W^r_{p,s}(Q) consists of all functions with r - 1 continuous partial derivatives on Q, which in turn can be expressed (almost everywhere on Q) as indefinite integrals of functions in L^p(Q). Alternatively, the class W^r_{p,s}(Q) consists of functions that have, at almost all points of Q, all partial derivatives up to order r such that all of these derivatives are in L^p(Q). The Sobolev norm of f ∈ W^r_{p,s}(Q) is defined by

||f||_{W^r_{p,s}(Q)} := Σ_{0 ≤ k ≤ r} ||D^k f||_{p,Q},   (2.3)

where for the multiinteger k = (k_1, ..., k_s) ∈ Z^s, 0 ≤ k ≤ r means that each component of k is nonnegative and does not exceed r, |k| := Σ_{j=1}^{s} |k_j|, and

D^k := ∂^{|k|} / (∂x_1^{k_1} ··· ∂x_s^{k_s}).

Again, W^r_{∞,s}(Q) will denote the class of functions that have continuous derivatives of order r and lower. As before, if Q = [-1,1]^s, we will not mention it in the notation. Thus, we write W^r_{p,s} = W^r_{p,s}([-1,1]^s), etc. Since the target function itself is unknown, the quantity of interest is

E_{φ;n,p,r,s} := sup{ E_{φ;n,p}(f) : ||f||_{W^r_{p,s}} ≤ 1 }.   (2.4)
We observe that any function in W^r_{p,s} can be normalized so that ||f||_{W^r_{p,s}} ≤ 1. Hence, E_{φ;n,p,r,s} measures the "worst case" degree of approximation by generalized translation networks with n neurons under the assumption that f ∈ W^r_{p,s} and is properly normalized. Since any element of Π_{φ;n,s} depends upon (ds + d + 1)n parameters, the general results by DeVore et al. (1989) indicate that

E_{φ;n,p,r,s} ≥ c n^{-r/s}.   (2.5)

The general results in DeVore et al. (1989) are not exactly applicable here, since the definition of the degree of approximation does not preclude the possibility that the parameters involved in the approximation may be discontinuous functionals on the class in question. Therefore, equation 2.5 is only a conjecture rather than a known fact. In our constructions below, the parameters are continuous functionals of the class; hence, equation 2.5 is applicable and shows that the networks provide an optimal order of approximation subject to the continuity requirement. In the sequel, we make the following convention regarding constants. The letters c, c_1, c_2, ... will denote positive constants which may depend upon p, r, s, and other explicitly indicated quantities. Their value may be different at different occurrences, even within a single formula. We now formulate our main theorem.
Theorem 2.1. Let 1 ≤ d ≤ s, r ≥ 1, n ≥ 1 be integers, 1 ≤ p ≤ ∞, and let φ : R^d → R be infinitely many times continuously differentiable in some open sphere in R^d. We further assume that there exists b in this sphere such that

D^k φ(b) ≠ 0,   k ∈ Z^d, k ≥ 0.   (2.6)

Then there exist d × s matrices {A_k}_{k=1}^{n} with the following property. For any f ∈ W^r_{p,s}, there exist coefficients a_k(f) such that

|| f - Σ_{k=1}^{n} a_k(f) φ(A_k(·) + b) ||_p ≤ c n^{-r/s} ||f||_{W^r_{p,s}}.   (2.7)

The functionals a_k are continuous linear functionals on W^r_{p,s}. In particular,

E_{φ;n,p,r,s} ≤ c n^{-r/s}.   (2.8)
We observe that condition 2.6 implies that φ is not a polynomial. For the function φ(x) := cos x_1 + cos x_2 (d = 2), we have D^{(1,1)} φ = 0. Thus, when d > 1, assumption 2.6 is stronger than the assumption that φ is not a polynomial. We suspect that it is a stronger assumption also in the case when d = 1. Proposition 2.2 shows that equation 2.6 is nevertheless satisfied by a large class of functions. In light of the first part of this proposition, we doubt that, in the case when d = 1, a nonpolynomial function that is infinitely many times differentiable but does not satisfy equation 2.6 would be of any practical interest whatever.
Proposition 2.2. Let d ≥ 1 be an integer and φ : R^d → R be infinitely many times continuously differentiable on an open sphere B. If equation 2.6 is not satisfied, i.e., at every point of B some derivative of φ is zero, then for every closed sphere U ⊂ B, there exist a multiinteger r ≥ 0, a sphere N ⊆ U, and functions h_{i,j,N} of d - 1 real variables such that

φ(x) = Σ_{i=1}^{d} Σ_{j=0}^{r_i - 1} h_{i,j,N}(x_1, ..., x_{i-1}, x_{i+1}, ..., x_d) x_i^j,   x ∈ N.   (2.9)

If d = 1 and φ is analytic in a (complex) neighborhood of some point in B but not a polynomial, then equation 2.6 is satisfied.
Some of the important examples where equation 2.6 is satisfied are the following, where for x ∈ R^d we write ||x|| := (Σ_{j=1}^{d} x_j^2)^{1/2}:

d = 1:  φ(x) := (1 + e^{-x})^{-1}   (the squashing function);

d ≥ 1:  φ(x) := (1 + ||x||^2)^q,  q ∉ Z   (generalized multiquadrics);

d ≥ 1, q ∈ Z, q > d/2:  φ(x) := ||x||^{2q-d} log ||x|| if d is even, ||x||^{2q-d} if d is odd   (thin plate splines);

d ≥ 1:  φ(x) := exp(-||x||^2)   (the Gaussian function).
If the target function is merely assumed to be in L^p rather than in W^r_{p,s}, the estimate of equation 2.7 leads to a similar estimate in terms of the modulus of smoothness of the function. This is a fairly standard argument in approximation theory, and does not add any new insight to the problem. Since a formulation of this result would require us to introduce a great deal more notation, we omit this apparent generalization. The idea behind the proof of Theorem 2.1 is simple. It is well known that for every integer m ≥ r, there exists a polynomial P_m(f) of coordinatewise degree not exceeding m such that for every f ∈ W^r_{p,s},

||f - P_m(f)||_p ≤ c m^{-r} ||f||_{W^r_{p,s}}.   (2.10)

Following Leshno et al. (1993), we express each monomial in P_m(f) in terms of a suitable derivative of φ. In turn, this derivative can be approximated by an appropriate divided difference, involving O(m^s) evaluations of φ. A careful bookkeeping then yields Theorem 2.1. If the target function f is analytic in the polyellipse

E_ρ := { z = (z_1, ..., z_s) ∈ C^s : |z_j + (z_j^2 - 1)^{1/2}| ≤ ρ, j = 1, ..., s }   (2.11)
for some ρ > 1 and 1 < ρ_1 < ρ, then for every integer m ≥ 1 there exists a polynomial L_m(f) (Siciak 1962), different from the polynomials described above, with coordinatewise degree not exceeding m such that

||f - L_m(f)||_p ≤ c_{ρ,ρ_1} ρ_1^{-m} max_{z ∈ E_ρ} |f(z)|.   (2.12)
Approximating these polynomials by networks as above, we get the following theorem.

Theorem 2.3. Let 1 ≤ d ≤ s, n ≥ 1 be integers, 1 < ρ_1 < ρ, 1 ≤ p ≤ ∞, and let f be analytic in the polyellipse E_ρ. Further, let φ be as in Theorem 2.1. Then

E_{φ;n,p}(f) ≤ c_{ρ,ρ_1} ρ_1^{-n^{1/s}} max_{z ∈ E_ρ} |f(z)|.   (2.13)
It is possible to obtain some estimates on the degree of approximation under substantially weaker assumptions on φ than those assumed in Theorems 2.1 and 2.3. One strategy, as in Leshno et al. (1993), would be to take the convolution of φ with a suitable, infinitely many times continuously differentiable function, apply Theorem 2.1 (or Theorem 2.3) to the resulting function, and use a quadrature formula. We have not yet worked out the details of this argument, but it seems unlikely that these estimates would be optimal under the weak assumptions on φ. Using the ideas in the proof of Theorem 2.1, it is also possible to obtain estimates for simultaneous approximation of derivatives of the target function. This would follow from the corresponding theorems in the theory of trigonometric approximation (cf. Mhaskar and Micchelli 1995). Although the technical details in these generalizations are expected to be of some interest, we do not wish to pursue these ideas further in this paper.

3 Proofs
To prove Theorem 2.1, we first recall some well known facts from the theory of trigonometric approximation. These will be used to construct the polynomial operator in equation 2.10. The subspace of 2π-periodic functions in L^p([-π,π]^s) [respectively, W^r_{p,s}([-π,π]^s)] will be denoted by L^{p*} (respectively, W^{r*}_{p,s}). If g ∈ L^{p*}, its Fourier coefficients are defined by

ĝ(k) := (2π)^{-s} ∫_{[-π,π]^s} g(t) e^{-i k·t} dt,   k ∈ Z^s.   (3.1)

The partial sums of the Fourier series of g are defined by

s_m(g, t) := Σ_{-m ≤ k ≤ m} ĝ(k) e^{i k·t},   m ∈ Z^s, m ≥ 0, t ∈ [-π,π]^s,   (3.2)

where the notation k ≤ m means k_j ≤ m_j, 1 ≤ j ≤ s. The de la Vallée
Poussin operator is defined by

v_m(g) := m^{-s} Σ_{m ≤ j ≤ 2m-1} s_j(g),   m ≥ 1,   (3.3)

where the sum is over multiintegers j with m ≤ j_l ≤ 2m - 1, 1 ≤ l ≤ s. The de la Vallée Poussin operator has the following important property.

Proposition 3.1. (Cf. Timan 1963 and Mhaskar and Micchelli 1995.) If r ≥ 1 and s, m ≥ 1 are integers, 1 ≤ p ≤ ∞, and g ∈ W^{r*}_{p,s}, then v_m(g) is a trigonometric polynomial of coordinatewise order at most 2m and

||g - v_m(g)||_{p,[-π,π]^s} ≤ c m^{-r} ||g||_{W^{r*}_{p,s}}.   (3.4)

Further,

Σ_k |(v_m(g))^(k)| ≤ c m^N ||g||_{p,[-π,π]^s},   (3.5)

where N := s / min(p, 2).
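The operator is easy to exercise numerically in one variable. The sketch below assumes the classical univariate form v_m(g) = (1/m) Σ_{j=m}^{2m-1} s_j(g) (the test function g(t) = exp(cos t) and the grid size are illustrative choices, not from the paper); since g is smooth and 2π-periodic, v_m(g) converges very rapidly, as Proposition 3.1 with large r predicts:

```python
import numpy as np

def fourier_coeff(g_vals, t, k):
    # hat-g(k) = (2*pi)^{-1} * integral of g(t) e^{-ikt} dt, computed by the
    # trapezoid rule on a uniform grid (spectrally accurate for periodic g).
    return np.mean(g_vals * np.exp(-1j * k * t))

m = 8
t = np.linspace(-np.pi, np.pi, 512, endpoint=False)
g = np.exp(np.cos(t))
coeffs = {k: fourier_coeff(g, t, k) for k in range(-2 * m, 2 * m + 1)}

def s(j):   # Fourier partial sum s_j(g) evaluated on the grid
    return sum(coeffs[k] * np.exp(1j * k * t) for k in range(-j, j + 1))

v_m = sum(s(j) for j in range(m, 2 * m)) / m     # de la Vallee Poussin mean
print(np.max(np.abs(g - v_m.real)))              # uniform error; tiny here
```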
The standard way to construct a periodic function from a function on [-1,1]^s is to make the substitution x_j =: cos t_j, 1 ≤ j ≤ s, for x ∈ [-1,1]^s and t ∈ [-π,π]^s. Obviously, the integrals defining the L^p norms are no longer equal under this substitution. Therefore, we make the following construction. According to Stein (1970, §VI.3.1), there exists a continuous linear operator T : W^r_{p,s} → W^r_{p,s}([-2,2]^s) such that the restriction of T(f) to [-1,1]^s is (almost everywhere) equal to f. The continuity of the operator T means that

||T(f)||_{W^r_{p,s}([-2,2]^s)} ≤ c ||f||_{W^r_{p,s}}   (3.6)

for every f in W^r_{p,s}. In practical applications, f itself may be defined on [-2,2]^s. We may then choose to work with f itself rather than T(f). However, the bounds in equation 2.7 will then depend upon ||f||_{W^r_{p,s}([-2,2]^s)} rather than ||f||_{W^r_{p,s}}. Next, let ψ be an infinitely many times continuously differentiable function that takes the value 1 on [-1,1]^s and 0 outside of [-3/2, 3/2]^s. Then the function T(f)ψ coincides with f on [-1,1]^s, is identically 0 outside [-3/2, 3/2]^s, and

||T(f)ψ||_{W^r_{p,s}([-2,2]^s)} ≤ c ||f||_{W^r_{p,s}}.   (3.7)
In the sequel, we denote the extension T(f)ψ of f again by the symbol f. We define a 2π-periodic function from the function f (extended as above) by the formula

f*(t) := f(2 cos t_1, ..., 2 cos t_s),   t ∈ [-π,π]^s.   (3.8)

Then f* ∈ W^{r*}_{p,s} and, using induction and the fact that f is identically 0 outside of [-3/2, 3/2]^s, we conclude from equation 3.7 that

c_1 ||f||_{W^r_{p,s}} ≤ ||f*||_{W^{r*}_{p,s}} ≤ c_2 ||f||_{W^r_{p,s}}.   (3.9)
Now, it is easy to check that for any integer m, v_m(f*) is an even function, and hence we may write

v_m(f*, t) = Σ_{0 ≤ k ≤ 2m} V_k(f) Π_{j=1}^{s} cos(k_j t_j).   (3.10)

For integer k ≥ 0, let T_k be the Chebyshev polynomial adapted to the interval [-2,2], defined by (cf. Timan 1963)

T_k(2 cos t) := cos(kt),   t ∈ [-π,π],   (3.11)

and for a multiinteger k ≥ 0, let

T_k(x) := Π_{j=1}^{s} T_{k_j}(x_j).   (3.12)

Then

P_m(f) := Σ_{0 ≤ k ≤ 2m} V_k(f) T_k   (3.13)

is an algebraic polynomial of coordinatewise degree at most 2m and is related to v_m(f*) by the formula

P_m[f](2 cos t_1, ..., 2 cos t_s) = v_m(f*, t),   t ∈ [-π,π]^s.

Consequently, we obtain from equations 3.4 and 3.9 that

||f - P_m(f)||_p ≤ c m^{-r} ||f||_{W^r_{p,s}}.   (3.14)
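The adapted Chebyshev polynomials of equation 3.11 can be evaluated directly from their defining relation, without ever expanding the coefficients. A minimal sketch (the sample values are illustrative):

```python
import math

def T(k, x):
    """Chebyshev polynomial adapted to [-2, 2]: T_k(2 cos t) = cos(kt),
    so for x in [-2, 2] we may take t = acos(x / 2)."""
    return math.cos(k * math.acos(x / 2.0))

def T_multi(k, x):
    """Tensor-product version of equation 3.12 for a multiinteger k."""
    return math.prod(T(kj, xj) for kj, xj in zip(k, x))

# Spot checks against the first explicit polynomials on [-2, 2]:
# T_0(x) = 1, T_1(x) = x/2, T_2(x) = x^2/2 - 1.
assert abs(T(0, 0.7) - 1.0) < 1e-12
assert abs(T(1, 0.7) - 0.35) < 1e-12
assert abs(T(2, 1.0) - (1.0**2 / 2 - 1)) < 1e-12
print(T_multi((2, 1), (1.0, 0.7)))
```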
Also, in view of equations 3.5 and 3.9, we have

Σ_{0 ≤ k ≤ 2m} |V_k(f)| ≤ c m^N ||f||_{W^r_{p,s}}.   (3.15)

The next step in the proof of Theorem 2.1 is to construct an approximation to every polynomial. This is summarized in the following lemma.

Lemma 3.2. Let φ satisfy the conditions of Theorem 2.1, m ≥ 1 be an integer, and k ≥ 0 be any multiinteger in Z^s with max_{1 ≤ j ≤ s} k_j ≤ m. Then for every ε > 0, there exists G_{k,m,ε} ∈ Π_{φ;(6m+1)^s,s} such that

||T_k - G_{k,m,ε}||_∞ ≤ ε.   (3.16)
The weights and thresholds of each G_{k,m,ε} may be chosen from a fixed set with cardinality not exceeding (6m+1)^s.

Proof. First, we consider the case when d = 1. The point b in equation 2.6 is a real number in this case and accordingly will be denoted by b. Let φ be infinitely many times continuously differentiable on [b - δ, b + δ]. For a multiinteger p = (p_1, ..., p_s) and x ∈ R^s, we write x^p := Π_{j=1}^{s} x_j^{p_j}, where 0^0 is interpreted as 1. From the formula

φ_p(w; x) := ∂^{|p|}/(∂w_1^{p_1} ··· ∂w_s^{p_s}) φ(w·x + b) = x^p φ^{(|p|)}(w·x + b)   (3.17)

we conclude that

x^p = (φ^{(|p|)}(b))^{-1} φ_p(0; x).   (3.18)

Following the ideas in Leshno et al. (1993), we now replace the partial derivative φ_p(0; x) by an appropriate divided difference. For multiintegers p and r, we write

(p choose r) := Π_{j=1}^{s} (p_j choose r_j).

For any h > 0, the network defined by the formula

Φ_{p,h}(x) := h^{-|p|} Σ_{0 ≤ r ≤ p} (-1)^{|p|-|r|} (p choose r) φ(h r·x + b)   (3.19)

is in Π_{φ;(m+1)^s,s} and represents a divided difference for φ_p(0; x). Further, we have

||Φ_{p,h} - φ_p(0; ·)||_∞ ≤ M_{φ,m,s} h,   max_{1 ≤ j ≤ s} p_j ≤ m,   |h| ≤ δ/(3ms),   (3.20)

where M_{φ,m,s} is a positive constant depending only on the indicated variables. Now, we write T_k(x) := Σ_{0 ≤ p ≤ k} d_{k,p} x^p, and choose h > 0 small enough (depending on ε, m, s, and φ) that the accumulated error from equation 3.20 does not exceed ε. Then equation 3.20 implies that the network G_{k,m,ε} defined by

G_{k,m,ε}(x) := Σ_{0 ≤ p ≤ k} d_{k,p} (φ^{(|p|)}(b))^{-1} Φ_{p,h}(x),   x ∈ [-1,1]^s,   (3.21)

satisfies equation 3.16. For each k, the weights and thresholds in G_{k,m,ε} are chosen from the set

{ (rh, b) : r ∈ Z^s, |r_j| ≤ 3m, 1 ≤ j ≤ s }.
The cardinality of this set is (6m+1)^s. Therefore, G_{k,m,ε} ∈ Π_{φ;(6m+1)^s,s}. Next, if d > 1 and b is as in equation 2.6, then we consider the univariate function

σ(x) := φ(x, b_2, ..., b_d).

The function σ satisfies all the hypotheses of Theorem 2.1, with b_1 in place of b in equation 2.6. Taking into account the fact that σ(w·x + b_1) = φ(A_w x + b) with A_w the d × s matrix whose first row is w and whose remaining rows are zero, any network in Π_{σ;n,s} is also a network in Π_{φ;n,s}. Therefore, the case d = 1 implies the lemma also when d > 1. □
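The divided-difference trick at the heart of Lemma 3.2 is easy to demonstrate numerically. The sketch below uses a forward-difference stencil (the paper's exact stencil may differ) and the stand-in φ = exp with b = 0, for which φ^{(k)}(0) = 1 for every k, so the divided difference should directly recover the monomial x^p; all names and sample values are illustrative:

```python
import math
from itertools import product

def divided_difference_monomial(x, p, h, phi=math.exp, b=0.0):
    """Approximate x^p = prod_j x_j^{p_j} by a divided difference built from
    evaluations of phi(w . x + b), in the spirit of equations 3.17-3.19.
    Here phi = exp and b = 0, so the normalizing factor phi^{(|p|)}(b) is 1."""
    total_order = sum(p)
    acc = 0.0
    for r in product(*(range(pj + 1) for pj in p)):
        sign = (-1) ** (total_order - sum(r))
        coeff = math.prod(math.comb(pj, rj) for pj, rj in zip(p, r))
        acc += sign * coeff * phi(h * sum(rj * xj for rj, xj in zip(r, x)) + b)
    return acc / h ** total_order

x, p = (0.5, -0.3), (2, 1)
approx = divided_difference_monomial(x, p, h=1e-3)
exact = math.prod(xj ** pj for xj, pj in zip(x, p))   # 0.5^2 * (-0.3) = -0.075
print(approx, exact)
```

Shrinking h drives the error to zero, at the cost of the numerical cancellation that equation 3.20 quantifies.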
Proof of Theorem 2.1. Without loss of generality, we may assume that n ≥ 13^s. Let m ≥ 1 be the largest integer such that (12m+1)^s ≤ n. We define P_m(f) = Σ_{0 ≤ k ≤ 2m} V_k(f) T_k as in equation 3.13. In view of equation 3.15, the network

N_n(f, x) := Σ_{0 ≤ k ≤ 2m} V_k(f) G_{k,2m,ε_m}(x),   ε_m := m^{-N-r},   (3.22)

is in Π_{φ;n,s} and satisfies

||P_m(f) - N_n(f)||_∞ ≤ c m^{-r} ||f||_{W^r_{p,s}}.

Since ||g||_p ≤ 2^{s/p} ||g||_∞ for all Lebesgue measurable functions g on [-1,1]^s, we get from equation 3.14 that

||f - N_n(f)||_p ≤ c n^{-r/s} ||f||_{W^r_{p,s}},

as required. Further, it is quite clear that the coefficients V_k are continuous linear functionals on W^r_{p,s}. Hence, the continuity assertion follows. □
We will prove Proposition 2.2 after the proof of Theorem 2.3.
Proof of Theorem 2.3. Again, we may assume that n ≥ 7^s, and let m ≥ 1 be the largest integer such that (6m+1)^s ≤ n. We write x_j := cos[(2j+1)π/(2m)], 0 ≤ j ≤ m, and use the Lagrange interpolation polynomial L_m(f) at the points {(x_{k_1}, ..., x_{k_s})}, 0 ≤ k ≤ m, in place of P_m(f) in the proof of Theorem 2.1. According to Siciak (1962), this polynomial satisfies equation 2.12. Theorem 2.3 then follows as an application of Lemma 3.2 in exactly the same way as Theorem 2.1. □

We end this section with a proof of Proposition 2.2.
Proof of Proposition 2.2. For multiinteger k ≥ 0, let

Z_k := { x ∈ U : D^k φ(x) = 0 }.

Since equation 2.6 is not satisfied, we have U = ∪_{k ∈ Z^d, k ≥ 0} Z_k. Now, each Z_k is a closed set, and U, being a closed sphere, is a complete metric space. Therefore, Baire's category theorem implies that for some multiinteger r ≥ 0, Z_r contains a nonempty interior. Hence, there exists an open sphere N ⊂ U such that D^r φ(x) = 0 for every x ∈ N. Equation 2.9 expresses φ as a solution of this differential equation on N. If d = 1, φ is analytic in a closed neighborhood U of some point x_0 ∈ B, and equation 2.6 is not satisfied, then we have proved that φ is equal to a polynomial on some interval contained in U. The identity theorem of complex analysis then shows that φ itself is a polynomial. □

4 Conclusions
We have constructed generalized translation networks with a single hidden layer that provide an optimal order of approximation for functions in Sobolev classes, similar to the order obtained in classical polynomial approximation theory. If the target function is analytic, then it is possible to get a geometric rate of approximation, again similar to polynomial approximation. The weights and thresholds of our networks are chosen independently of the target function. We give explicit formulas for the coefficients, so that the "training" consists of calculating certain simple, continuous linear functionals on the target function. The activation function for the network is fairly general, but has to satisfy certain smoothness conditions. Among the activation functions for which our theorems are applicable are the squashing function, the Gaussian function, thin plate splines, and generalized multiquadric functions.
Acknowledgments

I wish to thank Professors F. Girosi and T. Poggio, MIT Artificial Intelligence Laboratories, for their kind encouragement in this research. The research was supported in part by National Science Foundation Grant DMS-9404513 and Air Force Office of Scientific Research Grant F49620-93-1-0150.
References

Barron, A. R. 1993. Universal approximation bounds for superposition of a sigmoidal function. IEEE Trans. Information Theory 39, 930-945. Barron, A. R., and Barron, R. L. 1988. Statistical learning networks: A unified view. In Symposium on the Interface: Statistics and Computing Science, April, Reston, Virginia.
Broomhead, D. S., and Lowe, D. 1988. Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321-355. Chui, C. K., Li, X., and Mhaskar, H. N. 1995. Some limitations on neural networks with one hidden layer. Submitted. Cybenko, G. 1989. Approximation by superposition of sigmoidal functions. Math. Control Signals Syst. 2, 303-314. DeVore, R., Howard, R., and Micchelli, C. A. 1989. Optimal nonlinear approximation. Manuscripta Math. 63, 469-478. Girosi, F., Jones, M., and Poggio, T. 1995. Regularization theory and neural networks architectures. Neural Comp. 7, 219-269. Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366. Leshno, M., Lin, V., Pinkus, A., and Schocken, S. 1993. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6, 861-867. Mhaskar, H. N. 1993. Approximation properties of a multilayered feedforward artificial neural network. Adv. Comp. Math. 1, 61-80. Mhaskar, H. N. 1994. Approximation of real functions using neural networks. In Proceedings of International Conference on Computational Mathematics, H. P. Dikshit and C. A. Micchelli, eds. World Scientific Press, New Delhi, India. Mhaskar, H. N., and Micchelli, C. A. 1992. Approximation by superposition of a sigmoidal function and radial basis functions. Adv. Appl. Math. 13, 350-373. Mhaskar, H. N., and Micchelli, C. A. 1994. Dimension independent bounds on the degree of approximation by neural networks. IBM J. Res. Dev. 38, 277-284. Mhaskar, H. N., and Micchelli, C. A. 1995. Degree of approximation by neural and translation networks with a single hidden layer. Adv. Appl. Math. 16, 151-183. Moody, J., and Darken, C. 1989. Fast learning in networks of locally tuned processing units. Neural Comp. 1(2), 282-294. Park, J., and Sandberg, I. W. 1991. Universal approximation using radial basis function networks. Neural Comp. 3, 246-257.
Poggio, T., and Girosi, F. 1990. Networks for approximation and learning. Proc. IEEE 78(9). Poggio, T., Girosi, F., and Jones, M. 1993. From regularization to radial, tensor, and additive splines. In Neural Networks for Signal Processing III, C. A. Kamm, G. M. Kuhn, B. Yoon, R. Chellappa, and S. Y. Kung, eds., pp. 3-10. IEEE, New York. Powell, M. J. D. 1992. The theory of radial basis function approximation. In Advances in Numerical Analysis II: Wavelets, Subdivision Algorithms and Radial Basis Functions, W. A. Light, ed., pp. 105-210. Clarendon Press, Oxford. Siciak, J. 1962. On some extremal functions and their applications in the theory of analytic functions of several complex variables. Trans. Am. Math. Soc. 105, 322-357. Stein, E. M. 1970. Singular Integrals and Differentiability Properties of Functions. Princeton Univ. Press, Princeton.
Timan, A. F. 1963. Theory of Approximation of Functions of a Real Variable. Macmillan, New York.
Received January 23, 1995; accepted April 10, 1995.
Communicated by Michael Jordan
Equivalence of Linear Boltzmann Chains and Hidden Markov Models

David J. C. MacKay
Cavendish Laboratory, Madingley Road, Cambridge CB3 0HE, United Kingdom

Several authors have studied the relationship between hidden Markov models and "Boltzmann chains" with a linear or "time-sliced" architecture. Boltzmann chains model sequences of states by defining state-state transition energies instead of probabilities. In this note I demonstrate that under the simple condition that the state sequence has a mandatory end state, the probability distribution assigned by a strictly linear Boltzmann chain is identical to that assigned by a hidden Markov model.

Several authors have made a link between hidden Markov models for time series and energy-based models (Luttrell 1989; Williams 1990; Saul and Jordan 1995). Saul and Jordan (1995) discuss a linear Boltzmann chain model with state-state transition energies A_{ii'} (going from state i to state i') and symbol emission energies B_{ij}, under which the probability of an entire state sequence {i_l, j_l}_1^L, given the length of the sequence, L, is
P({i_l, j_l}_1^L | n, A, B, L) = (1/Z(n, A, B, L)) exp[−(Π_{i_1} + Σ_{l=1}^{L−1} A_{i_l i_{l+1}} + Σ_{l=1}^{L} B_{i_l j_l})]   (1)

where Z(n, A, B, L) is the obvious normalizing constant. Here the symbol i runs over n discrete "hidden" states, and j runs over m visible states. In contrast, a hidden Markov model (HMM) assigns a probability distribution of the form
P({i_l, j_l}_1^L | π, a, b, L) = π_{i_1} b_{i_1 j_1} Π_{l=1}^{L−1} a_{i_l i_{l+1}} b_{i_{l+1} j_{l+1}}   (2)

where π_i is a prior probability vector for the initial state, a_{ii'} is a transition probability matrix, and b_{ij} is a matrix of emission probabilities satisfying, respectively:

Σ_i π_i = 1,   Σ_{i'} a_{ii'} = 1 ∀i,   and   Σ_j b_{ij} = 1 ∀i   (3)
Neural Computation 8, 178-181 (1996) © 1995 Massachusetts Institute of Technology
Boltzmann Chains and Hidden Markov Models
179
Here again, the symbol i runs over an alphabet of n hidden states, and j runs over m visible states. While any HMM can be written as a linear Boltzmann chain by setting exp(−A_{ii'}) = a_{ii'}, exp(−B_{ij}) = b_{ij}, and exp(−Π_i) = π_i, not all linear Boltzmann chains can be represented as HMMs (Saul and Jordan 1995). However, the difference between the two models is minimal. To be precise, if the final hidden state i_L of a linear Boltzmann chain is constrained to be a particular end state, then the distribution over sequences is identical to that of a
hidden Markov model.

Proof. Start from the distribution (1) and consider the quantity in the exponent. The probability distribution over states {i_l, j_l}_1^L is unchanged if we subtract arbitrary constants μ, ν from this exponent. The distribution will also be unaffected if we add arbitrary terms β_{i_l} to every appearance of B_{i_l j_l}, provided we also subtract β_{i_l} from every term A_{i_l i_{l+1}}. And we may similarly add α_{i_{l+1}} to every term A_{i_l i_{l+1}} if we also subtract α_{i_{l+1}} from the following term A_{i_{l+1} i_{l+2}}. The probability distribution may therefore be rewritten unchanged (except for the normalizing constant) as
P({i_l, j_l}_1^L | n, A, B, L) ∝ exp[ −(Π_{i_1} + α_{i_1} + μ) − Σ_{l=1}^{L−1} (A_{i_l i_{l+1}} + α_{i_{l+1}} − α_{i_l} − β_{i_l} + ν) − Σ_{l=1}^{L} (B_{i_l j_l} + β_{i_l}) + α_{i_L} + β_{i_L} ]   (4)
where μ, ν, {α_i, β_i} are arbitrary quantities. This probability distribution has the form of an HMM (equation 2) if

1. the quantities π_i = exp(−(Π_i + α_i + μ)), a_{ii'} = exp(α_i + β_i − A_{ii'} − α_{i'} − ν), and b_{ij} = exp[−(B_{ij} + β_i)] satisfy the normalization conditions (3);

2. the trailing term α_{i_L} + β_{i_L} can be treated as a constant, which holds if we assume that i_L is fixed to a particular end state i_L = n, say (a commonly applied constraint in the HMM literature).
Does a solution over μ, ν, {α_i, β_i} of the normalization conditions (3) exist? Trivially, we find for β_i:

β_i = log Σ_j exp(−B_{ij})
The normalization condition that {α_i} and ν must satisfy is

Σ_{i'} exp(α_i + β_i − A_{ii'} − α_{i'} − ν) = 1   ∀i

Rearranging, we obtain

Σ_{i'} [exp(β_i − A_{ii'})][exp(−α_{i'})] = exp(ν)[exp(−α_i)]   ∀i
180
David MacKay
which can be recognized as an eigenvector/eigenvalue equation for the matrix M_{ii'} = [exp(β_i − A_{ii'})], with exp(ν) being the eigenvalue and [exp(−α_i)] being the eigenvector. This eigenproblem has a solution, by the Perron-Frobenius theorem (Seneta 1973, p. 1), which states that a positive matrix (i.e., one in which every element M_{ii'} is positive) has a positive eigenvector with positive eigenvalue. A solution for {α_i} and ν therefore exists. Finally, μ is given by the remaining condition Σ_i π_i = 1, that is,

μ = log Σ_i exp(−(Π_i + α_i))
This completes the proof. □
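The constructive steps of this proof are easy to check numerically. The sketch below (pure Python; all energies hypothetical, drawn at random, with power iteration standing in for an exact Perron-Frobenius eigensolver) computes β_i from the emission normalization, solves the eigenproblem for {α_i} and ν, sets μ from the condition Σ_i π_i = 1, and confirms that the constructed π, a, b satisfy the HMM normalization conditions (3):

```python
import math, random

random.seed(0)
n, m = 3, 2
# Hypothetical arbitrary Boltzmann-chain energies (no probabilistic structure assumed).
A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
B = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(n)]
Pi = [random.uniform(-1, 1) for _ in range(n)]

# beta_i from the emission normalization: sum_j b_ij = 1.
beta = [math.log(sum(math.exp(-B[i][j]) for j in range(m))) for i in range(n)]

# Perron-Frobenius step: power iteration on the positive matrix M_ii' = exp(beta_i - A_ii')
# converges to its positive eigenvector, which plays the role of exp(-alpha_i').
M = [[math.exp(beta[i] - A[i][ip]) for ip in range(n)] for i in range(n)]
v = [1.0] * n
for _ in range(5000):
    w = [sum(M[i][ip] * v[ip] for ip in range(n)) for i in range(n)]
    s = max(w)
    v = [x / s for x in w]
nu = math.log(sum(M[0][ip] * v[ip] for ip in range(n)) / v[0])  # exp(nu) = eigenvalue
alpha = [-math.log(x) for x in v]
mu = math.log(sum(math.exp(-(Pi[i] + alpha[i])) for i in range(n)))

pi = [math.exp(-(Pi[i] + alpha[i] + mu)) for i in range(n)]
a = [[math.exp(alpha[i] + beta[i] - A[i][ip] - alpha[ip] - nu) for ip in range(n)]
     for i in range(n)]
b = [[math.exp(-(B[i][j] + beta[i])) for j in range(m)] for i in range(n)]

# The constructed quantities satisfy the HMM normalization conditions (3).
assert abs(sum(pi) - 1) < 1e-6
for i in range(n):
    assert abs(sum(a[i]) - 1) < 1e-6
    assert abs(sum(b[i]) - 1) < 1e-6
```

The emission and prior normalizations hold exactly by construction; only the transition rows depend on the eigenvector, which is why the Perron-Frobenius argument is the substantive step.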
The linear Boltzmann chain therefore can differ from an HMM only in having a pseudo-prior over its final state as well as a pseudo-prior over its initial state. However, the equivalence of linear Boltzmann chains to HMMs may prove fruitful in stimulating the development of new optimization methods for these models. And it may be found that Saul and Jordan's generalizations to Boltzmann chains with more complex architectures provide useful new modeling capabilities. The Boltzmann chain, and its relationship to HMMs, have also been studied by Luttrell (1989), who calls it the "Gibbs machine," and by Williams (1990), who calls it a "Boltzmann machine with a time-sliced architecture and Potts units." Luttrell discusses an alternative optimization algorithm to the decimation method suggested by Saul and Jordan, and notes that the Gibbs machine is only an improvement on the HMM when generalized to architectures with loops and other nontree structures. Williams also shows how to translate an HMM into a Boltzmann machine and notes that a generalized Boltzmann machine with a "componential" structure (similar to the "coupled parallel Boltzmann chains" of Saul and Jordan) has greater representational power than a single HMM of the same size.
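The easy direction noted earlier, writing any HMM as a linear Boltzmann chain via exp(−A_{ii'}) = a_{ii'}, exp(−B_{ij}) = b_{ij}, exp(−Π_i) = π_i, can also be verified directly. In this sketch (a hypothetical 2-state, 2-symbol HMM; all numbers invented for illustration) the chain's Boltzmann distribution over length-3 sequences reproduces the HMM probabilities exactly, with normalizing constant Z = 1:

```python
import itertools, math

# Hypothetical 2-state, 2-symbol HMM.
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.2, 0.8]]
b = [[0.9, 0.1], [0.5, 0.5]]

# Boltzmann-chain energies via the mapping in the text:
# exp(-A_ii') = a_ii', exp(-B_ij) = b_ij, exp(-Pi_i) = pi_i.
A = [[-math.log(p) for p in row] for row in a]
B = [[-math.log(p) for p in row] for row in b]
Pi = [-math.log(p) for p in pi]

L = 3
def hmm_prob(states, symbols):
    p = pi[states[0]] * b[states[0]][symbols[0]]
    for l in range(L - 1):
        p *= a[states[l]][states[l + 1]] * b[states[l + 1]][symbols[l + 1]]
    return p

def energy(states, symbols):
    return (Pi[states[0]]
            + sum(A[states[l]][states[l + 1]] for l in range(L - 1))
            + sum(B[states[l]][symbols[l]] for l in range(L)))

seqs = list(itertools.product(range(2), repeat=L))
Z = sum(math.exp(-energy(s, j)) for s in seqs for j in seqs)
for s in seqs:
    for j in seqs:
        assert abs(math.exp(-energy(s, j)) / Z - hmm_prob(s, j)) < 1e-12
```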
Acknowledgments
I thank Radford Neal and Chris Williams for comments on the manuscript. This work is supported by a Royal Society research fellowship.

References

Luttrell, S. P. 1989. The Gibbs Machine Applied to Hidden Markov Model Problems. Part 1: Basic Theory. Tech. Rep. 99, SP4 division, RSRE, Malvern, U.K.

Saul, L., and Jordan, M. 1995. Boltzmann chains and hidden Markov models. Adv. Neural Inform. Process. Syst. 7, 435-442.

Seneta, E. 1973. Non-Negative Matrices. John Wiley & Sons, New York.
Williams, C. K. I. 1990. Using deterministic Boltzmann machines to discriminate temporally distorted strings. Master’s thesis, Department of Computer Science, University of Toronto; see also Williams, C. K. I., and Hinton, G. E. 1990. Mean field networks that learn to discriminate temporally distorted strings. In Proceedings of the 1990 Connectionist Models Summer School, D. S. Touretzky, J. L. Elman, T. S. Sejnowski, and G. E. Hinton, eds. Morgan Kaufmann, San Mateo, CA.
Received September 23, 1994; accepted April 10, 1995.
Communicated by Fernando Pineda
Diagrammatic Derivation of Gradient Algorithms for Neural Networks
Deriving gradient algorithms for time-dependent neural network structures typically requires numerous chain rule expansions, diligent bookkeeping, and careful manipulation of terms. In this paper, we show how to derive such algorithms via a set of simple block diagram manipulation rules. The approach provides a common framework to derive popular algorithms including backpropagation and backpropagation-through-time without a single chain rule expansion. Additional examples are provided for a variety of complicated architectures to illustrate both the generality and the simplicity of the approach. 1 Introduction
Deriving the appropriate gradient descent algorithm for a new network architecture or system configuration normally involves brute force derivative calculations. For example, the celebrated backpropagation algorithm for training feedforward neural networks was derived by repeatedly applying chain rule expansions backward through the network (Rumelhart et al. 1986; Werbos 1974; Parker 1982). However, the actual implementation of backpropagation may be viewed as a simple reversal of signal flow through the network. Another popular algorithm, backpropagation-through-time for recurrent networks, can be derived by Euler-Lagrange or ordered derivative methods, and involves both a signal flow reversal and time reversal (Werbos 1992; Nguyen and Widrow 1989). For both of these algorithms, there is a reciprocal nature to the forward propagation of states and the backward propagation of gradient terms. Furthermore, both algorithms are efficient in the sense that calculations are order N, where N is the number of variable weights in the network. These properties are often attributed to the clever manner in which the algorithms
Derivation of Gradient Algorithms
183
were derived for a specific network architecture. We will show, however, that these properties are universal to all network architectures and that the associated gradient algorithm may be formulated directly with virtually no effort. The approach consists of a simple diagrammatic method for construction of a reciprocal network that directly specifies the gradient derivation. This is in contrast to graphical methods, which simply illustrate the relationship between signal and gradient flow after derivation of the algorithm by an alternative method (Narendra and Parthasarathy 1990; Nerrand et al. 1993). The reciprocal networks are, in principle, identical to adjoint systems seen in N-stage optimal control problems (Bryson and Ho 1975). While adjoint methods have been applied to neural networks, such approaches have been restricted to specific architectures where the adjoint systems resulted from a disciplined Euler-Lagrange optimization technique (Parisini and Zoppoli 1994; Toomarian and Barhen 1992; Matsuoka 1991). Here we use a graphic approach for the complete derivation. We thus prefer the term "reciprocal" network, which further imposes certain topological constraints and is taken from the electrical network literature. The concepts detailed in this paper were developed in Wan (1993) and later presented in Wan and Beaufays (1994).¹

1.1 Network Adaptation and Error Gradient Propagation. In supervised learning, the goal is to find a set of network weights W that minimizes a cost function J = Σ_k L_k[d(k), y(k)], where k is used to specify a discrete time index (the actual order of presentation may be random or sequential), y(k) is the output of the network, d(k) is a desired response, and L_k is a generic error metric that may contain additional weight regularization terms. For illustrative purposes, we will work with the squared error metric, L_k = eᵀ(k)e(k), where e(k) is the error vector.
Optimization techniques invariably require calculation of the gradient vector ∂J/∂W(k). At the architectural level, a variable weight w_{ij} may be isolated between two points in a network with corresponding signals a_i(k) and a_j(k) [i.e., a_j(k) = w_{ij} a_i(k)]. Using the chain rule, we get²

∂J/∂w_{ij} = δ_j(k) a_i(k)   (1.1)

where we define the error gradient δ_j(k) ≜ ∂J/∂a_j(k). The error gradient δ_j(k) depends on the entire topology of the network. Specifying the gradients necessitates finding an explicit formula for calculating the delta

¹The method presented here is similar in spirit to Automatic Differentiation (Rall 1981; Griewank and Corliss 1991). Automatic Differentiation is a simple method for finding derivatives of functions and algorithms that can be represented by acyclic graphs. Our approach, however, applies to discrete-time systems with the possibility of feedback. In addition, we are concerned with diagrammatic derivations rather than computational rule-based implementations.

²In the general case of a variable parameter, we have a_j(k) = f[w_{ij}, a_i(k)], and equation 1.1 becomes δ_j(k) ∂a_j(k)/∂w_{ij}(k), where the partial term depends on the form of f.
Eric A. Wan and Françoise Beaufays
terms. Backpropagation, for example, is nothing more than an algorithm for generating these terms in a feedforward network. In the next section, we develop a simple nonalgebraic method for deriving the delta terms associated with any network architecture.
2 Network Representation and Reciprocal Construction Rules
An arbitrary neural network can be represented as a block diagram whose building blocks are: summing junctions, branching points, univariate functions, multivariate functions, and time-delay operators. Only discrete-time systems are considered. A signal located within the network is labeled a_i(k). A synaptic weight, for example, may be thought of as a linear transmittance, which is a special case of a univariate function. The basic neuron is simply a sum of linear transmittances followed by a univariate sigmoid function. Networks can then be constructed from individual neurons, and may include additional functional blocks and time-delay operators that allow for buffering of signals and internal memory. This block diagram representation is really nothing more than the typical pictorial description of a neural network with a bit of added formalism. Directly from the block diagram we may construct the reciprocal network by reversing the flow direction in the original network, labeling all resulting signals δ_i(k), and performing the following operations:
1. Summing junctions are replaced with branching points.

2. Branching points are replaced with summing junctions.
3. Univariate functions are replaced by their derivatives.

Explicitly, the scalar continuous function a_j(k) = f[a_i(k)] is replaced by δ_i(k) = f'[a_i(k)] δ_j(k), where f'[a_i(k)] ≜ da_j(k)/da_i(k). Note this rule replaces a nonlinear function by a linear time-dependent transmittance. Special cases are:

• Weights: a_j = w_{ij} a_i, in which case δ_i = w_{ij} δ_j.

• Activation functions: a_j(k) = tanh[a_i(k)]. In this case, f'[a_i(k)] = 1 − a_j²(k).
4. Multivariate functions are replaced with their Jacobians.

A multivariate function maps a vector of input signals into a vector of output signals, a_out = F(a_in). In the transformed network, we have δ_in(k) = F'[a_in(k)] δ_out(k), where F'[a_in(k)] ≜ ∂a_out(k)/∂a_in(k) corresponds to a matrix of partial derivatives. For shorthand, F'[a_in(k)] will be written simply as F'(k). Clearly both summing junctions and univariate functions are special cases of multivariate functions. A multivariate function may also represent a product junction (for sigma-pi units) or even another multilayer network. For a multilayer network, the product F'[a_in(k)] δ_out(k) is found directly by backpropagating δ_out through the network.
5. Delay operators are replaced with advance operators.

A delay operator q^{−1} performs a unit time delay on its argument: a_j(k) = q^{−1} a_i(k) = a_i(k − 1). In the reciprocal system, we form a unit time advance: δ_i(k) = q^{+1} δ_j(k) = δ_j(k + 1). The resulting system is thus noncausal. Actual implementation of the reciprocal network in a causal manner is addressed in specific examples.
6. Outputs become inputs.

By reversing the signal flow, output nodes a_o(k) = y_o(k) in the original network become input nodes in the reciprocal network. These inputs are then set at each time step to −2e_o(k). [For cost functions other than squared error, the input should be set to ∂L_k/∂y_o(k).] These six rules allow direct construction of the reciprocal network from the original network.³ Note that there is a topological equivalence between the two networks. The order of computations in the reciprocal network is thus identical to the order of computations in the forward network. Whereas the original network corresponds to a nonlinear time-independent system (assuming the weights are fixed), the reciprocal network is a linear time-dependent system. The signals δ_i(k) that propagate through the reciprocal network correspond to the terms ∂J/∂a_i(k) necessary for gradient adaptation. Exact equations may then be "read out" directly from the reciprocal network, completing the derivation. A formal proof of the validity and generality of this method is presented in Appendix A.

3 Examples
3.1 Backpropagation. We start by rederiving standard backpropagation. Figure 1 shows a hidden neuron feeding other neurons and an output neuron in a multilayer network. For consistency with traditional notation, we have labeled the summing junction signal s_i^l rather than a_i, and added superscripts to denote the layer. In addition, since multilayer networks are static structures, we omit the time index k. The reciprocal network shown in Figure 2 is found by applying the construction rules of the previous section. From this figure, we may immediately write down the equations for calculating the delta terms:

δ_i^l = f'[s_i^l] Σ_j w_{ij}^{l+1} δ_j^{l+1},   δ_i^L = −2e_i f'[s_i^L]   (3.1)
These are precisely the equations describing standard backpropagation. In this case, there are no delay operators and δ_i = δ_i(k) ≜ ∂J_k/∂s_i(k) = ∂[eᵀ(k)e(k)]/∂s_i(k) corresponds to an instantaneous gradient. Readers familiar with neural networks have undoubtedly seen these diagrams before. What is new is the concept that the diagrams themselves may be used directly, completely circumventing all intermediate steps involving tedious algebra.

³Clearly, this set of rules is not a minimal set, i.e., a summing junction can be considered a special case of a multivariate function. However, we choose this set for ease and clarity of construction.

Figure 1: Block diagram construction of a multilayer network.

Figure 2: Reciprocal multilayer network.

3.2 Backpropagation-Through-Time. For the next example, consider a network with output feedback (see Fig. 3) described by

y(k) = N[x(k), y(k − 1)]   (3.2)

where x(k) are external inputs, and y(k) represents the vector of outputs that form feedback connections. N is a multilayer neural network. If N has only one layer of neurons, every neuron output has a feedback
Figure 3: Recurrent network and backpropagation-through-time.
connection to the input of every other neuron and the structure is referred to as a fully recurrent network (Williams and Zipser 1989). Typically, only a select set of the outputs have an actual desired response. The remaining outputs have no desired response (error equals zero) and are used for internal computation. Direct calculation of gradient terms using chain rule expansions is extremely complicated. A weight perturbation at a specified time step affects not only the output at future time steps, but future inputs as well. However, applying the reciprocal construction rules (see Fig. 3) we find immediately:

δ(k) = −2e(k) + N'(k) δ(k + 1)   (3.3)

These are precisely the equations describing backpropagation-through-time, which have been derived in the past using either ordered derivatives (Werbos 1974) or Euler-Lagrange techniques (Plumer 1993). The diagrammatic approach is by far the simplest and most direct method. Note that the causality constraints require these equations to be run backward in time. This implies a forward sweep of the system to generate the output states and internal activation values, followed by a backward sweep through the reciprocal network. Also from rule 4 in the previous section, the product N'(k)δ(k + 1) may be calculated directly by a standard backpropagation of δ(k + 1) through the network at time k.⁴

⁴Backpropagation-through-time is viewed as an off-line gradient descent algorithm in which weight updates are made after each presentation of an entire training sequence. An on-line version in which adaptation occurs at each time step is possible using an algorithm called real-time-backpropagation (Williams and Zipser 1989). The algorithm, however, is far more computationally expensive. The authors have presented a method based on flow graph interreciprocity to directly relate the two algorithms (Beaufays and Wan 1994a).
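As a sanity check, the forward/backward sweep of backpropagation-through-time can be exercised on a deliberately tiny example. The sketch below (a hypothetical scalar recurrent net y(k) = tanh(wx·x(k) + wy·y(k−1)); pure Python, all numbers invented) runs the backward sweep in which each delta collects −2e(k) plus the advanced delta returned through the feedback path, and compares the resulting weight gradient with a central finite difference:

```python
import math

# Hypothetical scalar recurrent net y(k) = tanh(wx*x(k) + wy*y(k-1)), y(0) = 0.
wx, wy = 0.8, -0.5
xs = [0.3, -0.1, 0.7, 0.2]
ds = [0.1, 0.0, 0.4, -0.2]
K = len(xs)

def run(wx, wy):
    y, ys = 0.0, []
    for k in range(K):
        y = math.tanh(wx * xs[k] + wy * y)
        ys.append(y)
    return ys

ys = run(wx, wy)
es = [ds[k] - ys[k] for k in range(K)]

# Backward sweep through the reciprocal network: the advance q^{+1} makes the
# recursion run backward in time, with delta(K+1) = 0.
delta_s = [0.0] * (K + 1)
for k in range(K - 1, -1, -1):
    delta_y = -2 * es[k] + wy * delta_s[k + 1]   # feedback path carries the advanced delta
    delta_s[k] = (1 - ys[k] ** 2) * delta_y      # tanh block becomes (1 - y^2) transmittance

# Gradient read-out: dJ/dwy = sum_k delta_s(k) * y(k-1).
grad_wy = sum(delta_s[k] * (ys[k - 1] if k > 0 else 0.0) for k in range(K))

# Central finite-difference check on J = sum_k e(k)^2.
def J(wx, wy):
    return sum((ds[k] - y) ** 2 for k, y in enumerate(run(wx, wy)))

eps = 1e-6
fd = (J(wx, wy + eps) - J(wx, wy - eps)) / (2 * eps)
assert abs(fd - grad_wy) < 1e-6
```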
Figure 4: Neural network control using nonlinear ARMA models.

3.2.1 Backpropagation-Through-Time and Neural Control Architectures. Backpropagation-through-time can be extended to a number of neural control architectures (Nguyen and Widrow 1989; Werbos 1992). A system may be configured using full-state feedback or more complicated ARMA (AutoRegressive Moving Average) models as illustrated in Figure 4. To adapt the weights of the controller, it is necessary to find the gradient terms that constitute the effective error for the neural network. Figure 5 illustrates how such terms may be directly acquired using the reciprocal network. A variety of other recurrent architectures may be considered, from hierarchical networks to radial basis networks with feedback. In all cases, the diagrammatic approach provides a direct derivation of the gradient algorithm.

3.3 Cascaded Neural Networks. Let us now turn to an example of two cascaded neural networks (Fig. 6), which will further illustrate advantages of the diagrammatic approach. The inputs to the first network are samples from a time sequence x(k). Delayed outputs of the first network are fed to the second network. The cascaded networks are defined as
u(k) = N_1[W_1, x(k), x(k − 1), x(k − 2)]   (3.4)

y(k) = N_2[W_2, u(k), u(k − 1), u(k − 2)]   (3.5)

where W_1 and W_2 represent the weights parameterizing the networks, y(k) is the output, and u(k) the intermediate signal. Given a desired response for the output y of the second network, it is straightforward to use backpropagation for adapting the second network. It is not obvious, however, what the effective error should be for the first network.
Figure 5: Reciprocal network for control using nonlinear ARMA models.
Figure 6: Cascaded neural network filters and reciprocal counterpart.
Derivation of Gradient Algorithms
191
From the reciprocal network also shown in Figure 6, we simply label the desired signals and write down the gradient relations:

δ_u(k) = δ_1(k) + δ_2(k + 1) + δ_3(k + 2)   (3.6)

with

[δ_1(k)  δ_2(k)  δ_3(k)] = −2eᵀ(k) N_2'[u(k)]   (3.7)

i.e., each δ_i(k) is found by backpropagation through the output network, and the δ_i's (after appropriate advance operations) are summed together. The gradient for the weights in the first network is thus given by

∂J_k/∂W_1 = δ_u(k) ∂u(k)/∂W_1   (3.8)
in which the product term is found by a single backpropagation with δ_u(k) acting as the error to the first network. Equations can be made causal by simply delaying the weight update for a few time steps. Clearly, extrapolating to an arbitrary number of taps is also straightforward.

For comparison, let us consider the brute force derivative approach to finding the gradient. Using the chain rule, the instantaneous error gradient is evaluated as:

∂J_k/∂W_1 = −2eᵀ(k) [ ∂y(k)/∂u(k) ∂u(k)/∂W_1 + ∂y(k)/∂u(k − 1) ∂u(k − 1)/∂W_1 + ∂y(k)/∂u(k − 2) ∂u(k − 2)/∂W_1 ]   (3.9)

= δ_1(k) ∂u(k)/∂W_1 + δ_2(k) ∂u(k − 1)/∂W_1 + δ_3(k) ∂u(k − 2)/∂W_1   (3.10)

where we define δ_i(k) ≜ −2eᵀ(k) ∂y(k)/∂u(k − i + 1). Again, the δ_i terms are found simultaneously by a single backpropagation of the error through the second network. Each product δ_i(k)[∂u(k − i + 1)/∂W_1] is then found by backpropagation applied to the first network with δ_i(k) acting as the error. However, since the derivatives used in backpropagation are time-dependent, separate backpropagations are necessary for each δ_i(k). These equations, in fact, imply backpropagation through an unfolded structure and are equivalent to weight sharing (LeCun et al. 1989) as illustrated in Figure 7. In situations where there may be hundreds of taps in the second network, this algorithm is far less efficient than the one derived directly using reciprocal networks. Similar arguments can be used to derive an efficient on-line algorithm for adapting time-delay neural networks (Waibel et al. 1989).
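The advance-and-sum rule for the cascaded configuration can likewise be checked numerically. In the sketch below (a deliberately simplified, hypothetical cascade: N_1 is a scalar tanh stage and N_2 is reduced to a bare three-tap linear filter, so backpropagating through N_2 amounts to multiplying by the tap coefficients; all numbers invented) the deltas from the three taps are advanced and summed to form δ_u(k), and the gradient for the first stage matches a finite difference:

```python
import math

# Hypothetical cascade: u(k) = tanh(w1*x(k)); y(k) = c0*u(k) + c1*u(k-1) + c2*u(k-2).
w1 = 0.6
c = [0.5, -0.3, 0.2]
xs = [0.4, -0.2, 0.9, 0.1, -0.5]
ds = [0.2, 0.0, 0.3, -0.1, 0.4]
K = len(xs)

def run(w1):
    us = [math.tanh(w1 * x) for x in xs]
    ys = [sum(c[i] * us[k - i] for i in range(3) if k - i >= 0) for k in range(K)]
    return us, ys

us, ys = run(w1)
es = [ds[k] - ys[k] for k in range(K)]

# Backprop through the (linear) second stage gives delta_i(k) = -2 e(k) c_{i-1};
# the reciprocal tap line applies advances before summing.
def delta(i, k):
    return -2 * es[k] * c[i] if 0 <= k < K else 0.0

grad_w1 = 0.0
for k in range(K):
    delta_u = delta(0, k) + delta(1, k + 1) + delta(2, k + 2)   # equation-(3.6)-style sum
    grad_w1 += delta_u * (1 - us[k] ** 2) * xs[k]               # through tanh to w1

# Central finite-difference check on J = sum_k e(k)^2.
def J(w1):
    _, ys = run(w1)
    return sum((ds[k] - ys[k]) ** 2 for k in range(K))

eps = 1e-6
fd = (J(w1 + eps) - J(w1 - eps)) / (2 * eps)
assert abs(fd - grad_w1) < 1e-6
```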
Figure 7: Cascaded neural network filters unfolded-in-time. 3.4 Temporal Backpropagation. An extension of the feedforward network can be constructed by replacing all scalar weights with discrete time linear filters to provide dynamic interconnectivity between neurons. Mathematically, a neuron i in layer I may be specified as
(3.11) (3.12) where a ( k ) are neuron output values, s ( k ) are summing junctions, f ( . )are sigmoid functions, and W(q-') are synaptic filters5 Three possible forms for W(qp')are Case I I
M
Case I1
(3.13)
5The time domain operator q-' is used instead of the more common z-domain variable 2 - ' . The z notation would imply an actual transfer function that does not apply in nonlinear systems.
Figure 8: Block diagram construction of an FIR network and corresponding reciprocal structure.
In Case I, the filter reduces to a scalar weight and we have the standard definition of a neuron for feedforward networks. Case II corresponds to a Finite Impulse Response (FIR) filter in which the synapse forms a weighted sum of past values of its input. Such networks have been utilized for a number of time-series and system identification problems (Wan 1993a,b,c). Case III represents the more general Infinite Impulse
Figure 9: IIR filter realizations: (a) controller canonical, (b) reciprocal observer canonical, (c) lattice, (d) reciprocal lattice. Response (IIR) filter, in which feedback is permitted. In all cases, coefficients are assumed to be adaptive. Figure 8 illustrates a network composed of FIR filter synapses realized as tap-delay lines. Deriving the gradient descent rule for adapting filter coefficients is quite formidable if we use a direct chain rule approach. However, using the construction rules described earlier, we may trivially form the reciprocal network also shown in Figure 8. By inspection we have
Consideration of an output neuron at layer L yields ? f ( k )= -2e,(k)f'[sF(k)]. These equations define the algorithm known as temporal hnckyropagafioii (Wan 1993a,b). The algorithm may be viewed as a temporal generalization of backpropagation in which error gradients are propagated not by simply taking weighted sums, but by backward filtering. Note that in the reciprocal network, backpropagation is achieved through the reciprocal filters W(q+*). Since this is a noncausal filter, it is necessary to introduce a delay of a few time steps to implement the on-line adaptation. In the IIR case, it is easy to verify that equation 3.14 for temporal backpropagation still applies with W(q+') representing a noncausal IIR filter. As with backpropagation-through-time, the network must be trained using a forward and backward sweep necessitating storage of all activation values at each step in time. Different realizations for the filters dictate
how signals flow through the reciprocal structure as illustrated in Figure 9. In all cases, computations remain order N (this is in contrast with the order N² algorithms derived by Back and Tsoi (1991) using direct chain rule methods). Note that the poles of the forward IIR filters are reciprocal to the poles of the reciprocal filters. Stability monitoring can be made easier if we consider lattice realizations, in which case stability is guaranteed if the magnitude of each coefficient is less than 1. Regardless of the choice of the filter realization, reciprocal networks provide a simple unified approach for deriving a learning algorithm.⁶ The above examples allow us to extrapolate the following additional construction rule: Any linear subsystem H(q^{−1}) in the original network is transformed to H(q^{+1}) in the reciprocal system.

4 Summary
The previous examples served to illustrate the ease with which algorithms may be derived using the diagrammatic approach. One starts with a diagrammatic representation of the network of interest. A reciprocal network is then constructed by simply swapping summing junctions with branching points, continuous functions with derivative transmittances, and time delays with time advances. The final algorithm is read directly off the reciprocal network. No messy chain rules are needed. The approach provides a unified framework for formally deriving gradient algorithms for arbitrary network architectures, network configurations, and systems.
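As a concrete sanity check of the recipe summarized above, the following sketch (a hypothetical 2-2-1 tanh network, pure Python, all numbers invented) applies the construction rules to a static network: the output is fed with −2e, tanh blocks become (1 − a²) transmittances, summing junctions and branching points swap roles, and the gradients are read off as δ_j a_i. One entry is compared against a central finite difference:

```python
import math

# Hypothetical 2-2-1 tanh network.
w1 = [[0.3, -0.2], [0.5, 0.1]]   # layer-1 weights w1[i][j]: input j -> neuron i
w2 = [0.4, -0.7]                 # layer-2 weights
x, d = [0.5, -0.3], 0.2          # input and desired response

def forward(w1, w2, x):
    s1 = [sum(w1[i][j] * x[j] for j in range(2)) for i in range(2)]
    a1 = [math.tanh(s) for s in s1]
    y = math.tanh(sum(w2[i] * a1[i] for i in range(2)))
    return s1, a1, y

s1, a1, y = forward(w1, w2, x)
e = d - y

# Reciprocal network: output becomes an input fed with -2e (rule 6); tanh blocks
# become (1 - a^2) transmittances (rule 3); junctions and branch points swap (rules 1, 2).
delta_y = -2 * e * (1 - y * y)
delta_1 = [w2[i] * delta_y * (1 - a1[i] ** 2) for i in range(2)]

# Gradient read-out, dJ/dw = delta_j * a_i.
grad_w2 = [delta_y * a1[i] for i in range(2)]
grad_w1 = [[delta_1[i] * x[j] for j in range(2)] for i in range(2)]

# Check one entry against a central finite difference of J = e^2.
def J(w1, w2):
    _, _, yy = forward(w1, w2, x)
    return (d - yy) ** 2

eps = 1e-6
w1p = [row[:] for row in w1]; w1p[0][1] += eps
w1m = [row[:] for row in w1]; w1m[0][1] -= eps
fd = (J(w1p, w2) - J(w1m, w2)) / (2 * eps)
assert abs(fd - grad_w1[0][1]) < 1e-6
```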
Appendix A: Proof of Reciprocal Construction Rules

We show that the diagrammatic method constitutes a formal derivation for arbitrary network architectures. Intuitively, the chain rule applied to individual building blocks yields the reciprocal architecture. However, delay operators, which cannot be differentiated, as well as feedback, prevent a straightforward chain rule approach to the proof. Instead, we use a more rigorous approach that may be outlined as follows: (1) It is argued that a perturbation applied to a specific node in the network propagates through a derivative network that is topologically equivalent to the original network. (2) The derivative network is systematically unraveled in time to produce a linear time-independent flow graph. (3) Next, the principle of flow graph interreciprocity is invoked to reverse the signal flow through the unraveled network. (4) The reverse flow graph is

⁶In related work (Beaufays and Wan 1994b), the diagrammatic method was used to derive an algorithm to minimize the output power at each stage of an FIR lattice filter. This provides an adaptive lattice predictor used as a decorrelating preprocessor to a second adaptive filter. The new algorithm is more effective than the Griffiths algorithm (Griffiths 1977).
raveled back in time to produce the reciprocal network. The input, originally corresponding to a perturbation, becomes an output providing the desired gradient. (5) By symmetry, it is argued that all signals in the reciprocal network correspond to proper gradient terms.

1. We will initially assume that only univariate functions exist within the network. This is by no means restrictive. It has been shown (Hornik et al. 1989; Cybenko 1989) that a feedforward network with two or more layers and a sufficient number of internal neurons can approximate any uniformly continuous multivariate function to an arbitrary accuracy. A feedforward network is, of course, composed of simple univariate functions and summing junctions. Thus any multivariate function in the overall network architecture is assumed to be well approximated using a univariate composition. We may completely specify the topology of a network by the set of equations
a_j(k) = Σ_i T_{ji} ∘ a_i(k)   (A.1)

T_{ji} ∈ {f(·), q^{−1}}   (A.2)
where a_j(k) is the signal corresponding to the node a_j at time k. The sum is taken over all signals a_i(k) that connect to a_j(k), and T_{ji} is a transmittance operator corresponding to either a univariate function (e.g., sigmoid function, constant multiplicative weight) or a delay operator. (The symbol ∘ is used to remind us that T is an operator whose argument is a.) The signals a_i(k) may correspond to inputs (a_i ∈ x), outputs (a_i ∈ y), or internal signals to the network. Feedback of signals is permitted. Let us add to a specific node a* a perturbation Δa*(k) at time k. The perturbation propagates through the network resulting in effective perturbations Δa_i(k) for all nodes in the network. Through a continuous univariate function in which a_j(k) = f[a_i(k)] we have, to first order:

Δa_j(k) = f'[a_i(k)] Δa_i(k)   (A.3)

where it must be clearly understood that Δa_i(k) and Δa_j(k) are the perturbations directly resulting from the external perturbation Δa*(k). Through a delay operator, a_j(k) = q^{−1} a_i(k) = a_i(k − 1), we have

Δa_j(k) = q^{−1} Δa_i(k) = Δa_i(k − 1)   (A.4)

Combining these two results with equation A.1 gives
Δa_j(k) = Σ_i T'_{ji} ∘ Δa_i(k)   ∀j   (A.5)

where we define T' ∈ {f'[a_i(k)], q^{−1}}. Note that f'[a_i(k)] is a linear time-dependent transmittance. Equation A.5 defines a derivative network that is
Figure 10: (a) Time-dependent input/output system for the derivative network. (b) Same system with all delays drawn externally.
topologically identical to the original network (i.e., one-to-one correspondence between signals and connections). Functions are simply replaced by their derivatives. This is a rather obvious result, and simply states that a perturbation propagates through the same connections and in the same direction as would normal signals. 2. The derivative network may be considered a time-dependent system with input Δa*(k) and outputs Δy(k) as illustrated in Figure 10a. Imagine now redrawing the network such that all delay operators q^{-1} are dragged outside the functional block (Fig. 10b). Equation A.5 still applies. Neither the definition nor the topology of the network has been changed. However, we may now remove the delay operators by cascading copies of the derivative network as illustrated in Figure 11. Each stage has a different set of transmittance values corresponding to the time step. The unraveling process stops at the final time K (K is allowed to approach ∞). Additionally, the outputs Δy(n) at each stage are multiplied by −2e(n)ᵀ and then summed over all stages to produce a single output ΔJ = Σ_{n=1}^{K} −2e(n)ᵀ Δy(n). By removing the delay operators, the time index k may now be treated as simply a labeling index. The unraveled structure is thus a time-independent linear flow graph. By linearity, all signals in the flow graph can be divided by Δa*(k) so that the input is now 1, and the output is ΔJ/Δa*(k). In the limit of small Δa*(k), this ratio becomes the partial derivative ∂J/∂a*(k).
Eric A. Wan and Francoise Beaufays
198

Figure 11: Flow graph corresponding to unraveled derivative network.

Figure 12: Transposed flow graph corresponding to unraveled derivative network.

Since the system is causal, the partial derivative of the error at time n with respect to a*(k) is zero for n < k. Thus

δ*(k) = ∂J/∂a*(k) = Σ_{n=k}^{K} ∂[e(n)ᵀe(n)]/∂a*(k).    (A.7)
The term δ*(k) is precisely what we were interested in finding. However, calculating all the δ_i(k) terms would require separately propagating signals through the unraveled network with an input of 1 at each location associated with a_i(k). The entire process would then have to be repeated at every time step. 3. Next, take the unraveled network (i.e., flow graph) and form its transpose. This is accomplished by reversing the signal flow direction, transposing the branch gains, replacing summing junctions by branching points and vice versa, and interchanging input and output nodes. The new flow graph is represented in Figure 12.
From the work by Tellegen (1952) and Bordewijk (1956), we know that transposed flow graphs are a particular case of interreciprocal graphs. This means that the output obtained in one graph, when exciting the input with a given signal, is the same as the output value of the transposed graph, when exciting its input by the same signal. In other words, the two graphs have identical transfer functions.⁷ Thus, if an input of 1.0 is distributed along the lower horizontal branch of the transposed graph, the final output will equal δ*(k). This δ*(k) is identical to the output of our original flow graph. 4. The transposed graph can now be raveled back in time to produce the reciprocal network. Since the direction of signal flow has been reversed, delay operators q^{-1} become advance operators q^{+1}. The node a*(k) that was the original source of an input perturbation is now the output δ*(k) as desired. The outputs of the original network become inputs with value −2e(k). Summarizing the steps involved in finding the reciprocal network, we start with the original network, form the derivative network, unravel in time, transpose, and then ravel back up in time. These steps are accomplished directly by starting with the original network and simply swapping branching points and summing junctions (rules 1 and 2), functions f for derivatives f′ (rule 3), and q^{-1}'s for q^{+1}'s (rule 5). 5. Note that the selection of the specific node a*(k) is totally arbitrary. Had we started with any node a_i(k), we would still have arrived at the same result. In all cases, the input to the reciprocal network would still be −2e(k). Thus by symmetry, every signal in the reciprocal network provides δ_i(k) = ∂J/∂a_i(k). This is exactly what we set out to prove for the signal flow in the reciprocal network.
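The five-step construction can be checked numerically. The sketch below is illustrative and not from the paper: it assumes a hypothetical scalar recurrent net with f = tanh, runs the reciprocal recursion backward in time (the advance operators q^{+1} become a reversed loop), and compares the resulting gradient dJ/dw with a finite-difference estimate.

```python
import numpy as np

# Hypothetical scalar recurrent net (an illustrative assumption, not the paper's model):
# u(k) = w*a(k-1) + x(k), a(k) = tanh(u(k)), J = sum_k e(k)^2 with e(k) = d(k) - a(k).
def forward(w, x, d):
    a_prev = 0.0
    u, a = [], []
    for k in range(len(x)):
        u.append(w * a_prev + x[k])
        a_prev = np.tanh(u[-1])
        a.append(a_prev)
    u, a = np.array(u), np.array(a)
    return u, a, d - a                      # activations and errors

def grad_reciprocal(w, x, d):
    u, a, e = forward(w, x, d)
    K = len(x)
    delta = np.zeros(K)                     # delta(k) = dJ/da(k)
    for k in range(K - 1, -1, -1):          # advance operators q^{+1}: backward sweep
        delta[k] = -2.0 * e[k]              # inputs to the reciprocal network
        if k + 1 < K:
            delta[k] += w * (1.0 - a[k + 1] ** 2) * delta[k + 1]
    a_prev = np.concatenate(([0.0], a[:-1]))
    return np.sum(delta * (1.0 - a ** 2) * a_prev)   # dJ/dw

rng = np.random.default_rng(0)
x, d = rng.normal(size=8), rng.normal(size=8)
w = 0.5
J = lambda w_: np.sum(forward(w_, x, d)[2] ** 2)
g_adj = grad_reciprocal(w, x, d)
g_fd = (J(w + 1e-6) - J(w - 1e-6)) / 2e-6   # finite-difference reference
print(g_adj, g_fd)
```

The two numbers agree to numerical precision, as the interreciprocity argument predicts.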
Finally, for multivariate functions, a_out(k) = F[a_in(k)], where by our earlier statements it was assumed the function was explicitly represented by a composition of summing junctions and univariate functions. F is restricted to be memoryless and thus cannot be composed of any delay operators. In the reciprocal network, a_in(k) becomes δ_in(k) and a_out(k) becomes δ_out(k). But δ_in(k) = F′[a_in(k)]ᵀ δ_out(k).
Thus any section within a network that contains no delays may be replaced by the multivariate function F(·), and the corresponding section in the reciprocal network is replaced with the Jacobian F′[a_in(k)]. This verifies rule 4.

⁷This basic property, which was first presented in the context of electrical circuit analysis (Penfield et al. 1970), finds applications in a wide variety of engineering disciplines, such as the reciprocity of emitting and receiving antennas in electromagnetism (Ramo et al. 1984), and the duality between decimation-in-time and decimation-in-frequency formulations of the FFT algorithm in signal processing (Oppenheim and Schafer 1989). Flow graph interreciprocity was first applied to neural networks to relate real-time backpropagation and backpropagation-through-time by Beaufays and Wan (1994a).
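Rule 4 can likewise be verified on a small example. The block below is a sketch with an illustrative memoryless section F(a) = tanh(Wa) (the function and dimensions are assumptions, not from the paper): the reciprocal-network section applies the transposed Jacobian to the downstream sensitivities, which should match a finite-difference derivative.

```python
import numpy as np

# Illustrative memoryless multivariate section (not from the paper): F(a) = tanh(W a).
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
F = lambda a: np.tanh(W @ a)                # contains no delay operators

a_in = rng.normal(size=4)
delta_out = rng.normal(size=3)              # sensitivities arriving from downstream

Jac = (1 - F(a_in) ** 2)[:, None] * W       # Jacobian F'(a_in)
delta_in = Jac.T @ delta_out                # reciprocal-network section: transposed Jacobian

# Finite-difference check of d(delta_out . F(a_in)) / da_in:
eps = 1e-6
fd = np.array([(delta_out @ F(a_in + eps * np.eye(4)[i])
              - delta_out @ F(a_in - eps * np.eye(4)[i])) / (2 * eps)
               for i in range(4)])
print(np.max(np.abs(delta_in - fd)))
```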
Acknowledgments

This work was funded in part by EPRI under Contract RP801013 and by NSF under Grants IRI 91-12531 and ECS-9410823.
References

Back, A., and Tsoi, A. 1991. FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Comp. 3(3), 375-385.
Beaufays, F., and Wan, E. 1994a. Relating real-time backpropagation and backpropagation-through-time: An application of flow graph interreciprocity. Neural Comp. 6(2), 296-306.
Beaufays, F., and Wan, E. 1994b. An efficient first-order stochastic algorithm for lattice filters. ICANN'94 2, 1021-1024.
Bordewijk, J. 1956. Inter-reciprocity applied to electrical networks. Appl. Sci. Res. 6B, 1-74.
Bryson, A., and Ho, Y. 1975. Applied Optimal Control. Hemisphere, New York.
Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4).
Griewank, A., and Corliss, G. (eds.) 1991. Automatic differentiation of algorithms: Theory, implementation, and application. Proc. First SIAM Workshop on Automatic Differentiation, Breckenridge, Colorado.
Griffiths, L. 1977. A continuously adaptive filter implemented as a lattice structure. Proc. ICASSP, Hartford, CT, 683-686.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural Comp. 1, 541-551.
Matsuoka, K. 1991. Learning of neural networks using their adjoint systems. Syst. Computers Jpn. 22(11), 31-41.
Narendra, K., and Parthasarathy, K. 1990. Identification and control of dynamic systems using neural networks. IEEE Trans. Neural Networks 1(1), 4-27.
Nerrand, O., Roussel-Ragot, P., Personnaz, L., Dreyfus, G., and Marcos, S. 1993. Neural networks and nonlinear adaptive filtering: Unifying concepts and new algorithms. Neural Comp. 5(2), 165-199.
Nguyen, D., and Widrow, B. 1989. The truck backer-upper: An example of self-learning in neural networks. Proc. Int. Joint Conf. on Neural Networks, II, Washington, DC, 357-363.
Oppenheim, A., and Schafer, R. 1989. Digital Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.
Parisini, T., and Zoppoli, R. 1994. Neural networks for feedback feedforward nonlinear control systems. IEEE Trans. Neural Networks 5(3), 436-439.
Parker, D. 1982. Learning-Logic. Invention Report S81-64, File 1, Office of Technology Licensing, Stanford University, October.
Penfield, P., Spence, R., and Duinker, S. 1970. Tellegen's Theorem and Electrical Networks. MIT Press, Cambridge, MA.
Plumer, E. 1993. Optimal terminal control using feedforward neural networks. Ph.D. dissertation, Stanford University, Stanford, CA.
Rall, L. B. 1981. Automatic Differentiation: Techniques and Applications. Lecture Notes in Computer Science, Springer-Verlag, Berlin.
Ramo, S., Whinnery, J. R., and Van Duzer, T. 1984. Fields and Waves in Communication Electronics, 2nd Ed. John Wiley, New York.
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press, Cambridge, MA.
Tellegen, B. D. H. 1952. A general network theorem, with applications. Philips Res. Rep. 7, 259-269.
Toomarian, N. B., and Barhen, J. 1992. Learning a trajectory using adjoint function and teacher forcing. Neural Networks 5(3), 473-484.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. 1989. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoustics, Speech, Signal Process. 37(3), 328-339.
Wan, E. 1993a. Finite impulse response neural networks with applications in time series prediction. Ph.D. dissertation, Stanford University, Stanford, CA.
Wan, E. 1993b. Time series prediction using a connectionist network with internal delay lines. In Time Series Prediction: Forecasting the Future and Understanding the Past, A. Weigend and N. Gershenfeld, eds. Addison-Wesley, Reading, MA.
Wan, E. 1993c. Modeling nonlinear dynamics with neural networks: Examples in time series prediction. Proc. Fifth Workshop on Neural Networks: Academic/Industrial/NASA/Defense, WNN93/FNN93, pp. 327-232, San Francisco.
Wan, E., and Beaufays, F. 1994. Network reciprocity: A simple approach to derive gradient algorithms for arbitrary neural network structures. WCNN'94, San Diego, CA, III, 87-93.
Werbos, P. 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University, Cambridge, MA.
Werbos, P. 1992. Neurocontrol and supervised learning: An overview and evaluation. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, chap. 3, D. White and D. Sofge, eds. Van Nostrand Reinhold, New York.
Williams, R., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1, 270-280.
Received April 29, 1994; accepted May 16, 1995.
Communicated by David Wolpert
Does Extra Knowledge Necessarily Improve Generalization?

David Barber and David Saad
The generalization error is a widely used performance measure employed in the analysis of adaptive learning systems. This measure is generally critically dependent on the knowledge that the system is given about the problem it is trying to learn. In this paper we examine to what extent it is necessarily the case that an increase in the knowledge that the system has about the problem will reduce the generalization error. Using the standard definition of the generalization error, we present simple cases for which the intuitive idea of "reducivity," that more knowledge will improve generalization, does not hold. Under a simple approximation, however, we find conditions to satisfy "reducivity." Finally, we calculate the effect of a specific constraint on the generalization error of the linear perceptron, in which the signs of the weight components are fixed. This particular restriction results in a significant improvement in generalization performance. 1 Introduction
The employment of a priori knowledge in designing a learning machine is crucial to the success of the machine's ability to generalize well. Given that knowledge affects the generalization ability, our aim in this paper is to address the following question: does more knowledge necessarily improve generalization? Intuitively, the answer to this question would seem to be "yes," depending, of course, on the definitions of knowledge and generalization. Nevertheless, this question phrases a possible desideratum, which itself could affect the design of learning machines. We formulate the problem in the language of learning from examples (see, e.g., Hertz et al. 1991). A training set of input/output pairs is generated by some teacher function, and the task is to find a student function whose outputs match closely the outputs of the teacher function on the training set. Constraints on the set of possible teacher functions that generate the training set are critical in narrowing down the search for a good student. Indeed, without any constraints it is an impossible task to find a student that generalizes to unseen examples (see, e.g., Wolpert 1992). A priori assumptions
Extra Knowledge and Generalization
203
are therefore made as to the form of the teacher; that is, restrictions are imposed on the space of teacher functions. Throughout this paper we assume that the spaces of the teacher and student functions are the same. The learning problem is then realizable in the sense that within the student space there is a student that will match perfectly the output of the teacher on all possible inputs. We denote the teacher/student space of functions by F(Θ), and a particular mapping as y = f(x, θ) for f ∈ F(Θ) and θ ∈ Θ, where the output is denoted by y, and the input by x. A particular mapping that a function performs is represented by the point θ in the parameter space Θ. We assume that a teacher θ⁰ generates the noise-free set of training data L = {x^σ, f(x^σ, θ⁰)}, where σ = 1..p indexes each element of the training set L. In the learning problem, one attempts to find a student f(x, θ) that matches the teacher f(x, θ⁰) on the training set.¹ To measure the extent to which the student has learned the teacher, an error measure ε(θ, θ⁰, x) is defined. The set of admissible students, represented by the parameter space Θ̃ ⊆ Θ, is determined by the requirement of minimizing the error measure on all examples in the training set, and satisfying a priori constraints on the student. Hence Θ̃ expresses all the information that the student has about the teacher.² In Section 2 we review briefly the definition of the generalization error, before formulating the original question more rigorously. We subsequently consider specific cases, beginning with the simplest possible, a one-dimensional version space. In Section 3, we analyze higher dimensions, using the linear perceptron as the function space F(Θ). In Section 4 we conclude with a summary of the main results of the paper and an outlook on further research. 2 General Theory
2.1 The Generalization Error. To measure how well the student performs on the p training examples, the training energy is formed, E_tr ∝ Σ_{σ=1}^{p} ε(θ, θ⁰, x^σ). The student is found by minimizing the training error with respect to the parameter θ, while also adhering to additional a priori constraints. This is typically achieved by stochastic gradient descent, resulting in a posttraining distribution of students, P(θ | L) ∝ P^pr(θ) exp(−E_tr/T), where the temperature, T, controls the randomness of the stochastic algorithm (see, e.g., Watkin et al. 1993). P^pr(θ) is the a priori constraint on the student. In the limit of zero T, the distribution of students becomes uniform over those that have zero training error and satisfy the a priori constraints; this space of student functions

¹Extra regularization conditions on the student, such as weight decay, will not be considered here.
²We briefly note that the assumption that the set of admissible functions is all that is known about the teacher function is found also in the PAC approach (see, e.g., Haussler 1994); in this paper, however, we address somewhat different issues.
204
David Barber and David Saad
is known as the version space (Watkin et al. 1993), which we denote by Θ̃.³ In Section 3.3, we present results for nonzero T, but for the rest of the paper, zero T is implied. To find the expected error that a student makes on a random example input, termed the generalization function, we average the error over the input distribution, P(x), giving ε_f(θ, θ⁰) = ∫ dx P(x) ε(θ, θ⁰, x).⁴ Hence, ε_f(θ, θ⁰) measures the expected error that a student makes, given that the teacher is θ⁰ and that the student is θ. As the student does not know the teacher, we assume that Θ̃ expresses all the information that the student has about the teacher. The generalization error is then defined as the expected performance of a randomly selected student from Θ̃, given a randomly selected teacher from Θ̃,

ε_g(Θ̃) = ⟨⟨ε_f(θ, θ⁰)⟩_{θ∈Θ̃}⟩_{θ⁰∈Θ̃},

where ⟨·⟩_{θ∈Θ̃} and ⟨·⟩_{θ⁰∈Θ̃} represent averages over the version space Θ̃.⁵ We write ε_g(Θ̃) to emphasize that the generalization error is a function of the version space. Intuitively, one expects that any further restrictions or a priori assumptions, resulting in a smaller version space, must necessarily reduce the generalization error. To test this intuition, we make the following definition.
Definition. F(Θ̃′) is an "error reduced" function space of F(Θ̃) if ε_g(Θ̃′) < ε_g(Θ̃) for Θ̃′ ⊂ Θ̃, and we say that "reducivity" holds. In this paper we examine which subsets Θ̃′ of Θ̃ are error reducing, according to the preceding definition. We mention briefly that one can also consider the generalization error for a fixed teacher, ε_g(Θ̃, θ⁰) = ⟨ε_f(θ, θ⁰)⟩_{θ∈Θ̃}, and check reducivity with the teacher assumed known. We show in a later section, however, that the main results of this paper also hold for ε_g(Θ̃, θ⁰), and concentrate accordingly on ε_g(Θ̃). 2.2 One-Dimensional Version Space. We begin with the simplest possible case of a one-dimensional version space, assuming that it can be parameterized by a connected interval on the real line, which we write, without loss of generality, as [0, a]. Furthermore, we assume that the generalization function can be written as ε_f(θ, θ⁰) = g(|θ − θ⁰|), for some function g(·).⁶ ε_g(Θ̃) is then simply ε_g(a) = ∫₀^a dθ P(θ) ∫₀^a dθ⁰ P(θ⁰) g(|θ − θ⁰|),

³The student distribution we consider is known also as exhaustive learning (see, e.g., Schwartz et al. 1990).
⁴An extension to the framework of this paper is to consider the off-training-set error (see, e.g., Wolpert 1992), in which the expected error of the student is calculated for test examples not included in the training set.
⁵In this joint average of ε_f(θ, θ⁰) over the version space, we assume independence of the student and the teacher: As the training set is fixed, we write P(θ⁰, θ | L) = P(θ | θ⁰, L) P(θ⁰ | L). With the assumption P(θ | θ⁰, L) = P(θ | L), we have that θ and θ⁰ are independently distributed over Θ̃.
⁶With this assumption as to the form of the generalization function we have in mind a larger class of error measures than the square error measure, ε(θ, θ⁰, x) =
where P(·) is the parameter space distribution. For a uniform distribution, P(θ) = P(θ⁰) = 1/a, and we can write

ε_g(a) = (2/a²) ∫₀^a dx (a − x) g(x),

for which the requirement of reducivity, i.e., dε_g(a)/da > 0, becomes

(2/a) ∫₀^a dx x g(x) > ∫₀^a dx g(x).

This is equivalent to

∫₀^a dx x (⟨g⟩_a − ⟨g⟩_x) > 0,

where ⟨g⟩_x is the average value of g(·) over the interval [0, x]. For a monotonically increasing function, ⟨g⟩_a > ⟨g⟩_x (a > x), and thus reducivity holds for all monotonically increasing functions defined on the real line. Unfortunately, for higher-dimensional cases it is not generally possible to separate the dependence of the generalization function into a summation over the individual components of the parameter vector, i.e., ε_f(θ, θ⁰) ≠ Σ_{i=1}^{n} g(|θ_i − θ_i⁰|), where n is the dimension of the parameterization, and more complicated effects can appear. In the following sections we concentrate on the linear perceptron, beginning with an explicit example of a two-dimensional version space that violates the error reduction property. 3 The Linear Perceptron
For the noise-free linear perceptron, the inputs are represented by n-dimensional real vectors, x ∈ ℝⁿ, and the output is a single-valued real variable, y ∈ ℝ (see, e.g., Hertz et al. 1991). The inputs x are assumed drawn independently and identically from a zero-mean, unit-covariance-matrix gaussian distribution. The teacher outputs are given by f(x, w⁰) = w⁰·x/√n. Similarly, the student outputs are f(x, w) = w·x/√n. The error measure is taken to be proportional to the squared difference between the teacher and student outputs, ε(w, w⁰, x) = (w·x − w⁰·x)²/2n, which gives ε_f(w, w⁰) = (w − w⁰)²/2n. We also impose the additional a priori spherical constraint on both the student and teacher, w·w = w⁰·w⁰ = n. We proceed to analyze this model for a specific version space. 3.1 A Two-Dimensional Version Space. Let us consider the three-dimensional linear perceptron. A point on the surface of a three-dimensional sphere of radius r = √3 is given by the ordered pair (φ, θ), which represents the usual spherical polar coordinate parameterization.⁷

⁶(cont.) 1/2 [f(x, θ) − f(x, θ⁰)]², for which the assumption ε_f(θ, θ⁰) = g(|θ − θ⁰|) would hold only for the linear function f(x, θ) = xθ and g(s) = s².
⁷w₁ = r cos(φ) sin(θ), w₂ = r sin(φ) sin(θ), w₃ = r cos(θ), where r = √3 for the spherical normalization condition.
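The error measure above admits a quick Monte Carlo sanity check; the sketch below (illustrative, not from the paper) verifies that the input-averaged error equals (w − w⁰)²/2n, which under the spherical constraint is 1 − w·w⁰/n.

```python
import numpy as np

# Monte Carlo check that eps_f(w, w0) = (w - w0)^2 / 2n for zero-mean,
# unit-covariance gaussian inputs; with w.w = w0.w0 = n this equals 1 - w.w0/n.
rng = np.random.default_rng(1)
n = 3
w0 = rng.normal(size=n); w0 *= np.sqrt(n) / np.linalg.norm(w0)   # spherical constraint
w  = rng.normal(size=n); w  *= np.sqrt(n) / np.linalg.norm(w)
X = rng.normal(size=(200000, n))                                  # gaussian inputs
emp = np.mean((X @ w - X @ w0) ** 2) / (2 * n)                    # empirical eps_f
exact = 1 - w @ w0 / n                                            # closed form
print(emp, exact)
```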
Figure 1: A sphere of radius √3. The shaded region represents the version space, Θ̃ = {θ ∈ [0.4, 0.6], φ ∈ [0, 2π]}. Making Θ̃ smaller by pushing the inner boundary toward the outer boundary does not result in a reduction in generalization error.
The generalization function is ε_f(w, w⁰) = 1 − w·w⁰/3. We write this expression in spherical coordinates and average over the version space given by Θ̃ = {(φ, θ): φ ∈ [a, b], θ ∈ [c, d]} to give

ε_g(Θ̃) = 1 − (1/(d − c)²) {λ [cos(d) − cos(c)]² + [sin(d) − sin(c)]²},

where λ = 2[1 − cos(b − a)]/(b − a)². To violate reducivity we look for regions such that when we reduce the width of, for example, the interval [c, d], the generalization error increases. Without loss of generality, we search for regions for which ∂ε_g(Θ̃)/∂c > 0, and we plot one such region in Figure 1. To find such a region explicitly, we look for the boundary
at which ∂ε_g(Θ̃)/∂c = 0, and define Λ(c, d), given by the equation λ = Λ [∂ε_g(Θ̃)/∂c = 0], which is

Λ = [sin c − sin d + (d − c) cos c] / [cos d − cos c + (d − c) sin c].
In Figure 2a, we show how this relates to reducivity. In region (1), Λ varies between 0 and 1, and ∂ε_g(Θ̃)/∂c can be of either sign, depending on the value of λ; thus in region (1), reducivity depends critically on δ = b − a. For λ > Λ, ∂ε_g(Θ̃)/∂c < 0, and for λ < Λ, ∂ε_g(Θ̃)/∂c > 0. In both regions (2) and (3), Λ ∉ [0, 1] and, as λ ∈ [0, 1] (Fig. 2b), the sign of ∂ε_g(Θ̃)/∂c is fixed, independent of [a, b]. In fact, in regions (2) and (3), reducivity is guaranteed. In region (2), as δ decreases (i.e., [a, b] shrinks), ∂ε_g(Θ̃)/∂c becomes increasingly negative, whereas in region (3), for decreasing δ, ∂ε_g(Θ̃)/∂c becomes less negative. The boundary between regions (2) and (3) is given by the solution of cos d − cos c + (d − c) sin c = 0. Despite the simplicity of the example, the behavior of reducivity on the sphere is nontrivial. At this point, the reader may well conjecture that reducivity would be guaranteed for convex regions Θ̃ and Θ̃′ ⊂ Θ̃.⁸ Perhaps somewhat surprisingly, we demonstrate in the next section that convexity is not a sufficient condition for reducivity.
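The closed form for ε_g(Θ̃) can be evaluated directly; the sketch below (illustrative) reproduces the observation of Figure 1: with φ covering the full circle (so λ = 0), pushing the inner θ-boundary toward the outer one shrinks the version space yet increases the generalization error.

```python
import numpy as np

# Closed-form eps_g for the version space {phi in [a,b], theta in [c,d]}, n = 3,
# as derived in the text.
def eps_g(a, b, c, d):
    lam = 2 * (1 - np.cos(b - a)) / (b - a) ** 2
    A = np.cos(d) - np.cos(c)
    B = np.sin(d) - np.sin(c)
    return 1 - (lam * A ** 2 + B ** 2) / (d - c) ** 2

# Figure 1 region: theta in [0.4, 0.6] with phi the full circle (lam = 0).
big   = eps_g(0.0, 2 * np.pi, 0.4, 0.6)
small = eps_g(0.0, 2 * np.pi, 0.5, 0.6)   # inner theta-boundary pushed outward
print(big, small)                          # the SMALLER version space has larger error
```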
3.2 Euclidean Approximation to the Version Space. For simplicity, we concentrate on version spaces small enough such that the region can be considered Euclidean. For the linear perceptron described above, this corresponds to a region small enough such that the curved surface of the hypersphere appears "flat." By writing w = c + W and w⁰ = c + W⁰, where c lies in the space Θ̃, we have ε_f = (W − W⁰)²/2n, and

ε_g(Θ̃) = (1/2n) ⟨(W − W⁰)²⟩_{W,W⁰∈Θ̂},

where Θ̂ is the approximately flat region on the sphere. As W and W⁰ are uncorrelated, this can be written in the form

ε_g(Θ̃) = (1/n) [⟨W²⟩_{W∈Θ̂} − (⟨W⟩_{W∈Θ̂})²].

We now consider an infinitesimal decrease in the space, Θ̂′ = Θ̂ − Δ. For a uniform distribution over the space, and ignoring terms in Δ², we can write, with a slight abuse of notation,

ε_g(Θ̂ − Δ) = ε_g(Θ̂) + (A/nS) [⟨W²⟩_{W∈Θ̂} − ⟨W²⟩_{W∈Δ}],    (3.1)

⁸In general, a region is convex if the geodesic connecting any two points lies wholly within the region itself.
Figure 2: The version space is the region on the sphere given by Θ̃ = {(φ, θ): φ ∈ [a, b], θ ∈ [c, d]}. (a) In (1) reducivity depends on the region [a, b]. In (2) and (3) reducivity is guaranteed [∂ε_g(Θ̃)/∂c < 0]. In (2), as [a, b] shrinks, ∂ε_g(Θ̃)/∂c becomes more negative, and vice versa in region (3). The region c > d is unphysical. (b) The function λ versus δ = b − a.
where A and S are the surface contents of Δ and Θ̂, respectively. In equation 3.1, we have assumed, without loss of generality, that ⟨W⟩_{W∈Θ̂} = 0, i.e., that the origin, c, is taken to be the centroid of Θ̂. Reducivity then holds for the condition

⟨W²⟩_{W∈Δ} > ⟨W²⟩_{W∈Θ̂}.    (3.2)
Figure 3: Counterexample used to show that convexity is not a sufficient condition for reducivity. We take the hypotenuse to have length 2. The cross marks the position of the teacher for the example of reducivity violation for a given teacher.
Note that this is a general condition, holding for any dimension. Using this, we can show that convexity (for the linear perceptron at least) is not a sufficient condition for reducivity to hold. To do this, we observe that equation 3.2 will not be satisfied for regions, Δ, sufficiently close to the centroid, since then the left-hand side of equation 3.2 will be small. This observation leads to the following two-dimensional counterexample. Let the convex region Θ̂ be the larger triangle as shown in Figure 3. By explicit calculation, one finds nε_g(tri) = 2/9 for the marked angle γ = π/2. We now take Θ̂′, a convex subset of Θ̂, to be the trapezium as shown, for which, in the limit h → 0, nε_g(trap) = 1/3. Hence ε_g(Θ̂′) > ε_g(Θ̂), demonstrating the insufficiency of convexity as a condition for reducivity.⁹ At this point we refer to Section 2.1 and note that we can readily find an example of a fixed teacher for which an increase in the student's knowledge results in an increase in ε_g(Θ̃, θ⁰). In the above trapezium/triangle example, consider a very flat triangle, for which γ tends to π. We take the teacher to be positioned at the cross marked in Figure 3, for which ε_g(×, tri) = 1/6. Taking again Θ̂′ to be the infinitely thin trapezium, we have ε_g(×, trap) = 1/3, which is larger than ε_g(×, tri).
⁹Note that the "distance" measure ε_f = (w − w⁰)²/2n is not a metric (it does not satisfy the triangle inequality). For the metric ε_f = |w − w⁰|/2n, nε_g(tri) = 0.29 and nε_g(trap) = 0.32, such that reducivity is still violated, though not as severely.
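The counterexample can be confirmed by Monte Carlo, since in the Euclidean approximation nε_g is just the total variance of a uniform distribution over the region. The sketch below assumes the right isoceles triangle with vertices (−1, 0), (1, 0), (0, 1) (hypotenuse of length 2; the specific placement is a choice made for illustration) and the h → 0 trapezium collapsed onto the hypotenuse.

```python
import numpy as np

# Monte Carlo check of the triangle/trapezium counterexample:
# n * eps_g = <W^2> - <W>^2 (total variance) in the euclidean approximation.
rng = np.random.default_rng(2)
N = 400000

u = rng.random((N, 2))
flip = u.sum(axis=1) > 1
u[flip] = 1 - u[flip]                       # uniform on the triangle (0,0), (1,0), (0,1)
tri = u @ np.array([[2.0, 0.0], [1.0, 1.0]]) + np.array([-1.0, 0.0])
# affine map to the right triangle (-1,0), (1,0), (0,1): hypotenuse of length 2

trap = np.stack([2 * rng.random(N) - 1, np.zeros(N)], axis=1)   # h -> 0 trapezium

def n_eps_g(s):
    return float(np.sum(np.var(s, axis=0)))

v_tri, v_trap = n_eps_g(tri), n_eps_g(trap)
print(v_tri, v_trap)    # ~2/9 and ~1/3: the convex SUBSET has the larger error
```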
The geometry of the above situation may appear somewhat pathological. Such nonreducive situations can, however, be constructed for essentially any version space Θ̃. In passing, we mention another example to help clarify the situation. For a two-dimensional ellipse with minor and major axes a and b, respectively, one readily finds ⟨W²⟩_ellipse = (a² + b²)/4. We see then that for a circle (b = a), all infinitesimal enlargements of the circle are "expansions" in the sense that they satisfy equation 3.2. For an ellipse (b > a) we can violate equation 3.2 by choosing the point on the perimeter about which we wish to expand to be close to the centroid (⟨W²⟩_Δ = a²) with b > √3 a. We note that this violation of reducivity occurs for an eccentricity (b/a) that is not much larger than unity. In general, such nonexpansive enlargements can occur for the following reason: the centroid represents the best-guess student (within the Euclidean approximation); adding space as close as possible to this student increases the weight on the distribution of weight space close to this best guess, decreasing ε_g. By examining equation 3.1, we note that the greatest decrease in generalization error is to be found for a region Δ farthest away from the centroid of the set. This is in line with the intuitive notion that we can improve generalization most by increasing our knowledge about the teacher in those regions that contribute most to the generalization error. One way to obtain this knowledge is to choose an input x such that the reply from the teacher yields information about the teacher in the desired region; this is the concept of query learning (see, e.g., Sollich 1994). The previous arguments have been aimed at infinitesimal, local alterations to Θ̂, and we consider briefly an example of global enlargement. We envisage situations in which the boundary of Θ̂ can be expressed in a spherical coordinate system, r = r(φ, θ, ...), which is the case for convex regions. The enlarged version space Θ̂′ can then be defined by a new boundary, r′ = Λ(φ, θ, ...) r(φ, θ, ...), for some Λ(φ, θ, ...) > 1. Assuming we can bound Λ by some extremum values, Λ_min < Λ(φ, θ, ...) < Λ_max, it is then a simple matter to form an inequality such that the generalization error of the larger version space is greater than the generalization error of the smaller. For an enlargement Λ(φ, θ, ...) that preserves the origin as the centroid of both Θ̂ and Θ̂′, we require for reducivity, in the two-dimensional case, Λ²_min > Λ_max; sufficient, but by no means necessary.
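The ellipse example is equally easy to check numerically; the sketch below (illustrative) estimates ⟨W²⟩ = (a² + b²)/4 by uniform sampling and tests the expansion condition ⟨W²⟩_Δ = a² > ⟨W²⟩ at the end of the minor axis, which fails exactly when b > √3 a.

```python
import numpy as np

rng = np.random.default_rng(3)

def mean_sq_radius(a, b, m=400000):
    # uniform samples inside an ellipse with semi-axes a and b
    r = np.sqrt(rng.random(m))
    phi = 2 * np.pi * rng.random(m)
    x, y = a * r * np.cos(phi), b * r * np.sin(phi)
    return float(np.mean(x ** 2 + y ** 2))       # ~ (a^2 + b^2)/4

a = 1.0
ms_circle  = mean_sq_radius(a, 1.0)              # ~0.5
ms_ellipse = mean_sq_radius(a, 2.0)              # ~1.25 (here b = 2 > sqrt(3)*a)
# an enlargement at the end of the minor axis has <W^2>_Delta = a^2 = 1
print(a ** 2 > ms_circle, a ** 2 > ms_ellipse)   # True False
```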
3.3 Sign-Constrained Weights. Until now we have considered low-dimensional version spaces; here we calculate the generalization error of an infinitely large perceptron for a specific weight constraint. The sign of each weight is predetermined according to sgn(w_i) = p_i, where each p_i (i = 1..n) is either ±1. This constraint has been studied previously in the context of pattern storage for the Hopfield network, for which it was found that the capacity was half that without the sign constraints (Amit et al. 1989).
Figure 4: Comparison of the generalization error for the spherical constraint and the spherical-sign constraint. The curves beginning at 1 for α = 0 are the spherical constraint; the spherical-sign curves begin at 1 − 2/π for α = 0.

By writing the output of the perceptron as y = Σ x_i sgn(w_i)|w_i|, where |·| is the modulus, and transforming the inputs according to x′_i = p_i x_i, the output can be written y = Σ x′_i |w_i|. As the input distribution is gaussian and hence symmetric, the analysis of the sign constraint is equivalent to that of constraining the weights to be positive. In addition, we retain the spherical constraint. The method of calculation is that of statistical mechanics, following closely the exposition given in Seung et al. (1992). This will enable us to obtain results for any temperature, and without recourse to the Euclidean approximation employed in Section 3.2. As is required in statistical mechanics calculations, we define the limit of the dimension of the perceptron such that the number of training patterns is proportional to the dimension of the perceptron, i.e., p = αn. A sketch of the calculation is given in the Appendix; as the calculation follows so closely that given by Seung et al. (1992), we refer the reader to that work, and point out only the major differences between our and their analyses. For the spherical constraint alone, the dimension of the version space (T = 0) reduces linearly with α, resulting in a linear reduction of the generalization error, ε_g = 1 − α, α ≤ 1. For the spherical-sign constraint, however, boundary effects result in a small deviation from linearity (Fig. 4). For T = 0 and α ≥ 1, the subspace of solutions collapses to a single point and ε_g = 0. Nonzero T results in an increase in generalization error,
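The zero-temperature predictions can be probed at finite n without the replica machinery. The sketch below (illustrative, not the statistical mechanics calculation) samples the spherical version space directly at α = 0.5, where ε_g should be close to 1 − α, and samples the positive orthant of the sphere at α = 0 for the sign constraint, where ε_g should approach 1 − 2/π.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400

# (i) Spherical constraint, T = 0: sample students with zero training error and
# w.w = n, and estimate eps_g; the prediction is eps_g = 1 - alpha.
alpha = 0.5
p = int(alpha * n)
w0 = rng.normal(size=n); w0 *= np.sqrt(n) / np.linalg.norm(w0)
X = rng.normal(size=(p, n))                      # gaussian training inputs
Q, _ = np.linalg.qr(X.T)                         # orthonormal basis of the row space
u = Q @ (Q.T @ w0)                               # component of w0 fixed by the data

def student():
    g = rng.normal(size=n)
    v = g - Q @ (Q.T @ g)                        # random null-space direction
    v *= np.sqrt(n - u @ u) / np.linalg.norm(v)
    return u + v                                 # X w = X w0 and w.w = n

m = np.mean([student() for _ in range(2000)], axis=0)
eps_sph = 1 - m @ m / n
print(eps_sph, 1 - alpha)                        # both ~0.5

# (ii) Spherical-sign constraint at alpha = 0: students uniform on the
# positive orthant of the sphere; eps_g(0) should approach 1 - 2/pi.
G = np.abs(rng.normal(size=(5000, n)))
W = np.sqrt(n) * G / np.linalg.norm(G, axis=1, keepdims=True)
mw = W.mean(axis=0)
eps_sign0 = 1 - mw @ mw / n
print(eps_sign0, 1 - 2 / np.pi)                  # both ~0.363
```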
David Barber and David Saad
affecting both the spherical and spherical-sign constraints similarly, such that for a given (α, T), ε_g^sign ≤ ε_g^sph. For α = 0, there is no information about the teacher other than that imposed by the a priori constraint, and we have ε_g^sph = 1 and ε_g^sign = 1 − 2/π.

4 Summary
We have examined the effect of constraints on the generalization error of simple learning systems, concentrating in particular on the linear perceptron. Assuming that both the student and teacher lie in the version space defined by the constraints, we studied what effect tightening the constraint, and thereby shrinking the version space, has on the generalization error. For a connected one-dimensional case, in which we assumed that the error function is simply a monotonically increasing function of the separation between the student and teacher, we showed that decreasing the version space necessarily decreases the generalization error. This, however, is not the case for higher dimensional version spaces, for which we presented an explicit counterexample. Furthermore, neither convexity of the version spaces nor a metric generalization function is sufficient for the smaller version space to have lower generalization error. In general it is a nontrivial problem to predict whether reducing the version space will reduce the generalization error, and each case must be treated explicitly. We found that the generalization error of the spherical linear perceptron decreases under the additional weight-sign constraint.
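The two limiting values quoted above, ε_g = 1 − α for the spherical constraint at T = 0 and ε_g = 1 − 2/π at α = 0 under the spherical-sign constraint, lend themselves to a quick numerical check. The following Monte Carlo sketch is our illustration, not part of the paper; the dimension, load, and random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# (1) Spherical constraint at T = 0: a student that matches the teacher on
# p = alpha*n gaussian patterns and is random (but norm-constrained) in the
# orthogonal complement has overlap R ~ alpha, i.e., eps_g = 1 - R ~ 1 - alpha.
alpha = 0.5
p = int(alpha * n)
X = rng.standard_normal((p, n))                # p gaussian training patterns
w0 = rng.standard_normal(n)
w0 *= np.sqrt(n) / np.linalg.norm(w0)          # teacher on the sphere |w|^2 = n
Q, _ = np.linalg.qr(X.T)                       # orthonormal basis of span(X)
w_par = Q @ (Q.T @ w0)                         # component fixed by the data
z = rng.standard_normal(n)
w_perp = z - Q @ (Q.T @ z)                     # random null-space component
w_perp *= np.sqrt(n - w_par @ w_par) / np.linalg.norm(w_perp)
w = w_par + w_perp                             # a typical version-space student
eps_sph = 1.0 - (w @ w0) / n
print(eps_sph)                                 # close to 1 - alpha

# (2) Spherical-sign constraint at alpha = 0: two random positive vectors on
# the sphere have typical overlap 2/pi, so eps_g = 1 - 2/pi ~ 0.363.
overlaps = []
for _ in range(50):
    a = np.abs(rng.standard_normal(n))         # random positive teacher
    b = np.abs(rng.standard_normal(n))         # random positive student
    overlaps.append((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))
eps_sign = 1.0 - np.mean(overlaps)
print(eps_sign)                                # close to 1 - 2/pi
```

Both estimates concentrate on the quoted values as n grows, since ε_g = 1 − R for the linear perceptron considered here.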
Appendix

The sign constraint calculation follows closely that presented in Seung et al. (1992); rather than entering into great detail, we refer the reader to that work and sketch here the main differences between the two calculations. The free energy is separated into two terms, F = G_0 − αG_r, where only the term G_0 is affected by the constraints upon the weights. We use the same notation for the order parameters as in Seung et al. (1992): q is the normalized overlap between two replicas and R is the overlap between the student and the teacher; q̂ and R̂ are the conjugate order parameters arising from the definitions of q and R. We write G_0 as
G_0 = −(1/2)(1 − q)q̂ − R R̂ + ⋯

Dz is the n-dimensional gaussian measure, Dz = (2π)^(−n/2) exp(−z·z/2) dz. The
weight vector distribution for the sign constraint is given by
P(w) = (2^n / V) δ(w·w − n) Θ(w)

and the corresponding measure is dμ(w) = P(w) dw, where V is the surface area of the n-sphere and Θ(·) is the theta function. Introducing the integral representation of the delta function (which gives rise to the parameter λ̂) and performing the saddle point approximation, we find that G_F = G_F^sph + Σ_{i=1}^{n} J_i(λ̂, q̂, R̂; w_i^0), where G_F^sph is the contribution to the free energy given by the normal spherical constraint, as calculated in Seung et al. (1992). Each J_i retains an explicit dependence on the teacher weight w^0, which we remove by averaging over teachers having the same measure as the students; the averaged result involves the combination √((λ̂ + q̂)/(q̂ + R̂^2)). For completeness, we state the further results necessary to find the free energy, namely
G_sph = λ̂ − (1/2)(1 − q)q̂ − R R̂ − (1/2) ln(4λ̂) + (R̂^2 + q̂)/(4λ̂)

G_r = −(1/2) ln[1 + β(1 − q)] − β(q − 2R + 1) / (2[1 + β(1 − q)])
The order parameters at T = 1/β are found numerically by extremizing the free energy. The generalization error is then found from the relation ε_g = 1 − R.

Acknowledgments

We thank Peter Sollich for many stimulating discussions.

References

Amit, D. J., Wong, K. Y. M., and Campbell, C. 1989. Perceptron learning with sign-constrained weights. J. Phys. A 22, 2039.

Haussler, D. 1994. The probably approximately correct (PAC) and other learning models. In Foundations of Knowledge Acquisition: Machine Learning, A. Meyrowitz and S. Chipman, eds. Kluwer Academic Publishers, Boston.

Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.

Schwartz, D. B., Solla, S. A., and Samalam, V. K. 1990. Exhaustive learning. Neural Comp. 2(3), 374.

Seung, H. S., Sompolinsky, H., and Tishby, N. 1992. Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056-6091.

Sollich, P. 1994. Query construction, entropy and generalization in neural network models. Phys. Rev. E 49, 4637-4651.

Watkin, T. L. H., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Modern Phys. 65, 499-556.

Wolpert, D. H. 1992. On the connection between in-sample testing and generalization error. Complex Syst. 6, 47-94.

Received July 22, 1994; accepted May 31, 1995.
Communicated by John Rinzel
ARTICLE
Encoding with Bursting, Subthreshold Oscillations, and Noise in Mammalian Cold Receptors

André Longtin
Karin Hinzer
Département de Physique, Université d'Ottawa, 150 Louis Pasteur, Ottawa, Ontario, Canada K1N 6N5
Mammalian cold thermoreceptors encode steady-state temperatures into characteristic temporal patterns of action potentials. We propose a mechanism for the encoding process. It is based on Plant's ionic model of slow wave bursting, to which stochastic forcing is added. The model reproduces firing patterns from cat lingual cold receptors as the parameters most likely to underlie the thermosensitivity of these receptors are varied over a 25°C range. The sequence of firing patterns goes from regular bursting, to simple periodic firing, to stochastically phase-locked firing or "skipping." The skipping at higher temperatures is shown to necessitate an interaction between noise and a subthreshold endogenous oscillation in the receptor. The basic period of all patterns is robust to noise. Further, noise extends the range of encodable stimuli. An increase in firing irregularity with temperature also results from the loss of stability accompanying the approach of the slow dynamics to a reverse Hopf bifurcation. The results are not dependent on the precise details of the Plant model, but are generic features of models in which an autonomous slow wave arises through a Hopf bifurcation. The model also addresses the variability of the firing patterns across fibers. An alternate model of slow-wave bursting (Chay and Fan 1993), in which skipping can occur without noise, is also analyzed here in the context of cold thermoreception. Our study quantifies the possible origins and relative contributions of deterministic and stochastic dynamics to the coding scheme. Implications of our findings for sensory coding are discussed.

1 Introduction
Mammalian cold receptors exhibit a fascinating array of firing patterns at different steady-state temperatures over a 25°C range. Figure 1 illustrates how constant temperatures are encoded by lingual cold receptors of the cat (Schafer et al. 1988). As temperature increases from 15°C, the interburst period, the duration of the active phase of the burst, and the number of spikes per burst decrease (Fig. 1A). At low temperatures, half

Neural Computation 8, 215-255 (1996)
© 1996 Massachusetts Institute of Technology
Andre Longtin and Karin Hinzer
Figure 1: (A) Characteristic discharge patterns of bursting cold receptors of the cat lingual nerve at different constant temperatures (from Fig. 1 of Schafer et al. 1988). The steady-state patterns are recorded at least two minutes after each 5°C change. Interspike intervals are digitized with a resolution of 1 msec. The in vivo recording method is described in Bade et al. (1979). (B) Same as in (A), but for a different single fiber. The mean periods of the firing patterns are different from those at corresponding temperatures in (A), highlighting the variability across fibers. Here skipping appears around T = 30°C; double spikes are sometimes seen (from Fig. 1 of Braun et al. 1990). Note the increase of the interspike intervals during the active phases of the bursting patterns. (C) Interval histograms (bin width is 5 msec) and spike sequences at constant temperatures above and below the regular bursting temperature range (from Braun et al. 1980). (Figure 1A reproduced with permission of Elsevier Science BV. Figures 1B and 1C reproduced with permission of Springer-Verlag.)
Encoding in Mammalian Cold Receptors
the fibers show sporadic firing, while others are either silent or exhibit bursting across the low temperature range (Fig. 1B,C). At mid-to-high temperatures, regular trains of mostly single spikes can be seen. In the higher temperature range, the so-called "irregular" discharge patterns are observed: spikes are phase-locked to an underlying periodic oscillation in the receptor, but a random integer number of cycles of this oscillation is skipped between spikes. These skipping patterns produce multimodal interspike interval histograms (ISIHs) with peaks at integer multiples of the period of the receptor oscillation (often up to eight modes can be seen). The bursting patterns yield ISIHs with events grouped around the intraburst and interburst periods. There is a large variability in response across afferent fibers from a population of cold receptors. A given fiber will exhibit one, or many, but seldom all of these patterns as temperature is increased. Furthermore, the temperature at which a given pattern is observed varies across fibers. The smoothness of the transition between these firing patterns as temperature is increased is striking. Another interesting feature of these data is the interplay of precisely timed patterns and aperiodicity, the latter being manifest in the skipping, and in the fluctuation of the number of spikes per burst and of the intra- and interburst periods. Finally, the variability across fibers, not fully represented in Figure 1, is intriguing. This raises the question of whether a model for the ionic events of bursting can account (1) for the aperiodicity seen in single fibers, e.g., by producing complex periodic patterns with long transients or deterministic chaos, and (2) for the variability across fibers, e.g., through high sensitivity of dynamic behaviors to parameter fluctuations. Another possibility is that high-dimensional noise is required to obtain a satisfactory description.
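The skipping statistics described above can be illustrated with a deliberately minimal caricature (ours, not the model developed in this paper): suppose the receptor can fire only near the crests of its endogenous oscillation of period T0, and that at each crest it fires independently with some probability p_fire. Interspike intervals then fall at integer multiples of T0, producing a multimodal ISIH. All values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
T0, p_fire, n_crests = 100.0, 0.4, 20000   # period in msec; values illustrative

fires = rng.random(n_crests) < p_fire       # Bernoulli firing at each crest
spike_times = T0 * np.flatnonzero(fires)
isis = np.diff(spike_times)

# Number of oscillation cycles between successive spikes (1 = no skip).
modes = (isis / T0).astype(int)
counts = np.bincount(modes, minlength=7)[1:6]
print(counts)   # ISIH mode heights fall off geometrically
```

In this caricature the mode heights decay geometrically with the number of skipped cycles; in the real data the modes are broadened by jitter but remain centered on multiples of the oscillation period.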
In particular, if noise is important, why does it significantly affect high temperature patterns, e.g., by deleting spikes from an otherwise periodic pattern, without much altering the basic bursting patterns at mid-to-lower temperatures? The issue is further complicated by the fact that temperature ultimately affects noise levels as well as the rates of various kinetic processes. One must then decide how to model the main effects of temperature. Cold receptors are free nerve endings, and their small size has precluded direct recording of the ionic events governing their excitability. This means that intracellular voltage time series are not available: one has access only to spike trains, i.e., sequences of times at which firings occur. Perhaps as a consequence of this fact, no detailed ionic models have, to our knowledge, been proposed to account for the dynamic behaviors shown in Figure 1. The same holds for other similar preparations (see Braun et al. 1984a), such as the ampullae of Lorenzini, whose patterns are similar to those in Figure 1, but follow different sequences as temperature varies. The known models have been more descriptive than mathematical (Braun et al. 1990; Schafer et al. 1990). Many studies of cold receptors have concluded that their mechanisms
of bursting as well as their thermosensitivity should be similar to those of pacemaker cells of molluscs. In this paper, we show that a stochastic version of Plant's model for slow wave bursting (Plant 1981), originally proposed for the pacemaker activity of the R15 cell of Aplysia, exhibits the proper array of firing patterns as certain parameters are varied together in step with the temperature. Our study addresses the dynamic mechanism behind the patterns of Figure 1, the origin of the aperiodicity of the firing pattern for a given cell, and the variability across cells. The paper is organized as follows. Section 2 reviews the encoding properties of cold receptors. Section 3 is a summary of the relevant physiology of cold receptors. The deterministic and stochastic aspects of our model are explained in Section 4. The simulated firing patterns are analyzed in Section 5. The enhanced importance of noise at higher temperatures is described in Section 6, along with other sources of variability of firing patterns within a given receptor and across a population of receptors. The effect of the maximal conductance of the slow current is discussed in Section 7. This section further considers an alternate mechanism of thermoreception based on the model of Chay and Fan (1993), in which skipping arises without noise. The role of noise and aperiodicity in the encoding process is discussed in Section 8, and the paper concludes in Section 9 with a summary of results and an outlook toward future investigations.
2 Encoding by Cold Receptors

Neurons involved in thermoreception at the periphery must operate over a large range of temperatures. A cold receptor is defined as a cell whose mean firing rate increases as the temperature decreases below the normal physiological set point. However, the curve of adapted mean firing rate versus temperature has a maximum, and the mean rate decreases again when the temperature is sufficiently cold (see, e.g., Hensel 1974). This leads to an ambiguous determination of temperature by higher order neurons, since a given mean rate corresponds to two different temperatures. Iggo and Iggo (1971) were the first to suggest that the different temporal patterns of firing seen at higher and lower temperatures could be used centrally to resolve this ambiguity. It is known that small variations in steady skin temperature do alter thermoregulatory responses, demonstrating that accurate information about steady temperatures is available centrally. In fact, Dykes (1975) has shown that the temporal organization of the impulses provides sufficient information during steady temperatures to account for observed thermoregulatory responses. It has also been shown that different steady temperatures at the periphery modify the ISIH of neurons in the preoptic area and hypothalamus, even though their mean firing rate may not vary. There is further evidence that
the static firing patterns of central neurons in the medulla and thalamus closely approximate those seen in the periphery (Poulos 1981). The temporal pattern of spikes in the adapted discharge has come under increasing scrutiny as an oscillatory theory for temperature encoding became attractive (Braun et al. 1980). Based on anatomical, physiological, and pharmacological evidence, these authors have postulated that mammalian cold receptors use noise and an internal or "endogenous" oscillation to encode steady-state temperatures. This same group has recently shown that shark multimodal sensory cells rely on the interplay of noise and oscillating neural activity to differentiate between thermal and electrical stimuli (Braun et al. 1994). According to this theory, the amplitude (throughout our study, we mean "peak-to-peak amplitude") of the endogenous oscillation in the cold receptor dips below the spiking threshold at high temperature, opening the way for noise-induced firing in synchrony with this oscillation. This scheme was confirmed in their analog simulation study (Braun et al. 1984b; see below). It is interesting to note that skipping is typically seen in other sensory modalities that encode oscillatory rather than constant stimuli (e.g., Rose et al. 1967; Scheich et al. 1973). These latter neurons are tunable in the sense that their ISIH will display modes lined up with integer multiples of the imposed oscillation period. In the cold receptors, there appears to be an endogenous oscillation, the characteristics of which are affected by temperature. Thus the bursting pattern, the mean rate, the subthreshold oscillation frequency, and the actual interval distribution can carry information centrally. If one assumes, as do Braun et al. (1984a), that noise extends the encoding range by producing skipping, it would appear to play a useful role. This notion of "useful noise" has also been suggested in other studies.
For example, noise can smooth out nonlinear behaviors due, e.g., to phase-locking or rectification, and thus linearize input-output relationships (Spekreijse 1969; French et al. 1972). It has also been shown to enhance the detectability of subthreshold signals (Hochmair-Desoyer et al. 1984; Longtin 1993; Chialvo and Apkarian 1993; Douglass et al. 1993).
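The detectability point can be illustrated with a small threshold-crossing sketch (our toy example; the threshold, signal, and noise values are arbitrary): a sinusoid whose peak lies below threshold never fires on its own, but with added gaussian noise the threshold crossings cluster near the signal crests, in the spirit of the skipping mechanism discussed above.

```python
import numpy as np

rng = np.random.default_rng(3)
dt = 0.1
t = np.arange(0.0, 2000.0, dt)
period = 50.0
signal = 0.8 * np.sin(2 * np.pi * t / period)   # peak 0.8 < threshold 1.0

def crossing_times(noise_sd):
    """Upward crossings of threshold 1.0 by signal plus white gaussian noise."""
    v = signal + noise_sd * rng.standard_normal(t.size)
    up = (v[1:] >= 1.0) & (v[:-1] < 1.0)
    return t[1:][up]

n_silent = crossing_times(0.0).size      # noiseless: never reaches threshold
spikes = crossing_times(0.3)             # noisy: fires, clustered near crests
phase = (spikes % period) / period       # firing phase within the signal cycle
print(n_silent, spikes.size)
```

The mean firing phase sits near the crest of the sinusoid (phase 0.25), showing the phase preference that underlies skipping in the noisy case.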
3 Summary of Relevant Electrophysiology of Cold Receptors
The following facts are based on anatomical, pharmacological, and electrophysiological evidence. This evidence is indirect, as it is provided uniquely by extracellular recordings. It has been hypothesized that cold receptor activity in a variety of mammals is governed by identical mechanisms (see Schafer et al. 1990 and references therein). There is evidence that the oscillatory processes in cold receptors are independent of whether or not action potentials are generated (Schafer et al. 1988). Also, the mean durations of the active and silent phases of the bursting pattern are not affected by the number of spikes per burst. The periodic
activity of the cold receptor is very sensitive to calcium, with increases in extracellular calcium diminishing the number of spikes per burst. The passive conductances are thought to involve a calcium-activated outward conductance, based on the dependence of the firing activity on the level of external calcium (Schafer et al. 1982). This outward conductance, which is presumably a potassium conductance, counteracts the depolarization induced by an accumulation of intracellular calcium through a voltage-dependent low threshold slow calcium conductance. It has been reported that the properties of the channel associated with the slow inward current are more of the LVA type (Schafer et al. 1988). It is not clear whether other ions such as sodium are also involved in this slow current. The action potential upstroke seems to involve sodium rather than calcium (Schafer et al. 1988). Also, the interspike intervals typically increase throughout the active phase. The action potentials are initiated in the afferent axon. Willis et al. (1971) have proposed that pacemaker neurons of the invertebrate Aplysia may be excellent models of mammalian thermoreceptors. An excellent review of this question can be found in Wiederhold and Carpenter (1982). The implication is that the thermosensitivity of cold receptors is governed by the same mechanisms as that of pacemaker neurons. Their discharge is not dependent on the activity of other cells. Apart from the well-known increase in the rate of the kinetic processes modeled by Hodgkin-Huxley-type gating variables (Hodgkin and Huxley 1952), their thermosensitivity rests on two basic mechanisms (Carpenter 1981; Braun et al. 1990). The ratio of the maximum permeabilities G_Na/G_K increases with temperature, since the Q10 of G_Na is higher than that of G_K. The ensuing depolarizing effect confers a "warm receptor" character to these cells. Also, the activity of the electrogenic Na/K pump increases with temperature.
As this pump has a hyperpolarizing effect, it underlies the "cold receptor" property. The combination of these two effects could then be responsible for the bell-shaped "mean rate vs. temperature" characteristic of cold receptors. Ouabain, a known inhibitor of this pump, causes an increase in the mean firing rate of cold receptors (Pierau et al. 1975). This result confirmed the presence of an electrogenic Na/K pump in these receptors. Carpenter and Alving (1968) have studied the millivolt-per-degree-Celsius contribution of this pump to the resting potential of Aplysia neurons. The critical firing threshold in Aplysia pacemaker neurons has been shown to remain constant despite the changes in resting potential seen at different temperatures (Carpenter 1967). Finally, the mean level of depolarization and the temperature are thought to have the main influences on the (peak-to-peak) amplitude, frequency, and mean value of the endogenous oscillation. These characteristics ultimately determine the firing patterns.
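A toy calculation (invented constants, not the authors' equations) shows how these two opposing mechanisms, a Q10-amplified depolarizing drive and a pump-dominated loss of excitability upon warming, can combine into the bell-shaped mean-rate-vs-temperature characteristic:

```python
import numpy as np

def mean_rate(T):
    """Toy adapted mean firing rate at temperature T (deg C); constants invented."""
    kinetics = 3.0 ** ((T - 25.0) / 10.0)   # Q10 = 3 speed-up of gating kinetics
    excitability = max(40.0 - T, 0.0)       # pump hyperpolarization wins at high T
    return kinetics * excitability

temps = np.arange(5.0, 46.0, 1.0)
rates = np.array([mean_rate(T) for T in temps])
T_peak = temps[rates.argmax()]
print(T_peak)   # the rate maximum falls at an intermediate temperature
```

Because the rate curve is non-monotonic, a given mean rate corresponds to two temperatures, which is exactly the decoding ambiguity that the temporal firing patterns are thought to resolve.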
4 Model

4.1 Deterministic Dynamics.

4.1.1 Modeling Choices. Since the physiological data are indirect, we seek a mechanism underlying the patterns of Figure 1 that is general rather than a property of a very specific model. This is especially desirable in view of the aforementioned variability across fibers. Our goal is to find an ionic model that incorporates an endogenous slow oscillation as well as action potential dynamics, and that reproduces the patterns in Figure 1A as parameters are varied with temperature in a biophysically plausible way. The analog computer model of Braun et al. (1984b) is a useful guide toward such a model. It illustrates how a sine wave, a dc component, and noise at the input of a leaky integrator neuron can, in the right proportions, produce all the observed firing patterns. In their model, skipping arises when noise interacts with the subthreshold oscillation. There are many classes of models of bursting (Rinzel 1987; Chay et al. 1995). One possible classification is based on whether or not the bursting pattern depends on the firing of action potentials (Rinzel and Lee 1986). For spike-driven bursting, the fast dynamics of the system exhibits bistability. Bursting then occurs when a hysteresis loop is traced out around the two branches as a slow variable (such as calcium concentration) oscillates. For slow wave bursting, there are disjoint regions in the subspace of the slow subsystem in which the fast subsystem (governing action potentials) is either in a steady-state/excitable mode or in a repetitively firing mode. The physiological data, summarized in Sections 2 and 3, strongly suggest that the bursting does not rely on action potentials, and therefore that a slow wave bursting mechanism is appropriate. Plant (1981) has proposed a model for bursting in the R15 pacemaker cell of Aplysia.
In this model, the intracellular calcium concentration [Ca²⁺]_i is coupled to the membrane potential through its activation of outward currents. It is based on a simplified version of Hodgkin-Huxley-type equations from earlier studies of the effects of calcium on bursting neurons (Plant 1978; Plant and Kim 1976). The following sequence of events gives rise to the bursting. A slow voltage-dependent inward calcium current leads to an accumulation of calcium ions in the cell. When the Ca²⁺ concentration reaches a certain level, a calcium-dependent K⁺ current activates. This causes an outward K⁺ current that repolarizes the cell; consequently, the active phase of the burst and the inward Ca²⁺ current cease. During the interburst period, calcium is removed from the intracellular space, leading to a decrease in the outward K⁺ current. This is accompanied by a slow depolarization, leading to reactivation of the slow calcium current. Eventually the active phase begins again, with activation of the fast currents responsible for the action potentials.
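The slow wave principle can be caricatured in a few lines (our sketch, unrelated to the Plant equations; all constants invented): a fast leaky integrate-and-fire unit fires repetitively only while a slow autonomous wave keeps its drive above rheobase, producing bursts separated by silent interburst phases.

```python
import numpy as np

dt, tau, v_th = 0.1, 5.0, 1.0                 # time step, membrane time constant
t = np.arange(0.0, 2000.0, dt)
drive = 1.2 + np.sin(2 * np.pi * t / 500.0)   # slow autonomous wave, period 500

v, spikes = 0.0, []
for ti, I in zip(t, drive):
    v += dt * (I - v) / tau     # fast leaky integration toward the drive
    if v >= v_th:               # the unit fires only while the wave is high
        spikes.append(ti)
        v = 0.0                 # reset after each spike

isis = np.diff(spikes)
print(len(spikes), isis.min(), isis.max())
```

The interspike intervals split into two widely separated time scales, short intraburst intervals near the wave crest and long interburst gaps while the drive is subthreshold, which is the defining signature of slow wave bursting.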
Other descriptions of slow wave bursting have since been proposed. The more accurate neuronal model of Chay (1983) is built on the Plant model and shares much of its dynamics. However, it allows for the possibility that the calcium concentration is actually a fast variable. Decreasing the rate constants λ of the gating variables in this seven-dimensional model leads to lengthening of the bursting period. Also, a subsequent study of that model (Chay 1984) has shown that cooling through reduction of the λs produces a transition from beating (repetitive firing) to bursting. The period, however, does not increase monotonically as in Figure 1A, although this is probably due to the presence of chaos. A monotonic increase may occur if other parameters are varied concomitantly with temperature. Nevertheless, this model would be an excellent candidate to study the patterns of Figure 1, as are more recent models found in Chay et al. (1995). In particular, the model of Chay and Fan (1993), discussed in Section 7.2, has only one slow variable and does not rely on [Ca²⁺]_i. Bursting in this model is also of slow wave type, and bifurcation sequences can be found that produce a sequence of patterns similar to the one shown in Figure 1A. In spite of these and many other models of slow wave bursting, such as the 11-current model of the R15 pacemaker cell of Aplysia by Canavier et al. (1991), we have opted for the Plant model, for three reasons: (1) we wish to keep the model as simple and general as possible, (2) there exists an excellent analysis of the nonlinear dynamics of the Plant model (Rinzel and Lee 1987) that helps us understand the effect of noise in the vicinity of some of its bifurcations, and (3) this model easily incorporates the mechanisms (Section 3) currently accepted as underlying the thermosensitivity of mammalian cold receptors.
In particular, it produces slow wave bursting in which [Ca²⁺]_i and the slow inward current are "slow variables," as in the recent detailed model of the R15 cell by Canavier et al. (1991, 1993). The model proposed here provides a framework from which to proceed; modifications will surely be straightforward as intracellular electrophysiological data become available.

4.1.2 Extension of the Model of Plant. The differential equations describing Plant's model are given by equations 4.2-4.7 in Section 4.3 below. An extra equation governing the dynamics of the noise, equation 4.8, has been coupled to this system (see Section 4.2). For simplicity, a spike in a realization of the stochastic process equations 4.2-4.8 is counted as a propagated spike if it reaches a threshold chosen as 0 mV (Carpenter 1967). Among other interesting features of the Plant model is its ability to produce bursts in which the successive interspike intervals increase, as seen in the data (Braun et al. 1980). This behavior has been explained by Rinzel and Lee (1987). Since it is not firmly established that the fast inward current in cat lingual cold receptors involves only Na⁺, we keep Plant's original formulation (mixed Na⁺ and Ca²⁺) rather than that of Rinzel and Lee (1987).
Likewise, we adopt the calcium-dependent potassium conductance mechanism for the hyperpolarizing phase of the slow wave rather than a calcium inactivation of the calcium conductance, even though there is now much evidence for the latter mechanism in the R15 cell of Aplysia (see, e.g., Kramer and Zucker 1985; Canavier et al. 1991). Not much difference is expected since these two dynamic pictures are very similar (Rinzel and Lee 1987). The maximal conductance G_I (corresponding to the mixed fast inward currents) as well as the maximal conductance of the fast outward delayed rectifier G_K will be made temperature-dependent, in accordance with the facts reported in Section 3. G_I and G_K are given Q10s of, respectively, 1.4 and 1.1, corresponding to the values for G_Na and G_K obtained by Hodgkin and Keynes (1955). In the following, we will refer to G_I as G_Na for simplicity. We set the slow inward conductance (mixed Na⁺ and Ca²⁺; Plant 1981) to 0.01 (Plant's G_T). If only the λs, the pump current, and the G_Na/G_K ratio are varied, satisfactory agreement with Figure 1A is obtained only over a narrow range of temperatures. Agreement over a wider range is possible if the rate of calcium kinetics ρ is also varied in step with temperature. There is evidence that removal of intracellular calcium is temperature sensitive (Barish and Thompson 1983), with cooling likely to cause a decrease in the rate of removal, i.e., a decrease in ρ in equation 4.7 (Kramer and Zucker 1985). For simplicity, we will vary only ρ, and assign to it the same Q10 as the λs. Many metabolic pumps have been identified in molluscan neurons (see, e.g., Canavier et al. 1991). Based on the facts reported in Section 3, we model only the electrogenic Na/K pump in the hope that this will capture the main effect of temperature on the electrogenic pumps. It is modeled simply as a constant additive current I_p, as suggested in Junge and Stephens (1973).
This current has a hyperpolarizing effect that increases upon warming (Carpenter and Alving 1968); it is not an important effect at low temperatures (Willis et al. 1974). Consequently, and for the sake of simplicity, I_p will be made a linearly decreasing function of temperature. We will not consider variations with temperature of the leak current, which in practice has contributions from various pumps. Also, all other parameters, including reversal potentials and maximal conductances (except G_Na and G_K), will be considered fixed.

4.2 Stochastic Dynamics.
4.2.1 Hypotheses on Variability. Plant's model can exhibit bursting patterns, as well as high frequency and low frequency beating patterns. It can also produce aperiodic patterns as a result of chaotic motion (see Section 6). Chaos occurs in many models of bursting (see, e.g., Chay 1984; Hindmarsh and Rose 1984; Chay and Rinzel 1985; Canavier et al.
1993; for a recent review, see Chay et al. 1995), and may account for some of the aperiodicity in Figure 1. Except for a very narrow parameter range (Section 7.1), Plant's model cannot, to our knowledge, produce skipping patterns, even as the parameters are varied with temperature as described above. Another model in which deterministic skipping occurs is discussed in Section 7.2. Based on the intuitions offered by Braun et al.'s (1984b) analog simulations, and on studies of skipping in other sensory systems (the earliest to our knowledge is by Gerstein and Mandelbrot 1964; see also Hochmair-Desoyer et al. 1984; Longtin 1995a), it appears likely that various sources of noise are influencing the ionic dynamics of the cold receptors. Mathematically, this means that stochastic forces are coupled to the deterministic equations, which are thus converted to generalized Langevin-type equations (Horsthemke and Lefever 1984). These stochastic forces are known to exist in all nerve cells and fall under the common heading of "membrane noise" (DeFelice 1981). Although at a fundamental level temperature determines "noise" levels, in neurophysiology its main effect is usually incorporated into the rate constants of the gating variables in the Hodgkin-Huxley formalism. In other words, temperature appears as a parameter in deterministic equations. It is as if the gating variables amplify the effect of temperature, since their kinetics have high Q10s. However, temperature also produces noise that is not accounted for deterministically. We hypothesize, in the spirit of the analog simulations of Braun et al. (1984b), that skipping occurs at high temperature because (1) the slow wave becomes subthreshold, due mainly to the hyperpolarizing effect of the electrogenic pump, and (2) noise randomly induces spiking by bringing the slow wave sufficiently close to the threshold for the fast spiking dynamics.
In the Plant model, the transition between fixed point and repetitive firing for the fast dynamics has been shown to be of homoclinic type (Rinzel and Lee 1987). Thus, noise induces random crossings of this boundary, with the crossing probability being much higher near the crests of the slow wave. This would explain the phase preference of skipping, a simple extension to the autonomous case of the mechanisms of skipping with external forcing (Longtin 1995a). Thus, as temperature increases, there is a smooth transition from the bursting patterns with two time scales (neglecting the very fast time scale of the action potentials) to beating and skipping patterns with one time scale (the interspike period).
4.2.2 Membrane Noise and Temperature. Modeling noise in neural systems is a difficult task. Most studies have focused on preparations (such as the squid axon) with simpler dynamics than those of cold receptors. There are many known sources of membrane noise (Stevens 1972). Each kind has its own spectral properties and probability distribution. The theoretical descriptions of these properties rest on various assumptions regarding, e.g., the proximity of membrane potential to its resting value,
the applicability of "equilibrium conditions," and (e.g., in the case of conductance fluctuations) the particular microscopic models used. The temperature dependence of a noise and its coupling to the deterministic equations are also problems. In fact, noise sources are not necessarily independent. The strength of one noise can depend on the mean values of other state variables, which in turn depend on other noise sources. This has been discussed by Lecar and Nossal (1971) in their analysis of threshold responses using a reduced voltage-conductance description with noise on both variables. Thermal current noise can be seen as an additive current noise. For a constant voltage, its spectrum is given by
S(f) = 4kT Re[Z(f)]
where Z(f) is the membrane impedance (DeFelice 1981). If the impedance is that of a parallel RC circuit, the spectrum becomes a constant proportional to temperature. Thermal noise is proportional to the absolute temperature, and thus is expected to increase only by a few percent over the range of interest here (see, e.g., Clay 1977). Conductance noise has been assumed to be the dominant effect on threshold responses in squid giant axon, regardless of temperature (Clay 1977). The intensity of this noise can decrease with temperature. Conductance fluctuations are mostly due to the potassium channel, and their spectrum has been approximated by the sum of a Lorentzian and of 1/f noise (DeFelice 1981). The cutoff of the Lorentzian increases with temperature, signifying an increase in bandwidth. Conductance noise is a multiplicative noise since it affects the Gs in equation 4.2. Electrogenic ion pumps also contribute to the current and voltage noise of cellular membranes (Läuger 1991). The power spectrum of pump current noise has the shape of a sigmoidally increasing function, which is quite different from the Lorentzian-type spectrum of conductance fluctuations. Finally, actual fluctuations in temperature are another source of noise and depend on the precise environment of the cell; we have not found any measurements of these fluctuations. Given the complexity of the situation regarding the modeling of noise in the best of cases (squid axon), and the fact that intracellular data are not available for our study, our model for the noise can only be conjectural. We simulate the membrane noise by a simple additive term in the current balance equation (equation 4.2). It includes pump noise, thermal noise, and the lumped effects of conductance noise. For simplicity the noise is modeled as a continuous-time Ornstein-Uhlenbeck (OU) process η(t). This noise is gaussian distributed and exponentially correlated.
Its spectrum is a Lorentzian with a bandwidth of τc⁻¹ Hz, where τc is the noise correlation time. As the bandwidth and intensity (i.e., variance) D/τc can be adjusted, this noise should provide information about the basic behavior of the Plant model in the presence of stochastic forcing.
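Such an exponentially correlated gaussian noise is easy to generate numerically. The sketch below (ours, not code from the paper) uses the exact one-step update for an OU process obeying τc dη/dt = −η + ξ(t) with ⟨ξ(t)ξ(s)⟩ = 2Dδ(t − s), whose stationary variance is D/τc:

```python
import math
import random

def ou_step(eta, dt, tau_c, D, rng):
    """Exact one-step update of the Ornstein-Uhlenbeck process
    tau_c * deta/dt = -eta + xi(t), with <xi(t) xi(s)> = 2 D delta(t - s).
    Stationary variance is D / tau_c; correlation time is tau_c."""
    decay = math.exp(-dt / tau_c)
    # Standard deviation of the noise accumulated over one step of length dt:
    sigma = math.sqrt((D / tau_c) * (1.0 - decay * decay))
    return eta * decay + sigma * rng.gauss(0.0, 1.0)

def stationary_variance(D, tau_c):
    """Analytic stationary variance of the OU process above."""
    return D / tau_c

# Example with the paper's values: D = 0.0025, tau_c = 1.0 msec.
rng = random.Random(0)
eta = 0.0
for _ in range(1000):
    eta = ou_step(eta, dt=0.1, tau_c=1.0, D=0.0025, rng=rng)
```

Because the one-step update is exact for the OU process at any dt, its statistics do not depend on the step size, which is convenient when the deterministic equations dictate the step.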
4.3 Choice of Parameters. The model with stochastic forcing is governed by the equations:
C_M dV/dt = G_Na m∞^3(V) h (V_I − V) + G_s s (V_I − V) + G_K n^4 (V_K − V) + G_K,Ca [Ca/(0.5 + Ca)] (V_K − V) + G_L (V_L − V) + I_p + η(t)   (4.2)

dh/dt = λ [h∞(V) − h]/τ_h(V)   (4.3)

dn/dt = λ [n∞(V) − n]/τ_n(V)   (4.4)

d[Ca]/dt = ρ [K_c s (V_Ca − V) − Ca]   (4.5)

ds/dt = λ [s∞(V) − s]/τ_s(V)   (4.6)

dη/dt = [−η + ξ(t)]/τc   (4.7)

Here η(t) is the OU process, obtained by low-pass filtering ξ(t), a gaussian white noise: ⟨ξ(t)⟩ = 0 and ⟨ξ(t)ξ(s)⟩ = 2Dδ(t − s) (the quantity 2D is the variance of the white noise). The specific forms for the voltage dependencies of the gating variables and time constants can be found in Plant (1981). Plant's factor of 12.5 in his time constants is not included in our value of λ. The correlation time of the OU process was chosen as τc = 1.0 msec, so that the noise has a larger bandwidth than the fastest events in the deterministic equations. In fact, our parameters (see below) yield action potential durations of 140 msec at T = 17.8°C down to 20 msec at T = 40°C. The time scale of all time series, spectra, and statistics presented below is approximately 25 times longer than that of the data in Figure 1. The idea behind the parameter calibration is to establish a range of values for G_Na, G_K, λ, and ρ centered on those of Plant's original model. This range is determined by the respective Q10s of these parameters: 3 for the λs and ρ, 1.4 for G_Na, and 1.1 for G_K. The parameters are set to values corresponding to T = 40°C, and the pump current I_p is adjusted until skipping is seen. In doing so, we have found that ρ, λ, and the noise intensity have to be in suitable proportions, since too much noise or a value of λ that is too high produces too much burstiness in the skipping. The range of λ was shifted lower and that of ρ higher with respect to Plant's original values. As there are many parameters to contend with, and many other parameters whose temperature dependencies are considered secondary, it is difficult to find the best "temperature path" through parameter space. We have settled for a temperature path that yields reasonable variations of intraburst and interburst periods along with skipping behavior. There is still excessive burstiness, as seen in the
ISIHs of Figure 3 at higher temperatures. This could be remedied by spending more time adjusting the parameters, such as working with a lower value of G_s. An adequate skipping pattern with many modes is found for T = 40°C with λ = 4.0, ρ = 0.0017, G_Na = 7.84, G_K = 0.363, I_p = −0.05, and D = 0.0025 with a correlation time τc = 1.0. These parameters (except for D and τc, which are kept fixed) are then extrapolated back to values below T = 20°C using their Q10s:

λ(T) = 4 (3)^((T − 40)/10)   (4.8)

ρ(T) = 0.0017 (3)^((T − 40)/10)   (4.9)

G_Na(T) = 7.84 (1.4)^((T − 40)/10)   (4.10)

G_K(T) = 0.363 (1.1)^((T − 40)/10)   (4.11)
We have found that the lowest temperature for which the sequence in Figure 1A can be obtained with this scheme is T = 17.8°C for λ = 0.35, ρ = 0.00015, G_Na = 3.7, G_K = 0.294, and I_p = 0.03. For T < 17.8°C, the number of spikes per burst begins to decrease. I_p is then given a linear dependence on temperature between its values for T = 17.8°C and T = 40°C:

I_p = −0.0036 T + 0.094   (4.12)
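For reference, the temperature path of equations 4.8-4.12 can be tabulated in a few lines (a sketch of ours; the function and key names are not from the paper):

```python
def q10_scale(value_40, q10, T):
    """Extrapolate a parameter from its value at T = 40 C using its Q10."""
    return value_40 * q10 ** ((T - 40.0) / 10.0)

def plant_path(T):
    """Temperature path through parameter space (equations 4.8-4.12)."""
    return {
        "lambda": q10_scale(4.0, 3.0, T),     # eq. 4.8
        "rho":    q10_scale(0.0017, 3.0, T),  # eq. 4.9
        "G_Na":   q10_scale(7.84, 1.4, T),    # eq. 4.10
        "G_K":    q10_scale(0.363, 1.1, T),   # eq. 4.11
        "I_p":    -0.0036 * T + 0.094,        # eq. 4.12
    }
```

Evaluated at T = 17.8°C, this reproduces the values quoted above (λ ≈ 0.35, ρ ≈ 0.00015, G_Na ≈ 3.7, G_K ≈ 0.294, I_p ≈ 0.03), and at T = 40°C it returns the calibration values, including I_p = −0.05.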
Intermediate values of the parameters are then computed for T = 20, 25, 30, 35, and 37.5°C. Numerical integration is performed using a fourth-order Runge-Kutta method coupled to the integral algorithm of Fox et al. (1988) for the integration of the OU process. This method requires a fixed time step to allow proper sampling of the stochastic forces. Unfortunately, a fixed step method makes the stochastic simulations of the full model very long. The time step is 5 × 10⁻ sec, although results at higher temperatures, where the spiking dynamics are faster, are more accurate with 2.5 × 10⁻. The resulting firing patterns are described in the next section.

5 Simulated Firing Patterns
5.1 Comparisons with the Data. Simulated time series of the voltage variable for our "extended Plant model" are shown for different temperatures in Figure 2. By assuming that each upstroke reaching the threshold value 0 mV corresponds to a propagated action potential, a comparison between these solutions and the spike trains of Figure 1 is possible. The simulated ISIHs corresponding to the parameters used in Figure 2 are shown in Figure 3. The spike trains and ISIHs are in excellent agreement with those of Figure 1. They follow the same sequence as temperature is increased,
Figure 2: Time series of the membrane potential at different constant temperatures obtained from numerical simulations of Plant's model (equations 4.2-4.7) with electrogenic Na/K pump and noise. The experimentally measured patterns of Figure 1 are qualitatively reproduced (within a time-scaling factor of ≈ 25) by varying the following parameters with temperature (see equations 4.8-4.12): the rate constant λ of the two gating variables h and n, and of the slow inward current s; the global rate constant ρ for the dynamics of intracellular calcium concentration; the maximal conductances G_Na and G_K; and the electrogenic pump current I_p. The noise intensity is fixed at D = 0.0025 for all temperatures. Its correlation time is τc = 1.0 msec. Integration time step is 5 × 10⁻ sec.
and their statistics, compiled in Figure 4, are very similar to those for Figure 1A (see Braun et al. 1980). The decay rate of the ISIH envelope is not as sharp in our simulation as it is in the data. A small variation in the temperature, or in the noise intensity, will produce an ISIH with a similar decay rate. It is worth pointing out again that one single fiber does not typically exhibit all the patterns in Figure 1A, and that there is much variability across fibers (Section 1). A simulation at larger noise intensity such as D = 0.01 yields slightly broader ISIH peaks, and a bit of skipping is then also seen at T = 35°C (not shown). Further, the
Figure 3: Interspike interval histograms obtained from numerical integration of the Plant model at different temperatures. Parameters are the same as in Figure 2. The ISIHs are constructed from 100 realizations of the stochastic process (equations 4.2-4.7), each comprising 5 × 10⁵ time steps after transients have died out. In the deterministic case, the ISIHs have a finite number of singular peaks. The maxima of the bins corresponding to fast intraburst spiking are 4644 at T = 25°C, 6139 at T = 30°C, 9627 at T = 35°C, and 3651 at T = 40°C.

number of counts at T = 40°C then increases in the lower modes, i.e., fewer oscillation cycles are skipped between firings (see Section 8.1). The interburst period shortens as temperature increases (Fig. 4A), causing the mean firing rate (computed from the inverse of the mean of the ISIH) to increase (Fig. 4B). At higher temperatures (i.e., for T > 37°C), the mean firing rate decreases again, due to skipping and to the decrease in the number of spikes per burst (Fig. 4C). The curve in Figure 4B is similar to ones published in Bade et al. (1979), although most of their mean rate curves decrease over a slightly larger (although variable) 5-10 degree range. Agreement would be even better if we could adjust the
Figure 4: Statistics of simulated firing patterns as a function of temperature. (A) Number of spikes per burst; (B) burst frequency (BF; reciprocal of the interburst period), corresponding to the frequency of the subthreshold oscillation; (C) mean firing rate (reciprocal of the mean of the ISIH). The power spectrum of the spike trains is used to estimate BF. An example of such a spectrum for T = 35°C is shown in (D). The power spectra are obtained by averaging the spectra from 100 realizations of the same duration as those described in Figure 3 for the ISIHs. The alias-free spectra with a flat spectral window were obtained by convolving each delta-function spike with a sin(2πf_s t)/(2πf_s t) function, according to the method of French and Holden (1971).

parameters in a way that did not produce slightly higher numbers of spikes per burst; nevertheless, the quasilinear shape of the number of spikes per burst vs. T (Fig. 4C) agrees with the experiment. The noise is seen to not significantly affect the mean number of spikes per burst, except above 35°C and below 17.8°C. The burst frequency in Figure 4A was measured from the power spectrum of the spike trains (i.e., the sequence of spiking events, not the voltage time series), averaged over many realizations of our stochastic model. The method of French and Holden (1971) was used to generate spectrally flat alias-free estimates of the power spectra. An example of such a power spectrum for T = 35°C is shown in Figure 4D. Note that it
is not accurate to use the ISIH to estimate the frequency of the pattern when bursting is present. The interval corresponding to the first mode is then always shorter than the mean period of the oscillation since it lacks the contribution of the short intraburst ISIs. It is interesting to see how smoothly the frequency of the slow wave varies with temperature, as in the data of Braun et al. (1980) where it is also almost linear. This frequency probably conveys important information about temperature. Also, there is little difference between this curve and the one in the noiseless case, i.e., this frequency is very robust to noise. The mean amplitude of the slow wave increases with temperature, but decreases again at the high temperatures. This variation is small (Fig. 2), but together with the DC shifts due to the pump and the effects of the other thermosensitive parameters, it determines the different bursting and skipping patterns. We found that it is important to increase the ratio of G_Na to G_K with temperature. Doing so increases the number of spikes per burst. If it were kept constant, the active phase would shorten and drop out more quickly. In other words, increasing this ratio increases excitability, and slows down the progression through the sequence bursting-beating-skipping. In our model, the increase in the rate constants of the slow subsystem with temperature is responsible for the variation in the period of the slow wave, and thus of the bursting pattern (Fig. 4A). Increasing ρ by itself also decreases the period of the slow wave. In view of the excellent agreement between our model and the data, it is tempting to conclude, if indeed there are two slow variables, that ρ does have a Q10 comparable to that of the gating variables. At cold temperatures, our model predicts that the bursting activity simply ceases (around T = 13°C), after the number of spikes per burst has declined from its value at T = 17°C.
The bursting period of R15 and other pacemaker neurons has also been found to increase, as in our model, as the cell cools down (Moffett and Wachtel 1976), and bursting ceases when the temperature is too low. It appears that cessation of firing in our study is a result of decreased excitability of the fast dynamics, since the slow wave amplitude is still large. Some solutions appear to be chaotic, with a random number of spikes per burst. While the low temperature patterns are stable to noise, spikes are randomly deleted by the noise as T decreases below 17.8°C (with parameters varying according to equations 4.8-4.12). As mentioned in Section 1, 50% of fibers exhibit irregular bursts at low temperature, while the remaining ones are either silent or burst regularly throughout the low temperature range. Depending on whether the temperature is low or very low, our model can exhibit either regular or irregular patterns. It is likely that other effects also come into play, such as the deactivation of the Na-K pump at low temperature (Willis et al. 1974). Further investigation of this low temperature transition is warranted.
The effect of increasing extracellular Ca²⁺ on pacemaker cells can be very complicated, as it may impact the dynamics of other currents. But assuming the main effect is an increase in V_Ca (by the Nernst formula), the result is a hyperpolarization of the slow wave. This in turn decreases the number of spikes per burst, effectively converting a bursting cold fiber into a nonbursting one, as observed experimentally in Schäfer et al. (1982). We have not investigated the potentially more significant effect of voltage screening due to this enhanced concentration.
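The Nernst shift invoked here is straightforward to quantify. In the sketch below (ours, with purely illustrative concentrations, not values from the experiments), doubling the extracellular Ca²⁺ concentration raises V_Ca by (RT/2F) ln 2 ≈ 9 mV at room temperature:

```python
import math

R = 8.314    # gas constant, J / (mol K)
F = 96485.0  # Faraday constant, C / mol

def nernst_mV(c_out, c_in, z, T_kelvin):
    """Nernst equilibrium potential in millivolts for an ion of valence z."""
    return 1000.0 * (R * T_kelvin / (z * F)) * math.log(c_out / c_in)

# Illustrative (not measured) concentrations: doubling extracellular Ca2+
# shifts V_Ca upward by (RT/2F) ln 2, about 9 mV at 25 C.
v1 = nernst_mV(10.0, 1e-4, z=2, T_kelvin=298.15)
v2 = nernst_mV(20.0, 1e-4, z=2, T_kelvin=298.15)
```

In the model, the larger V_Ca increases calcium uptake in equation 4.5 and hence the hyperpolarizing K-Ca current, which is the conversion mechanism described above.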
5.2 Mechanism of Skipping. In this section, we discuss the mechanism of skipping in our model. The nonlinear dynamics of the Plant model have been studied by Rinzel and Lee (1987) using a decomposition of the full equations into slow variables (s, Ca) underlying the slow wave and fast variables (V, h, n) governing the action potentials. The slow wave is an autonomous oscillation (independent of spikes), which typically appears at a Hopf bifurcation in the slow subsystem. The active phase of the burst begins when the slow wave reaches the threshold voltage for the activation of the fast inward currents. The rapid firing during that active phase of the burst corresponds to a limit cycle in the fast subsystem. Rinzel and Lee's study emphasized that the likely mechanism for slow wave bursting involves a homoclinic transition rather than a Hopf bifurcation. Near this transition the period of firing varies strongly, and is infinite at the bifurcation point. In contrast, the firing period is finite at a Hopf bifurcation. When the slow wave is suprathreshold, the fast dynamics "riding" this oscillation periodically visit their threshold. Depending on how much time they spend near threshold, i.e., on the rate at which the homoclinic curve is crossed, the duration of the ISIs can vary greatly. For the parameters chosen here as well as in Plant (1981), the ISI increases during the active phase, which is also a property of the data in Figure 1. Next, we consider the case where the slow wave is subthreshold. Since this wave brings the fast subsystem periodically near the homoclinic curve, the probability of crossing the curve is also periodic. This periodically modulated probability underlies the skipping behavior. For the noise levels that produce multimodal ISIHs with reasonable widths, it appears that the phase preference is quite sharp, as this firing probability closely parallels the amplitude of the slow wave.
If the slow wave did not exist, however, a precise time scale for the noise-induced firings would still exist, even in the absence of any deterministic time scale. Such a noise-induced time scale has been shown by Sigeti and Horsthemke (1989) for systems near a saddle-node bifurcation. The ISIH is then gamma-like with a low ISI cutoff, and a well-defined peak. The slow wave here has the effect of introducing a modulation on this basic ISIH, i.e., it makes it multimodal. The details of this mechanism of skipping will be published elsewhere.
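A toy illustration of this mechanism (ours, a caricature rather than the Plant equations): take a subthreshold sinusoidal "slow wave," add gaussian noise, and record upward threshold crossings, with a short refractory time standing in for the fast spike dynamics. Without noise no spikes occur; with noise, some cycles fire and others are skipped.

```python
import math
import random

def crossing_times(amplitude, threshold, noise_sd, periods,
                   dt=0.001, freq=1.0, refractory=0.2, seed=0):
    """Upward crossings of `threshold` by a sine wave plus gaussian noise.

    With amplitude < threshold and noise_sd = 0 there are no crossings;
    noise then induces firings preferentially near the wave's crests."""
    rng = random.Random(seed)
    times, prev_above = [], False
    for i in range(int(periods / (freq * dt))):
        t = i * dt
        v = amplitude * math.sin(2 * math.pi * freq * t)
        v += rng.gauss(0.0, noise_sd)
        above = v > threshold
        if above and not prev_above:
            # Record the crossing unless it falls in the refractory time.
            if not times or t - times[-1] > refractory:
                times.append(t)
        prev_above = above
    return times

# Suprathreshold wave: exactly one firing per cycle (beating).
beating = crossing_times(0.5, 0.3, 0.0, 5)
# Subthreshold wave: silent without noise, skipping with noise.
silent = crossing_times(0.5, 1.0, 0.0, 5)
skipping = crossing_times(0.5, 1.0, 0.2, 50, seed=1)
```

Because firings can only occur near the crests, the intervals in `skipping` pile up near integer multiples of the 1-second wave period, which is the multimodal ISIH signature of Figure 3.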
If only the λs are varied with temperature (with a Q10 of 3), a decrease in the period and number of spikes per burst is still observed (not shown). As temperature increases, the first spike in a burst becomes significantly higher than the others; at the same time the amplitude of all the spikes decreases. Skipping then arises because only the first spike is close enough to the propagation threshold. Then, with noise, the first spike may or may not propagate during a given cycle of the slow wave. But this is a more complicated and less likely mechanism for skipping.

5.3 Spectral Properties of Solutions. The power spectra of bursting neurons can be quite intricate, as is clear from Figure 4D. It is well known that the power spectrum of a repetitively firing pattern, modeled by a train of Dirac δ-pulses of period T₀,
x(t) = Σ_{n=−∞}^{∞} δ(t − nT₀)   (5.1)

is given by a set of delta functions at integer multiples of f₀ = 1/T₀:

S(f) = (1/T₀²) Σ_{k=−∞}^{∞} δ(f − k f₀)   (5.2)
The highest peak in Figure 4D corresponds to the fundamental frequency of the noisy bursting solution. Its harmonics are visible, as expected for a periodic pulsed pattern (equation 5.2). Broad bumps are also seen. In the absence of noise, this and other bursting power spectra exhibit an even greater number of sharp peaks, again with broad bumps. This structure is similar to that seen in spectra of integral pulse frequency modulators (IPFM), for which Bayly (1968) has obtained an exact expression. These IPFMs are integrate-and-fire devices that, with constant input, fire at a precise frequency known as the carrier frequency f_c. This carrier frequency is similar to the high frequency firing during the active phases, and corresponds to the large bump around 5 Hz. This bump is broad because the firing frequency varies during the active phase (see Section 5.2). When an IPFM is driven by a frequency f_m < f_c, the spike train resembles that of a bursting neuron. This modulation frequency and its harmonics appear, producing sidebands on the carrier peaks. These harmonics are similar to the fundamental peak and its harmonics in Figure 4D. The spectra of noisy bursting neurons from Plant's model thus share features with IPFM spectra, but are more intricate. One can calculate from these spectra a signal-to-noise ratio at the fundamental frequency, and the dependence of this ratio on temperature. Preliminary results indicate that this ratio is very high for temperatures below 37°C, but drops significantly when skipping sets in. The characterization of the model spectra and behavior of the signal-to-noise ratio, as well as comparisons to those estimated from the experimental data, will be reported elsewhere.
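Equation 5.2 predicts spectral lines only at integer multiples of f₀, and this is easy to verify numerically on a sampled pulse train (our sketch; the sampling grid and bin indices are illustrative):

```python
import cmath

def dft_bin(x, k):
    """Single DFT bin X[k] of a real sequence x."""
    n = len(x)
    return sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))

# Spike train: delta pulses every 100 samples (period T0), N = 10000 samples,
# so f0 corresponds to DFT bin N / 100 = 100.
N, period = 10000, 100
x = [1.0 if j % period == 0 else 0.0 for j in range(N)]

peak = abs(dft_bin(x, 100))      # at f0: all 100 pulses add in phase
harmonic = abs(dft_bin(x, 200))  # at 2 f0: another spectral line
between = abs(dft_bin(x, 150))   # between harmonics: the phasors cancel
```

Only a strictly periodic train gives this clean comb; for the bursting solutions the lines sit on top of the broad IPFM-like bumps discussed above.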
6 Sources of Pattern Variability
Section 4.3 describes a path through parameter space that yields the sequence of bursting to skipping patterns observed in Figure 1A as the temperature increases. Neighboring paths may or may not yield qualitatively similar results. It is important to understand what other dynamic behaviors exist near this path, because noise will allow the system to sample these behaviors. Exploring the vicinity of this path thus yields information on the sensitivity of the patterns to parameter variations. This in turn indicates how observable a pattern should be in the presence of additive or multiplicative noise. In other words, this exploration helps determine the "volume" of the path corresponding to the observed sequence. This section focuses on sensitivity to noise and parameter variations. Results may shed light on the origin of aperiodic firing for a given cell. Further, they may explain the variability in activity across fibers in the same preparation, and across preparations (since different cells may have different parameter values). It is known, for example, that other receptors, such as the cold fibers of the ampullae of Lorenzini, exhibit different sequences of bursting, beating, and skipping as temperature increases (Iggo and Iggo 1971; Braun et al. 1984a). But their basic ionic mechanisms may be quite similar to those of cat lingual cold receptors. If this is the case, their firing patterns may arise as temperature parametrizes a different path through parameter space (due to different pump activities, ionic concentrations, etc.).

6.1 Critical Slowing Down at High Temperature. We first focus on the influence of noise in the higher temperature range. As temperature increases in the noiseless model, the amplitude of the slow wave first increases, and then decreases for T > 35°C (this is barely visible in the simulations with noise of Fig. 2). At these higher temperatures, the slow wave becomes further hyperpolarized due to the Na/K pump.
This downward shift moves the slow dynamics closer to a Hopf bifurcation at which the slow wave disappears and the slow dynamics converge to a stable fixed point. Since the limit cycle disappears, this bifurcation is sometimes called a reverse Hopf bifurcation. This effect of I_p is illustrated in the left panels of Figure 5, where for simplicity we have fixed all other parameters to their values at T = 40°C. This Hopf bifurcation of the slow dynamics should be distinguished from the homoclinic bifurcation of the fast dynamics, at which the fast spiking arises (Section 5.2). The slow wave frequency varies slowly across the Hopf bifurcation. In Figure 5, the Hopf bifurcation occurs at I_p = −0.068. By comparison, at T = 25°C, we used I_p = 0.004 in Figure 2, a value well beyond that at which the Hopf bifurcation occurs for this temperature (I_p = −0.08). As can be seen with I_p = −0.0675, the decay time of the slow wave to its
Figure 5: Critical slowing down: increased effect of noise on the firing pattern at T = 40°C as I_p decreases. Left panels: deterministic case. Slow-wave amplitude decreases as I_p decreases. I_p = −0.0675 is just above the bifurcation point value (−0.068). Right panels: stochastic case with D = 0.0025. D and τc are the same in each plot. However, the slow wave is increasingly perturbed (the variance of the amplitude becomes larger than the mean amplitude) as the Hopf bifurcation is approached. Only the I_p = −0.04 case has spikes in the absence of noise.

asymptotic amplitude is quite long. In fact, it increases as the bifurcation point is approached, and is infinite at the bifurcation point itself. This loss of stability implies that noise has a greater influence on the solution as I_p decreases, even though the noise intensity is constant. This is shown in the corresponding stochastic simulations in the right panels of Figure 5. This apparent amplification of fluctuations as a bifurcation point is approached is known as "critical slowing down" (see, e.g., Horsthemke and Lefever 1984). The amplification of noise is most obvious for I_p = −0.0675. What this finding implies is that, even though the noise level is assumed constant, the effect of noise will be higher at high temperature. Consequently, the firing probability increases. For example, in going from I_p = −0.05 to I_p = −0.06, the slow wave has become slightly more hyperpolarized, and its amplitude has decreased. These two deterministic effects conspire to abolish all spiking. Nevertheless, spiking is still seen on some cycles at I_p = −0.06, because the "effective" noise intensity is now larger, due to the loss of stability. If the noise were sufficiently intense, skipping could arise even though I_p was below the value at which
the slow wave comes into existence. If I_p only shifted the slow wave downward without bringing the slow dynamics nearer to the Hopf bifurcation, the amplitude and stability of the slow wave, as well as the "effective noise intensity," would change only slightly. Thus, both noise and critical slowing down contribute to extending the encoding range, by allowing spikes to occur over a broader range of physiological parameters. This critical slowing down could occur at other kinds of bifurcations than the supercritical Hopf bifurcation present here, although the implications for encoding may then be different.

6.2 Period Doubling, Chaos, and Skipping. The model exhibits other dynamical behaviors than those discussed up to now. These behaviors occur for parameters in the vicinity of the path defined by equations 4.8-4.12. For example, period-doublings leading to chaotic motion occur for the T = 20°C parameters as I_p increases slightly. Since it is difficult to visualize the bifurcations and chaotic motion from the full bursting solution, we have used a first return map representation of the ISIs (Fig. 6). The first panel, for I_p = 0.022, corresponds to the noiseless version of the firing pattern at T = 20°C. We see that it is in fact the first period-doubled solution of a fundamental bursting pattern occurring for I_p < 0.022. However, the presence of noise produces a pattern with a spectral peak centered on that of the fundamental solution (not shown). The chaotic motion at I_p = 0.03 is manifested in the variable number of spikes per burst from one cycle to the next. Further, there is the issue of multistability, a property found in other models (Chay and Kang 1987; Canavier et al. 1993). If multistability exists in our model, noise can perturb the dynamics from one kind of motion to another coexisting motion. Noise may thus cause a random sampling of different simple and complex patterns, along with their transients.
There is also another kind of chaotic motion, occurring at a higher value of I_p, which can lead to skipping when a small amount of noise is present. This is illustrated in Figures 7 and 8. This chaotic motion is closely related to that studied in Section 7.2 below. As I_p increases, the number of spikes per burst and the amplitude of the slow wave decrease. At some point, the depolarization is not sufficient to cause spiking. Near this point, skipping can arise through stochastic forcing of the chaotic motion (Longtin 1995a). Due to the low amplitude of the slow wave, this form of skipping is not as sharply phase-locked as that at T = 40°C in Figure 3. This is seen by comparing the ISIHs of Figure 3 to the one in Figure 8. It is also apparent that short bursts are sometimes associated with this kind of skipping. Thus, depending on the precise balance of hyperpolarizing and depolarizing influences, skipping may be seen in a given preparation at lower temperatures than in Figure 1 (i.e., lower than 35°C or so). This may explain some of the differences between the firing patterns of cold-sensitive fibers of ampullae of Lorenzini, boa warm receptors, and cat cold bursting and nonbursting receptors (Braun et al. 1984a).
Figure 6: First return maps of interspike intervals at three values of the pump current I_p in the absence of noise. The other parameters are those used for T = 20°C in Figure 2. As the successive ISIs vary widely, a connected log-log plot was used to represent the solutions. A period-doubling cascade occurs as I_p increases, with chaotic bursting when I_p = 0.03. These behaviors are also found at other temperatures.
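Return maps like those in Figure 6 take only a couple of lines to construct from a list of interspike intervals (our sketch):

```python
def return_map(isis):
    """Pairs (ISI_n, ISI_{n+1}) for a first return map of interspike intervals."""
    return list(zip(isis[:-1], isis[1:]))

# A period-2 pattern (alternating short intraburst and long interburst
# intervals, arbitrary units) collapses onto two points of the map, while
# chaotic bursting fills out a scattered set of points.
pairs = return_map([0.1, 2.0, 0.1, 2.0, 0.1])
distinct = set(pairs)
```

On a log-log plot, as in Figure 6, the wide spread between intraburst and interburst intervals remains readable.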
6.3 Noise-Induced Bursting from a Beating State. Figure 9 presents another kind of firing pattern that may be relevant to the question of variability across preparations. This noise-induced bursting occurs for T = 40°C with a high value of I_p rather than the low one (−0.05) used in Figure 2. In the noiseless case, the slow wave amplitude and frequency decrease as I_p increases, while the duration of the active phase increases due to a growing asymmetry in the shape of the oscillation. At I_p = 0.039 the successive active phases merge, and high frequency beating (i.e., periodic firing) ensues. When D > 0 and I_p = 0.04, the slow wave that exists for I_p < 0.039 becomes "sampled" by the noise: bursting is induced
Figure 7: Bursting patterns at T = 25°C as I_p increases. When D = 0, the number of spikes per burst as well as the peak-to-peak amplitude of the slow wave are reduced as I_p increases. The solution with I_p = 0.06 and D = 0 has a very long period and is probably chaotic. At I_p = 0.06 and D = 0.0025, a skipping pattern appears.
by the noise. The mean interburst period decreases as D increases over
Figure 8: ISIHs for noise-induced skipping in the vicinity of chaotic motion (refer to Fig. 7) for T = 25°C and Ip = 0.06. Note that the distribution of intervals when D = 0 is continuous rather than singular: the solution is probably chaotic. The modes of this ISIH are considerably broader than those shown in Figure 3.
7 Deterministic Skipping and Gslow
In our model, temperature increases the rate of the activation kinetics of the slow inward current. In this section, the effect of also varying its maximal conductance Gs (Gslow in the following) is described. This can lead to a form of skipping without noise (deterministic skipping). The possibility that the skipping seen in cold receptors is of deterministic origin, and that the sequence in Figure 1A is mostly determined by Gslow, is discussed in the context of the recent model of bursting by Chay and Fan (1993). Studying the effect of Gslow also suggests possible mechanisms for the paradoxical cold response.
[Figure 9 panels, top to bottom: Ip=0.03, D=0.0025; Ip=0.04, D=0; Ip=0.04, D=1e-5; Ip=0.04, D=0.0025; membrane potential (mV) versus time (sec).]
Figure 9: Noise-induced bursting at high Ip for T = 40°C. When D = 0, the duration of the active phase increases with Ip. At Ip = 0.039, the successive active phases merge. When D > 0 and Ip = 0.04, the slow wave that exists for Ip < 0.039 is sampled by the noise. Near this bifurcation, the noise sets the mean time scale of the bursting it induces. Variations in spike heights are a plotting artifact due to decimation of the large number of points required to represent a solution.

7.1 Thermosensitivity of Gslow. It has been reported that the slow inward current responsible for the negative slope resistance of pacemaker cells is dependent on temperature (Wilson and Wachtel 1974; Adams and Benson 1985). This means that its activation rate and/or its inactivation rate and/or its maximal conductance may vary with temperature. In our model, the kinetic rate of activation was given the same Q10 as that of the other activation variables (equation 4.6). This current was chosen as noninactivating, as discussed in Section 4.1. Its maximal conductance
(Gs in equation 4.2) was kept constant, as its variation with temperature has been considered secondary to those of GNa and Gh, as discussed in Section 3. It has also been reported (Nobile et al. 1990) that the calcium currents in chick dorsal root ganglion neurons (containing the cell bodies of different kinds of sensory neurones) can have high Q10s. LVA-type channels have lower Q10s than those of HVA type. The reported values for LVA are 1.7 for maximal conductance, 1.9 for activation, and 2.2 for inactivation. The channels gating the slow currents in cold receptors have been reported to have characteristics that are more of the LVA than HVA type (Schäfer et al. 1988). The firing patterns produced by our model are sensitive to Gslow. Varying this parameter along with the other parameters produces some correspondence with Figure 1A, especially if Gslow starts at a lower value. However, the range of correspondence is shortened. It is likely, if Gslow does indeed vary with temperature, that a more elaborate parameter variation scheme is required to reproduce the sequence. The sensitivity of the model to Gslow may then partly explain why fibers usually do not exhibit the whole gamut of firing behaviors shown in Figure 1A. Figure 10 shows the effect of increasing Gslow at T = 40°C. An increase from 0.01 to 0.011 produces a transition from skipping to bursting. This bursting is deterministic since it occurs also when D = 0. A further increase in Gslow to 0.012 produces a merging of the active phases (as in Fig. 9), and high frequency beating ensues. It is tempting to draw an analogy between this renewed firing at high temperature and the paradoxical cold response (Hensel 1974). This response of cold fibers to warm temperatures normally occurs after cessation of firing. Increases in Gslow could then be involved in the increased mean firing rate after cessation around 45°C.
Comparison with data is not possible at present as the temporal firing patterns of this paradoxical response have not been studied in detail (Hans Braun, personal communication). Another interesting finding is that our model can produce deterministic skipping for smaller values of Gslow. This is shown in Figure 11. This occurs over a narrow range of values of Gslow. This skipping is very sharply phase-locked. Addition of noise to the dynamics produces an ISIH with a smoothly decaying envelope, as seen in the data. This model behavior, found at different temperatures, may account for some of the aperiodicity and response variability across fibers.
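The Q10 values quoted above enter such models through the usual scaling rule, rate(T) = rate(Tref) · Q10^((T − Tref)/10); a minimal sketch, where the reference temperature and the reference rate are chosen purely for illustration:

```python
def q10_scale(value_ref, q10, temp_c, temp_ref_c):
    """Scale a kinetic rate (or conductance) from temp_ref_c to temp_c by its Q10:
    value(T) = value(Tref) * Q10 ** ((T - Tref) / 10)."""
    return value_ref * q10 ** ((temp_c - temp_ref_c) / 10.0)

# LVA-type Q10s quoted in the text: 1.7 (maximal conductance),
# 1.9 (activation), 2.2 (inactivation).
act_rate_20 = 1.0                                        # illustrative rate at 20 deg C
act_rate_30 = q10_scale(act_rate_20, 1.9, 30.0, 20.0)    # one decade warmer
```

A decade of warming multiplies the rate by exactly its Q10, which is why a spread of Q10s (1.7 versus 2.2) across activation, inactivation, and conductance can reshape the firing pattern rather than merely rescale time.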
7.2 The Bursting Model of Chay and Fan. The Chay and Fan (1993) (CF) model of bursting was motivated by the search for drug treatments of certain irregular activities in the brain. We have chosen to study this model because it points to other possible mechanisms for the transitions between firing patterns in Figure 1A. In particular, it suggests that chaotic dynamics may underlie the skipping behavior. This model has a slow
Figure 10: Effect of increasing the maximal conductance Gs of the slow current in equation 4.2 with temperature. Other parameters are as in Figure 2 for T = 40°C. Deterministic bursting, followed by high frequency beating, can be recovered from the skipping regime by increasing Gs. This behavior may contribute to the paradoxical cold effect.

inward current Islow given by

Islow = Gslow d f (V - Vslow)     (7.1)

The activation variable d and the inactivation variable f are voltage- and time-dependent:

dy/dt = [y∞(V) - y] / τy(V)     (7.2)

where y stands for either d or f. It also has Hodgkin-Huxley-type fast action potential dynamics:

(7.3)
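The relaxation in equation 7.2 can be integrated with an exponential-Euler step; in this sketch the Boltzmann steady-state curves and their half-activation parameters are illustrative assumptions, not the published CF values (only the ratio τd = 1.0 versus τf = 40.0 is taken from the text):

```python
import math

def boltzmann(v, v_half, slope):
    """Illustrative steady-state gating curve y_inf(V) (assumed form)."""
    return 1.0 / (1.0 + math.exp(-(v - v_half) / slope))

def gate_step(y, v, v_half, slope, tau, dt):
    """One exponential-Euler update of dy/dt = (y_inf(V) - y) / tau,
    exact for voltage held constant over the step."""
    y_inf = boltzmann(v, v_half, slope)
    return y_inf + (y - y_inf) * math.exp(-dt / tau)

# Activation d (tau = 1.0) equilibrates ~40x faster than inactivation f
# (tau = 40.0), as stated for the CF model; half-activations are made up.
d, f = 0.0, 1.0
for _ in range(200):                 # 2 s at a clamped voltage, dt = 0.01
    d = gate_step(d, -20.0, -25.0, 5.0, 1.0, 0.01)
    f = gate_step(f, -20.0, -35.0, -5.0, 40.0, 0.01)
```

After two seconds at a fixed voltage, d has nearly reached its steady state while f has barely moved, which is the time-scale separation that lets the slow inactivation shape the burst envelope.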
Figure 11: Deterministic skipping when the maximal conductance Gs in equation 4.2 is lowered from its value of 0.01 (used up to now) to 0.009548. Other parameters are as in Figure 2 for T = 35°C. Left: The skipping occurs in the absence of noise, near the onset of bursting. Right: Deterministic skipping in the presence of noise (D = 0.0025).
where n and k are also governed by equation 7.2. The time constants are τm = 0.0, τk = 0.2, τn = 0.3, τd = 1.0, and τf = 40.0 (c = 1). Thus the activation of Islow is slower than the kinetics of n and k, but 40 times faster than the inactivation of Islow. The CF model, slightly modified from previous models studied by this group, is also of the slow-wave bursting type with only one dominantly slow variable, as opposed to two in Plant's model. To our knowledge, an analysis of the CF model in terms of fast and slow submanifolds and pseudo-steady states, as Rinzel and Lee (1987) have done for Plant's model, has not been published. When Ifast = 0, the five-dimensional CF model undergoes a Hopf bifurcation to a low-amplitude slow-wave oscillation as Gslow reaches a value near 10, and a reverse Hopf bifurcation when Gslow reaches a value near 16.5 (Chay and Fan 1993). This slow wave underlies the bursting pattern when Ifast ≠ 0. The fast dynamics and the activation kinetics of Islow in CF are similar to those in Plant's model. In this latter model, Islow does not inactivate. Rather, this current turns off when the voltage decreases due to the calcium-activated K+ current. Rinzel and Lee's (1987) analysis of Plant's model shows that significant qualitative changes in behavior are not expected if instead one assumes calcium inactivation of Islow. The CF model differs from these two alternatives in that the inactivation directly and
solely depends on voltage and time. However, the main and important difference between the CF model and that of Plant is that [Ca2+]i is not a state variable in CF. Thus the CF model applies to preparations in which [Ca2+]i is not thought to play an essential role in the genesis of bursting. Figure 8 of Chay and Fan (1993) shows a sequence of firing patterns for increasing Gslow over a small range. This sequence is similar to that seen in Figure 1A. While it is not surprising that other models of slow-wave bursting give rise to similar transitions, it is of great interest that this model can produce skipping without noise. Given that Gslow may be temperature dependent (Section 7.1), this raises the interesting possibility that the firing patterns of cold receptors are largely determined by variations in Gslow. While this has not been proposed as a primary mechanism in the literature on cold receptors (summarized in Section 3), we feel nevertheless that it should be seriously considered. This is further warranted by the fact that, although the literature on cold receptors strongly suggests that [Ca2+]i does play a role in bursting, its involvement has not been solidly confirmed. Consequently, it would be worthwhile to investigate this model in the context of what is known about mammalian cold receptors, i.e., by varying all of the putative thermosensitive parameters, and not just Gslow. We do not attempt a full analysis of the CF model using a parameter variation scheme as in Section 4.3. Rather, we consider the transition from beating to skipping, and compare the solutions and ISIHs to those in Figure 1. We have constructed the ISIHs for five values of Gslow in the range of interest, both without noise, and with a moderate amount of noise (D = 10; the scaling is different in CF, thus the higher values of D). The results are shown in Figure 12. The main features of beating and skipping are visible in the ISIHs obtained with D = 0.
The skipping solutions appear indeed chaotic (not shown). Both beating and skipping are accompanied by significant bursting; this can probably be removed by parameter adjustment. There are, however, clear differences with the experimental data. The simulated ISIHs have more structure than those in Figure 1, and exhibit less phase-locking. The structure for D = 0 is due to the chaotic motion. It is partly smoothed out by noise (Fig. 12, right panels). However, some structure beyond that seen in the data still remains despite the presence of noise, such as the asymmetry and splitting of the first mode associated with the slow wave period. The reason why phase-locking decreases as Gslow increases in the CF model is that the slow wave amplitude is decreasing to zero. This decreased phase-locking is similar to that seen in our Figures 7-8 for noise-induced skipping near chaotic motion, at which the slow wave amplitude is small. In contrast, our model produces multimodal ISIHs with the proper structure and a good degree of phase-locking (the peaks are very clearly separated, as in the data). This is because the pump current shifts the slow wave downward, and noise-induced bursting occurs when the slow
Figure 12: Deterministic skipping in the slow wave bursting model of Chay and Fan (1993) as their maximal conductance Gslow varies over a small range. Left panels: D = 0. The progressive loss of spikes follows the decrease in amplitude of the slow wave. This is accompanied by loss of phase-locking. Most solutions in this range appear to be chaotic, producing multimodal ISIHs with a more complicated structure than those for the noise-induced skipping case (Fig. 3, T = 40°C). As Gslow increases above 16.0 the spikes disappear. Right panels: D = 2.5, τs = 0.01. The ratio of τs to the fastest time constant in CF is similar to that used for our stochastic simulations of Plant's model. The structure in the noiseless ISIHs is partially smoothed out by the noise.
wave amplitude is large. Further, for slight changes in the calcium concentration, the experimental ISIHs have many peaks (up to eight), and there is still sharp phase-locking. It is difficult to see how chaotic skipping riding a low-amplitude slow wave as in the CF model could produce this effect. Our model can easily produce skipping ISIHs with many modes.
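The multimodal ISIHs compared throughout are just histograms of interspike intervals; a sketch of the construction, with synthetic intervals clustered near integer multiples of a basic slow-wave period standing in for simulated spike trains (the period and jitter are illustrative):

```python
import numpy as np

def isih(intervals, bin_width, t_max):
    """Interspike-interval histogram: counts per bin of width bin_width up to t_max."""
    edges = np.arange(0.0, t_max + bin_width, bin_width)
    counts, _ = np.histogram(intervals, bins=edges)
    return counts, edges

rng = np.random.default_rng(0)
T0 = 10.0                                    # basic slow-wave period (illustrative)
cycles = rng.integers(1, 5, size=2000)       # number of cycles waited per interval
intervals = cycles * T0 + rng.normal(0.0, 0.4, size=2000)   # phase-locked jitter
counts, edges = isih(intervals, bin_width=1.0, t_max=50.0)
# Modes sit near 10, 20, 30, 40 s with nearly empty bins between them --
# the "clearly separated peaks" that signal sharp phase-locking.
```

The degree of phase-locking can be read off directly: the narrower the modes relative to the gaps between them, the sharper the locking to the slow wave.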
If Gslow is lowered below 14.0, the CF solutions go through an inverse period-adding sequence, in which a bursting solution bifurcates to another bursting solution with one less spike per burst. This behavior is different from the one seen in Figure 1A as temperature decreases from T = 30°C. Also, the proper variation of the bursting periods is not reproduced. It is expected that concomitant variation of the kinetic rates, especially those for Gslow, is necessary to produce proper variations in the slow wave period. This will perhaps also produce more sharply phase-locked deterministic skipping. We conclude that it would be very interesting to pursue the study of the CF model in the context of cold thermoreception. It is likely that other parameters have to be varied along with Gslow, and that noise has to be coupled to the dynamics, if this model is to agree with the data to the extent that our extended Plant model does. The appeal of the CF model lies in its conceptual simplicity compared to that of Plant, since intracellular calcium dynamics are not present (both models nevertheless have five dynamical variables). It should be mentioned that CF also gives a paradoxical cold response as Gslow is increased past 17.0. The effect of noise on the CF model is further discussed at the end of Section 8.1.

8 Role of Noise in Skipping and Coding
Our study suggests that noise accounts for much of the aperiodicity observed in the firing patterns. However, chaotic bursting and skipping as well as the effect of noise in the vicinity of bifurcations cannot be ruled out, especially as bifurcations and chaotic motion are common features of models of bursting (Chay et al. 1995). Whatever the source of aperiodicity, the fact remains that this aperiodicity probably plays a role in the encoding of stimuli.

8.1 Subthreshold and Suprathreshold Skipping. The interaction of noise with a subthreshold slow wave can produce skipping. This form of skipping arises from noise-induced phase-locking, as in Figure 5 with Ip = -0.05. In our model, this occurs for T > 37°C. Skipping can also arise when noise perturbs a deterministic phase-locked pattern, as in Figure 5 with Ip = -0.04. In this case, the slow wave is suprathreshold since firing occurs without noise. In our model, this occurs for T < 37°C. This form of skipping has also been found for the stochastic version of the FitzHugh-Nagumo neuron equations with periodic forcing by Longtin (1995b). This latter study shows that it is possible to experimentally distinguish between the two forms of skipping if the noise level can be varied, e.g., by using an external noise source as in Douglass et al. (1993). In the subthreshold case, increasing the noise will always cause the ISI probability to spread to lower multiples of the basic period, i.e., it will reduce skipping. This property of the noise-induced skipping occurs
because a larger noise reduces the escape time to the firing threshold. In the suprathreshold case, increasing D will first perturb the phase-locked pattern, with ISIs spreading out to the higher modes of the ISIH. Past a certain value of D, the ISIs will move back to the lower modes. For example, one way to obtain skipping similar to that seen at T = 40°C is to increase the noise intensity at T = 35°C (the basic periods will of course be different). While the period of the T = 35°C pattern will not change, there will be a spread of the probability to larger ISIs, characteristic of the suprathreshold case. Hence, the transition from bursting to skipping is not clear cut, in the sense that skipping does not necessarily imply a subthreshold oscillation. But it is clear that noise increases the range of parameters where firings can occur, and thus extends the encoding range. In the case of deterministic skipping studied in the CF model (Section 7.2), preliminary results indicate that the effect of noise is not systematic. For example, noise will slightly increase skipping for Gslow = 15.25 and 15.5, but not for the other values investigated. These results suggest that the method of distinguishing between different origins of skipping using noise can be extended to the deterministic skipping case.
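The subthreshold half of this diagnostic can be caricatured without the ionic model: treat each slow-wave cycle as an independent noise-driven escape attempt, so that raising D raises the per-cycle firing probability and shifts ISI probability toward lower multiples of the basic period. All numbers below are illustrative, and the per-cycle independence is a simplifying assumption, not the model's dynamics:

```python
import numpy as np
from math import erf, sqrt

def firing_prob(amp, theta, D):
    """Chance that Gaussian noise of s.d. D lifts the slow-wave peak amp over
    the firing threshold theta on a given cycle: P(amp + noise > theta)."""
    return 0.5 * (1.0 + erf((amp - theta) / (D * sqrt(2.0))))

def sample_isis(amp, theta, D, period, n, seed=0):
    """Caricature of subthreshold skipping: the number of cycles waited before
    a firing is geometric, so each ISI is an integer multiple of the period."""
    p = firing_prob(amp, theta, D)
    rng = np.random.default_rng(seed)
    return rng.geometric(p, size=n) * period

# Subthreshold slow wave (amp < theta): higher D concentrates ISI probability
# on the lower multiples of the period, i.e., it reduces skipping.
isis_lo = sample_isis(amp=0.8, theta=1.0, D=0.1, period=1.0, n=5000)
isis_hi = sample_isis(amp=0.8, theta=1.0, D=0.3, period=1.0, n=5000, seed=1)
```

The geometric waiting time is exactly why subthreshold skipping responds monotonically to D; the suprathreshold case needs the full phase-locked dynamics and is not captured by this caricature.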
8.2 Sensitivity to Stimuli and Noise. Wiederhold and Carpenter (1982) have suggested a role for regular firing patterns such as bursting from the point of view of sensory encoding. They argue that sensory cells might avoid regular spontaneous firing (such as bursting) if they are to encode stimuli at frequencies close to that of the regular activity. For example, if an auditory cell fired regularly (they fire very irregularly), a stimulus at a frequency near its mean spontaneous frequency could not easily change this mean rate; the encoding capability would be diminished. In contrast, cells with regular spontaneous firing (such as bursting) could encode weak stimuli through the modulatory effect of these stimuli on the regular activity. Thus, the stimuli would not have to exceed threshold to be encoded, since the cell is already biased into a suprathreshold region. Our study of mammalian cold receptors suggests an interesting expansion on this point. Weak temperature stimuli are readily encoded through their effect on the bursting period, the duration of the active phase, and the number of spikes per burst. The dynamic response (i.e., transients) further enhances this sensitivity (Braun et al. 1990). This sensitivity is also present at high temperature, even though the slow wave is subthreshold. This has been shown in recent theoretical (Longtin 1993; Chialvo and Apkarian 1993) and experimental (Douglas et al. 1993) studies of neurons driven by periodic forcing and noise. When such neurons are biased into their subthreshold regions, noise can enhance the expression of a small periodic signal through an effect known as "stochastic resonance." This occurs when the time scale of firing imposed by the
stimulus becomes commensurate with the mean firing time in the absence of stimulus. Neurons that exhibit this effect also exhibit skipping. Further, characteristics of the ISIH, such as the rate of decay of the envelope, are very sensitive to parameters such as stimulus amplitude, frequency, and noise intensity (Longtin et al. 1994). The multimodal ISIHs in our model are also very sensitive to, e.g., the static temperature and the noise intensity, even though the "periodic signal" is endogenous rather than external. The noise helps express the frequency of the underlying slow wave when it is subthreshold. Thus, the sensitivity of regular activity can extend to skipping.

8.3 Deterministic versus Stochastic Coding. The issue of whether stochastic coding or temporal coding is used by the brain is a burning question, especially when it is addressed to cortical information processing (see Shadlen and Newsome 1994, for a current review; Usher et al. 1994). Trying to answer such questions requires that a precise meaning be ascribed to "precise timing" and "stochastic." In the case of cold receptors, our study suggests that the code combines deterministic and stochastic components. The precise timing of spikes is seen in the predictability of firing times that characterize cyclical patterns such as bursting and beating. However, there are fluctuations within these patterns in the exact times at which the firings occur. For example, the interspike intervals in a burst do not repeat exactly from one burst to the next; likewise, the number of spikes per burst and the time between bursts fluctuate. At high temperatures, the precise timing is seen in the persistence of the phase-locking to the slow wave, but the number of cycles between firings is random. The probability of firing may itself be part of the code, as suggested by Scheich et al. (1973) in the context of skipping cells known as "probability coders" in weakly electric fish.
Modeling of the next stages of processing of thermal information all the way up to the hypothalamus may be needed before a clear understanding of the interplay of deterministic and stochastic aspects of the code is achieved. If the skipping pattern is indeed relevant to the coding by cold receptors, then our model suggests that noise is an important component of this code. It endows the cold receptor with a continuous variation in firing pattern as temperature varies (this occurs also by smoothing out, e.g., period-doubled solutions as in Figs. 2 and 6). Without noise, there would be no skipping in our model over the 5-10°C range where it is measured. Clearly, too much noise would destroy the multimodal pattern. In the deterministic skipping case, an optimal amount of noise also appears to be needed to produce a smooth ISIH. Thus, this sense may have accommodated to an amount of noise that allows a sufficient dynamic range along with a reasonable signal-to-noise ratio. The precise sense in which noise could be used optimally by cold receptors will be
investigated elsewhere. Suffice it to say that this encoding scheme at higher temperatures is similar to that seen over a wide range of stimulus amplitudes in other thermal noise-limited senses such as the auditory system.

9 Conclusion

9.1 Summary of Results.

- Our model of mammalian cold receptors accounts for the main temporal and statistical features of the sequence of firing patterns shown in Figure 1A (Section 5.1). It is necessary to vary seven parameters concomitantly to obtain this agreement.
- The model incorporates the putative thermosensitive mechanisms discussed in the physiological literature on cold receptors (Section 3).
- Based on Plant's (1981) five-dimensional ionic model of bursting in the R15 pacemaker cell of Aplysia, the model provides a framework for the oscillatory theory of transduction by these receptors (Braun et al. 1980, 1984b). Skipping arises here through noise-induced firing from a subthreshold slow wave oscillation in the receptor.
- We have studied the variability (Section 1) of the firing patterns seen across fibers, and across preparations, by exploring behaviors of our model in the vicinity of the parameter path defined by equations 4.9-4.13. Section 6 reports our findings on noise-induced beating from a bursting state, on period-doubling sequences, and on noise-induced skipping from a chaotic low amplitude slow wave.
- Our assumption of constant noise intensity is compatible with the increasing importance of noise at higher temperatures. This is due to the loss of stability of the slow wave at high temperature (Section 6.1).
- Our model suggests that spikes drop out at higher temperatures as a consequence of hyperpolarization of the slow wave (Section 5.2). It is known that action potentials can also be quenched at high temperatures ("heat block"; see Hodgkin and Katz 1949; Huxley 1959). This occurs when the rate of rise of the spike cannot keep up with the rates of change of the permeabilities that lead to recovery. Skipping does not appear to be a form of intermittent heat block. At very low temperatures, noise has an increased effect on pattern variability, as it affects the number of spikes dropping out of the bursting pattern (Section 5.1).
- The physiological evidence for our model is indirect as it derives from extracellular recordings. In view of the diversity and complexity of the ionic dynamics underlying the firing patterns of bursting cells (Adams and Benson 1985; Canavier et al. 1991; Chay et al. 1995), it would not be surprising that other models with different currents and/or temperature effects neglected here could also reproduce the data. Our model provides a framework from which to proceed for studying thermoreception. It can easily accommodate new physiological data as they become available.
- Our work provides a good starting point for studying the interaction of pacemaker dynamics with noise. Further, in the event that noise is at the origin of skipping, the discussion of the role of noise in Section 8 will likely survive the precise details of future improved ionic descriptions.
- In our model, the effect of temperature on the slow inward current was to increase the rate of the activation kinetics, as the literature on cold receptors suggests that variations in Gslow have a secondary importance. Incorporating variations of Gslow with temperature in our model requires more assumptions on the behavior of other parameters (Section 7.1).
- Our study discusses an attractive alternate mechanism for the sequence shown in Figure 1A, based on the results of Chay and Fan (1993) (Section 7.2). We have investigated the behavior of their model of slow wave bursting as Gslow increases to produce a transition from beating to skipping. The skipping ISIHs exhibit more structure than those in Figure 1. These chaotic solutions likely require stochastic forcing to produce smoother ISIHs. Also, the deterministic skipping is less phase-locked than the stochastic skipping in our model. A better assessment of their model would require a full study of its dynamics as many parameters are varied along with Gslow, following a scheme similar to that in Section 4.3. In our view, such a study would be of great interest.
- Deterministic skipping also occurs in our model at slightly smaller values of Gs. The ISIHs are strongly phase-locked, and their envelope can be nonmonotonic. Noise makes the ISIH envelope monotone decreasing, similar to those in Figure 1.
- Paradoxical cold responses can be obtained in both our model and that of Chay and Fan (Section 7.1).
9.2 Future Work.

- An improvement to our model would include the increase with temperature of all the maximal conductances in equation 4.2. Other parameters may have to change also, such as the Q10s, or the temperature dependence of the pump. The precise form of these changes would have to be surmised from other preparations.
- An important next step is to model the dynamic responses, i.e., the transient responses to temperature changes. These are well documented, and may serve to validate models. This would require proper modeling of the transient behavior of the electrogenic pump currents.
- The patterns of Figure 1 were reproduced by increasing the global rate p of the intracellular calcium kinetics. Perhaps it is sufficient to only vary the rate of calcium sequestration. Preliminary results indicate, however, that this is not the case. One can also think of more detailed modeling of the intracellular calcium kinetics as in, e.g., Canavier et al. (1991) or Chay et al. (1995).
- One can test other hypotheses for noise that involve, e.g., changing D and τs with temperature. The inclusion of conductance fluctuations, e.g., as in Chay and Kang (1988), is an obvious first step. The effect of τs should also be investigated, as it can affect the correlations between the skipping events (Longtin et al. 1994).
- Temporal properties (such as correlations) of the experimental spike trains, near and in the skipping regime, should be compared with those of the simulated spike trains discussed in our paper. Such analyses could include return maps of ISIs and spectral analyses.
- Externally imposed noise could alter the skipping behavior of the receptor (Section 8.1). These changes could be compared to those predicted by models.
- Multistability (Chay and Kang 1987; Canavier et al. 1993; Chay and Fan 1993) may underlie some of the observed variability in cold receptors. It is worthwhile investigating this possibility.
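The exponentially correlated noise appearing throughout (intensity D, correlation time τs) can be generated with the exact one-step update of Fox et al. (1988), cited in the references below; a minimal sketch, with the step size and parameter values chosen for illustration:

```python
import numpy as np

def ou_noise(n_steps, dt, D, tau, seed=0):
    """Exponentially correlated Gaussian noise via the exact update of
    Fox et al. (1988): <e(t)e(s)> = (D/tau) * exp(-|t - s|/tau)."""
    rng = np.random.default_rng(seed)
    rho = np.exp(-dt / tau)                       # one-step correlation
    sigma = np.sqrt((D / tau) * (1.0 - rho ** 2)) # exact innovation s.d.
    e = np.empty(n_steps)
    e[0] = rng.normal(0.0, np.sqrt(D / tau))      # start in stationarity
    for i in range(1, n_steps):
        e[i] = rho * e[i - 1] + sigma * rng.normal()
    return e

noise = ou_noise(n_steps=100_000, dt=0.01, D=0.0025, tau=1.0)
# Stationary variance is D/tau; lag-tau autocorrelation decays to about 1/e.
```

Because the update is exact for any dt, the simulated correlations do not degrade when the integration step is much smaller than τs, which matters when τs is compared to the fastest kinetic time constants.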
Acknowledgments

This work was supported by NSERC Canada, and NIMH (USA) through Grant R01 MH47184-01. The authors wish to thank Leonard Maler, John Rinzel, Wendy Brandts, and Michael Guevara for useful discussions. We would like to thank an anonymous reviewer for suggesting the relevance to our study of the results of Chay and Fan (1993).

References

Adams, W. B., and Benson, J. A. 1985. The generation and modulation of endogenous rhythmicity in the Aplysia bursting pacemaker neurone R15. Prog. Biophys. Mol. Biol. 46, 1-49.
Bade, H., Braun, H. A., and Hensel, H. 1979. Parameters of the static burst discharge of lingual cold receptors in the cat. Pflügers Arch. 382, 1-5.
Barish, M. E., and Thompson, S. H. 1983. Calcium buffering and slow recovery kinetics of calcium-dependent outward current in molluscan neurones. J. Physiol. 337, 201-219.
Bayly, E. J. 1968. Spectral analysis of pulse frequency modulation in the nervous system. IEEE Trans. Bio-Med. Eng. 15, 257-265.
Braun, H. A., Bade, H., and Hensel, H. 1980. Static and dynamic discharge patterns of bursting cold fibers related to hypothetical receptor mechanisms. Pflügers Arch. 386, 1-9.
Braun, H. A., Schäfer, K., Wissing, H., and Hensel, H. 1984a. Periodic transduction processes in thermosensitive receptors. In Sensory Receptor Mechanisms, W. Hamann and A. Iggo, eds., pp. 147-156. World Scientific, Singapore.
Braun, H. A., Schäfer, K., and Wissing, H. 1984b. Theorien und Modelle zum Übertragungsverhalten thermosensitiver Rezeptoren. Funkt. Biol. Med. 3, 26-36.
Braun, H. A., Schäfer, K., and Wissing, H. 1990. Theories and models of temperature transduction. In Thermoreception and Temperature Regulation, J. Bligh and K. Voigt, eds., pp. 19-29. Springer Verlag, Berlin.
Braun, H. A., Wissing, H., Schäfer, K., and Hirsch, M. C. 1994. Oscillation and noise determine signal transduction in shark multimodal sensory cells. Nature (London) 367, 270-273.
Canavier, C. C., Clark, J. W., and Byrne, J. H. 1991. Simulation of the bursting activity of neuron R15 in Aplysia: Role of ionic currents, calcium balance, and modulatory transmitters. J. Neurophysiol. 66, 2107-2124.
Canavier, C. C., Baxter, D. A., Clark, J. W., and Byrne, J. H. 1993. Nonlinear dynamics in a model neuron provide a novel mechanism for transient synaptic inputs to produce long-term alterations of postsynaptic activity. J. Neurophysiol. 69, 2252-2257.
Carpenter, D. O. 1967. Temperature effects on pacemaker generation, membrane potential, and critical firing threshold in Aplysia neurons. J. Gen. Physiol. 50, 1469-1484.
Carpenter, D. O. 1981. Ionic and metabolic bases of neuronal thermosensitivity. Fed. Proc. 40, 2808-2813.
Carpenter, D. O., and Alving, B. O. 1968. A contribution of an electrogenic Na+ pump to membrane potential in Aplysia neurons. J. Gen. Physiol. 52, 1-21.
Chay, T. R. 1983.
Eyring rate theory in excitable membranes: Application to neuronal oscillations. J. Phys. Chem. 87, 2935-2940.
Chay, T. R. 1984. Abnormal discharges and chaos in a neuronal model system. Biol. Cybern. 50, 301-311.
Chay, T. R., and Kang, H. S. 1987. Multiple oscillatory states and chaos in the endogenous activity of excitable cells: Pancreatic β-cells as an example. In Chaos in Biological Systems, H. Degn, A. V. Holden, and L. F. Olsen, eds., pp. 173-181. Plenum, New York.
Chay, T. R., and Kang, H. S. 1988. Role of single-channel stochastic noise on bursting clusters of pancreatic β-cells. Biophys. J. 54, 427-435.
Chay, T. R., and Fan, Y. S. 1993. Evolution of periodic states and chaos in two types of neuronal models. In Chaos in Biology and Medicine, Proc. SPIE 2036, 100-114.
Chay, T. R., and Rinzel, J. 1985. Bursting, beating, and chaos in an excitable membrane model. Biophys. J. 47, 357-366.
Chay, T. R., Lee, Y. S., and Fan, Y. S. 1995. Bursting, spiking, chaos, fractals and universality in biological rhythms. Int. J. Bifurc. Chaos (in press).
Encoding in Mammalian Cold Receptors
Andre Longtin and Karin Hinzer
Received November 8, 1994; accepted June 14, 1995
Communicated by Peter Foldiak
NOTE
Associative Memory with Uncorrelated Inputs

Ronald Michaels
In hybrid learning schemes a layer of unsupervised learning is followed by supervised learning. In this situation a connection between two unsupervised learning algorithms, principal component analysis and decorrelation, and a supervised learning algorithm, associative memory, is shown. When associative memory is preceded by principal component analysis or decorrelation, it is possible to take advantage of the lack of correlation among inputs to associative memory to show that correlation matrix memory is a least-squares solution to the supervised learning problem.

1 Introduction
Hybrid learning schemes employ an unsupervised learning algorithm to transform raw input data into a more useful form. Unsupervised learning is then followed by supervised learning to learn some desired output. Several authors have published unsupervised learning algorithms for principal component analysis (Oja 1992), and Foldiak (1989) has published an algorithm for the decorrelation of input vectors. It is possible to take advantage of the special form of the output from these algorithms in the design of a supervised learning algorithm. In designing that algorithm it is interesting to consider the central point of Fuster (1995): "all memory is associative."

2 Discussion
As a starting point, consider the optimal linear associative mapping (OLAM) of Kohonen (1988). The idea of associative memory may be expressed in a matrix-vector equation:

M x_k = y_k,    k = 1, 2, . . . , p    (2.1)

where x_k ∈ R^n is a zero-mean key vector, y_k is the response vector, and M is the memory matrix. The {x_k} and {y_k} may be combined, as columns, into matrices X and Y. Associative memory may then be expressed as
M X = Y    (2.2)

Neural Computation 8, 256-259 (1996)    © 1996 Massachusetts Institute of Technology
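Equations 2.1 and 2.2 can be checked numerically. The sketch below is illustrative only (all dimensions and data are invented, not from the paper): it stacks hypothetical key and response vectors into X and Y as columns and solves MX = Y by least squares; with p ≤ n linearly independent keys, recall of every stored pair is exact.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3                      # p <= n: the keys can be linearly independent
X = rng.standard_normal((n, p))  # columns are the key vectors x_k
Y = rng.standard_normal((2, p))  # columns are the response vectors y_k

# Solve M X = Y (equation 2.2) in the least-squares sense; with linearly
# independent keys the residual vanishes and recall is perfect.
M = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T

assert np.allclose(M @ X, Y)     # perfect recall of every stored pair
```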
The least-squares error solution for M may be calculated using the pseudoinverse, as shown in equation 2.3:

M̂ = Y X^+    (2.3)

where M̂ is the least-squares error solution for M. In Kohonen's discussion of the OLAM, it was assumed that p ≤ n. If the assumption is made that p > n, then the definition of the pseudoinverse is

X^+ = X^T (X X^T)^{-1}    (2.4)

and not X^+ = (X^T X)^{-1} X^T as for the OLAM. If p > n then the {x_k} cannot be linearly independent, and perfect recall of the associated {y_k} is not possible. Error minimization is the best that can be hoped for.

Now assume that the principal components of the stream of data vectors {x_k} lie along the axes of R^n. This assumption can easily be realized by passing a stream of raw data vectors through a principal component algorithm such as the weighted subspace learning algorithm (Oja 1992) or a decorrelation algorithm (Foldiak 1989), the output of which is the stream of {x_k}. Then, given sufficiently large p, the autocorrelation matrix of the x_k, C, is diagonal with the eigenvalues of C lying along the diagonal. This may be summarized as follows:
C = (1/p) X X^T = diag(λ_1, . . . , λ_n)    (2.5)
This leads to a simple expression for (X X^T)^{-1}:

(X X^T)^{-1} = (1/p) diag(1/λ_1, . . . , 1/λ_n) = (1/p) Λ    (2.6)

where Λ denotes the diagonal matrix with the inverse eigenvalues 1/λ_i on its diagonal.
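As a sanity check on equations 2.4 and 2.6, the sketch below builds a hypothetical key matrix whose rows are exactly decorrelated, as the preceding PCA or decorrelation stage is assumed to provide, and confirms that the p > n pseudoinverse matches NumPy's general-purpose one and that (XX^T)^{-1} is diagonal. All names, scales, and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 4, 12                                       # p > n

# Decorrelated keys: rows of X are orthogonal, with assumed per-row scales.
Q, _ = np.linalg.qr(rng.standard_normal((p, n)))   # orthonormal columns
s = np.array([2.0, 1.5, 1.0, 0.5])                 # per-coordinate scales (assumed)
X = np.diag(s) @ Q.T                               # n x p key matrix

# Equation 2.4: pseudoinverse for the p > n case.
X_pinv = X.T @ np.linalg.inv(X @ X.T)
assert np.allclose(X_pinv, np.linalg.pinv(X))

# Equation 2.6: with decorrelated keys, (X X^T)^{-1} is diagonal with
# entries 1/(p lambda_i), where the lambda_i are the eigenvalues of C.
lam = np.diag(X @ X.T) / p
assert np.allclose(np.linalg.inv(X @ X.T), np.diag(1.0 / (p * lam)))
```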
Now the pseudoinverse may be expressed as follows:

X^+ = X^T (X X^T)^{-1} = (1/p) (Λ X)^T    (2.7)

where each column of the resulting ΛX matrix represents one pattern vector divided elementwise by the eigenvalues of C. Since each eigenvalue represents the variance of one of the elements of x_k, the eigenvalues are locally computable using the recursive method due to Oja (1983), or a variation thereof. Equation 2.3 above may now be written as

M̂ = (1/p) Y (Λ X)^T    (2.8)

which is nothing more than correlation matrix memory with variance normalization. In the case where the input vectors have been decorrelated and have had all variances equalized and scaled to a value of 1.0 as part of the unsupervised learning scheme (Foldiak 1992), then ΛX = X and equation 2.8 reduces to pure correlation matrix associative memory. Note that equation 2.8 is a least-squares error solution for M, but it does not require the matrix multiplication and inversion of the pseudoinverse.

The recursive form of equation 2.8 can be derived as follows. For p pairs of x_k and y_k vectors the solution is

M̂^p = (1/p) Y^p (Λ X^p)^T    (2.9)

For p + 1 pairs of x_k and y_k vectors the solution is

M̂^{p+1} = (1/(p+1)) Y^{p+1} (Λ X^{p+1})^T    (2.10)

Equation 2.10 may be approximated as follows:

M̂^{p+1} ≈ (1/(p+1)) [ p M̂^p + y^{p+1} (Λ x^{p+1})^T ]    (2.11)

Equation 2.11 may then be rewritten as follows:

M̂^{p+1} = M̂^p + (1/(p+1)) [ y^{p+1} (Λ x^{p+1})^T − M̂^p ]    (2.12)

For the recursive version of the algorithm, 1/(p+1) may be considered a gain factor, which may be represented by γ. It is assumed that the eigenvalues of C are recursively updated at each step by a separate algorithm. Equation 2.12 may now be rewritten in the following form:

M̂^{p+1} = M̂^p − γ M̂^p + γ [ y^{p+1} (Λ x^{p+1})^T ]    (2.13)

In equation 2.13 the term −γM̂^p may be considered a "forgetting term." The y^{p+1}(Λx^{p+1})^T term may be considered a Hebbian learning term.
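The recursion of equation 2.13 can be sketched as follows, assuming keys that are already decorrelated with unit variance so that ΛX = X (the Foldiak case); with the gain γ = 1/(p+1) the recursion is exactly a running average and reproduces the batch solution of equation 2.8. Data and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 4, 200

# Stream of keys, assumed decorrelated with unit variance (so Lam = I).
X = rng.standard_normal((n, p))
Y = rng.standard_normal((2, p))

M = np.zeros((2, n))
for k in range(p):
    gamma = 1.0 / (k + 1)                  # gain factor of equation 2.13
    hebb = np.outer(Y[:, k], X[:, k])      # Hebbian term y (Lam x)^T with Lam = I
    M = M - gamma * M + gamma * hebb       # forgetting term plus learning term

# With this gain schedule the recursion is an exact running average and
# reproduces the batch correlation matrix memory of equation 2.8.
assert np.allclose(M, (Y @ X.T) / p)
```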
The convergence rate of the recursive associative memory algorithm depends upon, among other things, the convergence rate of the preceding recursive principal component analysis or decorrelation algorithm and the recursive algorithm used to estimate the variance of the elements of the {x_k}. It should be noted in passing that the above pseudoinverse technique is applicable to the Ho-Kashyap (Ho and Kashyap 1965) algorithm when that algorithm is preceded by principal component analysis.
3 Conclusions

In hybrid learning schemes a layer of unsupervised learning is followed by supervised learning. In this situation a connection between two unsupervised learning algorithms, principal component analysis and decorrelation, and a supervised learning algorithm, associative memory, has been shown. The output vectors of these unsupervised learning schemes have, in the limit, a diagonal autocorrelation matrix. This allows the pseudoinverse to be computed in a very simple manner using local, recursive computations. Using this computation, it has been shown that correlation matrix associative memory is a least-squares solution to the supervised learning problem.

References

Foldiak, P. 1989. Adaptive network for optimal linear feature extraction. Proc. Int. Joint Conf. Neural Networks, 401-405.

Foldiak, P. 1992. Models of Sensory Coding. Tech. Rep. CUED/F-INFENG/TR 91, Physiological Laboratory, University of Cambridge, January.

Fuster, J. M. 1995. Memory in the Cerebral Cortex. MIT Press, Cambridge, MA.

Ho, Y-C., and Kashyap, R. L. 1965. An algorithm for linear inequalities and its applications. IEEE Trans. Electronic Computers EC-14(5), 683-688. Reprinted in: Pattern Recognition, J. Sklansky, ed., pp. 49-54. Dowden, Hutchinson & Ross, Stroudsburg, PA.

Kohonen, T. 1988. Self-Organization and Associative Memory, 2nd ed. Springer Series in Information Sciences. Springer-Verlag, Berlin.

Oja, E. 1983. Subspace Methods of Pattern Recognition. Research Studies Press Ltd., Letchworth, Hertfordshire, England.

Oja, E. 1992. Principal components, minor components, and linear neural networks. Neural Networks 5(6), 927-935.
Received April 10, 1995; accepted June 8, 1995.
NOTE
Communicated by Jurgen Schmidhuber
Statistical Independence and Novelty Detection with Information Preserving Nonlinear Maps

L. Parra, G. Deco, and S. Miesbach
According to Barlow (1989), feature extraction can be understood as finding a statistically independent representation of the probability distribution underlying the measured signals. The search for a statistically independent representation can be formulated by the criterion of minimal mutual information, which reduces to decorrelation in the case of gaussian distributions. If nongaussian distributions are to be considered, minimal mutual information is the appropriate generalization of decorrelation as used in linear principal component analysis (PCA). We also generalize to nonlinear transformations by demanding only perfect transmission of information. This leads to a general class of nonlinear transformations, namely symplectic maps. Conservation of information allows us to consider only the statistics of single coordinates. The resulting factorial representation of the joint probability distribution gives a density estimation. We apply this concept to the real-world problem of electrical motor fault detection, treated as a novelty detection task.

1 Information Preserving Nonlinear Maps
Unless one has a priori knowledge about the environment, i.e., the distribution of the input signals, it can be difficult to find criteria for separating noise from useful information. To extract structure from the signals, one applies statistical decorrelating transformations to the input variables. To avoid a loss of information, these transformations have to preserve entropy. According to Shannon (1948), the entropy of a continuous distribution p(x), with x ∈ R^n, is defined as H(x) = −∫ p(x) ln p(x) dx. Continuous entropy is sensitive to scaling: scaling coordinates changes the amount of information (or entropy) of a distribution. More generally, for an arbitrary mapping y = f(x) on R^n, the condition det(∂f/∂x) = 1 yields H(y) = H(x) (Papoulis 1991), i.e., local conservation of volume guarantees constant entropy from the input x to the output y. To avoid spurious information generated by a transformation, we therefore consider volume-conserving maps, i.e., those with unit Jacobian determinant.

Neural Computation 8, 260-269 (1996)    © 1996 Massachusetts Institute of Technology

The goal of this paper is to present a special neural-network-like structure for building volume-preserving transformations. Two approaches may be used to achieve this goal. First, one may prestructure the neural network in such a way that volume preservation is guaranteed independent of the network weights (Deco and Brauer 1995; Deco and Schürmann 1995). Alternatively, weight constraints may be used to restrict the learning algorithms to volume-conserving network solutions. In this paper we present a new prestructuring technique that is based on symplectic geometry in even-dimensional spaces (n = 2m). The core of symplectic geometry is the idea that certain "area elements" are the analogue of "length" in standard Euclidean geometry (Siegel 1943). Transformations that preserve these area elements are referred to as symplectic. Symplectic transforms also preserve volume. However, the converse is not true, i.e., volume preservation is not sufficient for symplecticity. The advantage of symplectic transforms is the fact that they can be parameterized by arbitrary scalar functions s(z), z ∈ R^{2m}, due to the implicit representation¹

y = x + J ∇s((x + y)/2),    J = [ 0  I ; −I  0 ]    (1.1)

where I denotes the m-dimensional identity matrix, and x, y ∈ R^{2m}. Any nonreflecting symplectic transform {det[I − (∂f/∂x)] ≠ 0} can be generated by an appropriate function s, and also the converse is true: any twice differentiable scalar function, e.g., an arbitrary standard neural network, leads to a symplectic transform in equation 1.1. A discussion of the origin and significance of the structure of equation 1.1 is omitted here, since it gives little insight into the main issue of statistical independence. We use the representation (equation 1.1) from a pragmatic point of view. To obtain a set of symplectic transforms that is as general as possible, we use a one-hidden-layer neural network NN as a general function approximator (Hornik et al. 1989) for the generating function s:
s(z) = NN(z, w, W) = w · g(Wz)    (1.2)

where W denotes the input-hidden weight matrix, w the hidden-output weights, and g the activation function. Equation 1.1 has to be solved numerically. We use either fixed-point iteration or a homotopy-continuation method (Stoer and Bulirsch 1993).
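The implicit construction can be sketched as follows. The code assumes the midpoint generating-function form y = x + J ∇s((x+y)/2) with a small random one-hidden-layer network for s, solves the implicit equation by fixed-point iteration as described, and checks volume conservation through a finite-difference Jacobian; all weights, sizes, and scales are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 2                         # the state lives in R^(2m)
h = 5                         # hidden units (assumed)
W = 0.2 * rng.standard_normal((h, 2 * m))
w = 0.2 * rng.standard_normal(h)

I = np.eye(m)
J = np.block([[np.zeros((m, m)), I], [-I, np.zeros((m, m))]])

def grad_s(z):
    # gradient of s(z) = w . tanh(W z):  W^T (w * (1 - tanh(W z)^2))
    return W.T @ (w * (1.0 - np.tanh(W @ z) ** 2))

def symplectic_map(x, iters=60):
    # fixed-point iteration for the implicit equation (1.1); small weights
    # make the iteration a contraction
    y = x.copy()
    for _ in range(iters):
        y = x + J @ grad_s((x + y) / 2.0)
    return y

# Volume conservation: the numerical Jacobian determinant should be ~1.
x0 = rng.standard_normal(2 * m)
eps = 1e-6
cols = [(symplectic_map(x0 + eps * e) - symplectic_map(x0 - eps * e)) / (2 * eps)
        for e in np.eye(2 * m)]
Jac = np.stack(cols, axis=1)
assert abs(np.linalg.det(Jac) - 1.0) < 1e-4
```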
¹This representation of symplectic maps is a special case of the generating function theory developed in full generality by Feng Kang and Qing Meng-zhao (1985). A proof of the representation (equation 1.1) and a discussion of its role for Hamiltonian systems can be found in Abraham and Marsden (1978) and Miesbach and Pesch (1992).
2 Mutual Information and Statistical Independence
The components of a multidimensional random variable y ∈ R^n are said to be statistically independent if the joint probability distribution p(y) factorizes, i.e., p(y) = ∏_{i=1}^{n} p(y_i). Here, p(y_i) represents the distribution of the individual coordinates y_i, i = 1, . . . , n of the random variable y. Statistical independence can be measured in terms of the mutual information

MI(y) = −H(y) + Σ_{i=1}^{n} H(y_i)    (2.1)

Zero mutual information indicates statistical independence. Here, H(y_i) = −∫ p(y_i) ln p(y_i) dy_i denotes the single coordinate entropies. In the case of gaussian distributions, linear decorrelation, i.e., diagonalizing the correlation matrix of the output y, has been proven to be equivalent to minimizing the mutual information (Papoulis 1991) and corresponds to the standard principal component analysis (PCA) method. However, for general distributions, decorrelation does not imply statistical independence of the coordinates. Starting from the principle of minimum mutual information, Deco and Brauer (1994) formulated criteria for decorrelation by means of higher-order cumulants. A similar approach, which considers the distance to the gaussian distribution (standardized mutual information) but restricts itself to linear transformations, was studied by Comon (1994). Redlich (1993) suggested the use of reversible cellular automata in the context of nonlinear statistical independence. Instead of preserving information, the invertibility of the map was considered. While invertibility indeed assures constant information when dealing with discrete variables, for continuous variables conservation of volume is necessary. In the case of binary outputs, maximum mutual information has been proposed instead (Schmidhuber 1992; Deco and Parra 1994). In the context of the blind separation problem, Bell and Sejnowski (1994) proposed a technique for the separation of continuous output coordinates with a single layer perceptron. But the authors admit that the information maximization criterion they use does not necessarily lead to a statistically independent representation. In parallel, Nadal and Parga (1994) based this idea on a more rigorous discussion. In this paper, we make use of the more general principle of minimal mutual information (statistical independence) instead of the decorrelation used in PCA.
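The gaussian case can be illustrated numerically: for a correlated 2-D gaussian the mutual information is −(1/2) ln(1 − ρ²), and rotating onto the principal axes (PCA) drives it to zero, since H(y) is invariant under rotations. The data, correlation value, and sample size below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

# Correlated 2-D gaussian with correlation coefficient rho (assumed).
rho = 0.8
C = np.array([[1.0, rho], [rho, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], C, size=200_000)

def gaussian_mi(samples):
    # sum of coordinate entropies minus joint entropy, gaussian formulas
    c = np.cov(samples.T)
    return 0.5 * np.log(c[0, 0] * c[1, 1] / np.linalg.det(c))

mi_before = gaussian_mi(X)
_, V = np.linalg.eigh(np.cov(X.T))
Y = X @ V                       # rotate onto the principal axes (PCA)
mi_after = gaussian_mi(Y)

assert abs(mi_before + 0.5 * np.log(1 - rho**2)) < 0.02   # matches -0.5 ln(1 - rho^2)
assert mi_after < 1e-6                                    # decorrelated => MI ~ 0
```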
For the symplectic map, the identity H(x) = H(y) holds, and therefore we are left with the task of minimizing the sum of the single coordinate entropies (the second term on the right-hand side of equation 2.1). Since we are given only a set of data points, drawn according to the output distributions, this is still a difficult task. But fortunately, there is a feasible upper bound for these entropies (Parra et al. 1995),

Σ_{i=1}^{n} H(y_i) ≤ Σ_{i=1}^{n} (1/2) ln[2πe ⟨(y_i − ⟨y_i⟩)²⟩]    (2.2)
where ⟨y_i⟩ = ∫ p(y_i) y_i dy_i. Using only the second-order moments for estimating the mutual information might be seen as a strong simplification. At the expense of computational efficiency, higher-order cumulants may be included to increase accuracy. An interesting property of equation 2.2 is that, if the transformation y = f(x) is flexible enough, this cost function will produce gaussian distributions at the output. Using a variational approach it can be shown that under the constraint of constant entropy a circular gaussian distribution minimizes the sum of variances in equation 2.2 (Parra et al. 1995). This will be useful for the density estimation addressed next. We will observe there some limitations of the continuous volume-conserving map in transforming arbitrary distributions into gaussians. The training of the network (equation 1.2) can be performed with standard gradient descent techniques. The gradient of the output coordinates with respect to the parameters of the map can be calculated by implicitly differentiating equation 1.1. This leads to a system of linear equations for the gradient. The overall computational complexity of the optimization algorithm is then O(n^4) for each data point. This restricts this approach to a low-dimensional space (in practice n ≤ 30).

3 Density Estimation and Novelty Detection
If one knows that a joint distribution factorizes, then the problem of finding an estimation of the joint probability p(x) in an n-dimensional space is reduced to the task of finding the one-dimensional probability distributions p(y_i). As stated before, the gaussian upper bound cost function favors gaussian distributions at the output, provided that the symplectic map is general enough to transform the given distribution. Figure 1 demonstrates this ability. If the training succeeds, we might estimate the distributions by the straightforward assumption of independent gaussian distributions at the output:

p(y) = ∏_{i=1}^{n} (2πσ_i²)^{-1/2} exp[−(y_i − ⟨y_i⟩)² / (2σ_i²)]    (3.1)

Estimation then reduces to the measurement of the output variances σ_i². We now address the closely related task of novelty detection. Given a set of samples corresponding to a prior distribution, one has to decide whether or not a new sample corresponds to this distribution. Putting it into other words, the question is: "How probable is an observed new
Figure 1: Nonlinear correlated and nongaussian joint input distribution (left) is transformed into almost independent normal distributions (right). The input distribution was generated by mapping a one-dimensional exponential distribution with additive gaussian noise onto a circle. The cost function was reduced by 68% in 300 training steps. The "network" contained 6 parameters (w ∈ R² and W ∈ R^{2×2}).
sample according to what we have seen so far?" Given a certain decision threshold, novelty detection is based on the corresponding contour of the density of the data points previously seen. If the contour is required for an arbitrary threshold, we need the complete estimation of the density. As a solution to this problem we propose the presented symplectic factorization with the a posteriori gaussian density estimation (equation 3.1). The decision surface for the novelty detection is then just a hypersphere in the output of the symplectic map after reducing the mutual information according to the given sample set. Figure 2 demonstrates this idea. The symplectic map was trained to reduce mutual information on the samples +. The samples o are to be discriminated. The procedure transforms the output distribution to a gaussian distribution as closely as possible, to use a circular contour of the density as a decision boundary. As a side effect, volume conservation tends to separate regions not belonging to the training set from those corresponding to it. The former regions are mapped far away from the gap area. Obviously, taking a circular decision measure at the output distribution will give a fair solution. We show the performance of the proposed technique in Figure 2 (right) by showing the standard graph of misclassification and false-alarm rates. For this illustrative example we
could also obtain good results with a simple gaussian mixture (Duda and Hart 1973) of two gaussian spots. This example also demonstrates one of the possible limitations of the technique as a general density estimation procedure. Perfect transformation into a single gaussian spot requires a singularity to map the two spots arbitrarily close together. Because of the property of local conservation of volume, vanishing distance in one direction implies unbounded stretching in the orthogonal direction, which will not be possible with a continuous map. More generally speaking, the combination of a continuous and volume-conserving map together with a unimodal distribution is best suited for distributions spread over connected regions rather than for disjoint distributions. For the novelty detection this behavior is clearly an advantage, since it separates known distributions from unknown regions.

Figure 2: +, training samples; o, test samples. Left: input signals; center: output signals of the trained symplectic map. The symplectic map partially transforms a bimodal training distribution into a unimodal distribution. The map used again 6 parameters. Ellipses indicate possible classification boundaries for the + samples. Right: rate of misclassification and false alarm. We used in both cases (input and output) an elliptical distance measure as the decision criterion for novelty, i.e., we classify as "normal" all points lying within an elliptical area around the center of the "normal" training set. All others are classified as "novel." The decreasing curve gives the false-alarm rate, while the increasing curves denote the rate of missing the "novel" data points.
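The hyperspherical decision rule of this section can be sketched as follows, assuming the symplectic map has already produced approximately independent gaussian output coordinates (here simulated directly); the threshold is set from the training scores for a fixed false-alarm rate, and a shifted "novel" cluster is flagged. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated map outputs: "normal" samples are assumed gaussian after training;
# "novel" samples come from a shifted cluster.
normal_out = rng.normal(0.0, 1.0, size=(5000, 2))
novel_out = rng.normal(4.0, 1.0, size=(5000, 2))

mu = normal_out.mean(axis=0)
var = normal_out.var(axis=0)

def radius2(y):
    # hyperspherical decision statistic in the variance-normalized output space
    return np.sum((y - mu) ** 2 / var, axis=1)

threshold = np.quantile(radius2(normal_out), 0.95)   # fix a 5% false-alarm rate
miss_rate = np.mean(radius2(novel_out) <= threshold)

assert miss_rate < 0.05        # the shifted cluster is almost always flagged
```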
4 Motor Fault Detection
In this section, we show that the proposed concept of novelty detection provides encouraging results in a high-dimensional real-world problem. In motor fault detection, the task consists of noting early irregularities
Figure 3: Left: distribution of the 2 first principal components demonstrates a clear nonlinear dependency. Center: resulting distribution of the same first 2 components after reducing the redundancy in the first 10 components. Right: for any component higher than the 15th no pairwise dependency could be observed. Here, we arbitrarily plot components 50 and 100.
in electrical motors by monitoring the electrical current. The spectrum of the current is used as a feature vector. The motor failure detector is trained with data supplied by a healthy motor and should indicate if the motor is going to fail. Typically, one deals here with at least 100 and up to 1000 dimensions. Applying the outlined procedure to the complete feature vector is not manageable because of the high computational costs of our training procedure. On the other hand, it is hard to believe that 100 or more coordinates are altogether nonlinearly correlated. More likely, we expect most of the coordinates to be (if at all) only linearly correlated. Therefore, we first transform the spectrum with a linear PCA. We use 230 coordinates in the spectrum between 20 and 130 Hz. We observed that a few of the first principal components are nonlinearly correlated. No pairwise nonlinear structure could be observed between coordinates other than the first 10 or 15 principal components. We assume that all other principal components are uncorrelated, unimodal, and symmetrically distributed. They can be fairly well approximated by a normal distribution (see Fig. 3). We know that for normal distributions, linear decorrelation is the best that can be done to minimize mutual information. Therefore, we can assume that these lower principal components are statistically independent. We apply the symplectic factorization only to the first few components. Figure 3 shows how two of the first 10 principal components have been transformed by a 10-20-10 symplectic map trained with 800 samples (w ∈ R^{20}, W ∈ R^{20×10}). The net reduced the variance by 65% in 650 training steps.
Statistical Independence and Novelty Detection
Figure 4: Left: maximum measure on the 230-dimensional principal component space. Center: circular distance measure on the 10 symplectic mapped first linear principal components. Right: combined symplectic and linear feature space: 10 symplectic transformed first PCs and the last 7 PCs of the normalized spectrum. The decreasing curves give the false alarm rate. Each of the three increasing curves gives the rate of missing the fault for a different fault situation (-, no fault; ···, bearing race hole; - -, unbalance; broken rotor bar).

Now we use this result to classify "good" vs. "bad" motors, according to equation 3.1. Since the performance may vary for different types of faults, we plot the performance curve for the three failure modes occurring in our test data (unbalance, bearing race hole, and broken rotor bar). In Figure 4 we compare the performance of a maximum measure (max_i(y_i − ⟨y_i⟩)) on the complete 230-dimensional principal component space (left) with that obtained using only the gaussian estimates of the 10 nonlinearly transformed coordinates (center). Furthermore, we analyze to what extent a given coordinate separates the "good motor" and "bad motor" distributions by measuring the ratio of the corresponding variances. This analysis reveals that by including the low variance linear normalized PCA coordinates the classification measure can be improved further. With "normalized" we express the fact that we normalize the variance before performing PCA. Best results were obtained by including between 5 and 20 low variance PCA coordinates (see Fig. 4, right). One possible measurement of the quality of the classification technique is the decision error at the optimal decision threshold.
The proposed technique achieves a decision error of 10±0.5%. This result is comparable with different approaches that have been applied to this problem at SCR,² including, among others, MLP (11%) and RBF (10%) autoassociators, nearest neighbor (18-32%) and hypersphere (37%) clustering, PCA (12%), or maximum measure (in roughly 2000 dimensions) (11%).

²Siemens Corporate Research, Inc., 755 College Road East, Princeton, NJ 08540.

L. Parra, G. Deco, and S. Miesbach
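The decision error at the optimal threshold can be estimated by sweeping a threshold over a scalar novelty score and taking the best trade-off between false alarms and misses. The sketch below uses synthetic scores, not the actual motor data, and the error measure is an illustrative choice rather than the authors' exact one:

```python
import numpy as np

def decision_error(good_scores, bad_scores):
    """Sweep a decision threshold over the novelty scores of 'good' and 'bad'
    examples and return the smallest mean of false-alarm and miss rates."""
    thresholds = np.sort(np.concatenate([good_scores, bad_scores]))
    best = 1.0
    for t in thresholds:
        false_alarm = np.mean(good_scores > t)   # healthy motor flagged as faulty
        miss = np.mean(bad_scores <= t)          # faulty motor passed as healthy
        best = min(best, 0.5 * (false_alarm + miss))
    return best

rng = np.random.default_rng(1)
good = rng.normal(0.0, 1.0, 1000)    # low novelty scores for healthy motors
bad = rng.normal(3.0, 1.0, 1000)     # high novelty scores for faulty motors
print(round(decision_error(good, bad), 3))
```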
5 Conclusions

The factorization of a joint probability distribution has been formulated as a minimal mutual information criterion under the constraint of volume conservation. Volume conservation has been implemented by a general class of nonlinear transformations, the symplectic maps. A gaussian upper bound leads to a computationally efficient optimization technique and favors normal distributions at the output as optimal solutions. This, in turn, facilitates density estimation, and can be used particularly for novelty detection. The proposed technique has been applied successfully to the real-world problem of motor fault detection.
Received July 25, 1994; accepted May 8, 1995.
Communicated by Robert Jacobs
Neural Network Models of Perceptual Learning of Angle Discrimination

G. Mato and H. Sompolinsky
We study neural network models of discriminating between stimuli with two similar angles, using the two-alternative forced choice (2AFC) paradigm. Two network architectures are investigated: a two-layer perceptron network and a gating network. In the two-layer network all hidden units contribute to the decision at all angles, while in the other architecture the gating units select, for each stimulus, the appropriate hidden units that will dominate the decision. We find that both architectures can perform the task reasonably well for all angles. Perceptual learning has been modeled by training the networks to perform the task, using unsupervised Hebb learning algorithms with pairs of stimuli at a fixed angle θ. Perceptual transfer is studied by measuring the performance of the network on stimuli with θ′ ≠ θ. The two-layer perceptron shows a partial transfer for angles that are within a distance a from θ, where a is the angular width of the input tuning curves. The change in performance due to learning is positive for angles close to θ, but for |θ − θ′| ≈ a it is negative, i.e., its performance after training is worse than before. In contrast, negative transfer can be avoided in the gating network by limiting the effects of learning to hidden units that are optimized for angles that are close to the trained angle.

1 Introduction
The ability of animals and humans to carry out perceptual tasks, such as discrimination of two similar stimuli, improves with practice (Walk 1978). One of the most interesting features of this improvement is that it is stimulus selective. For instance, learning to discriminate between two gratings with given orientations or spatial frequencies does not lead to improvement for substantially different orientations or spatial frequencies (Fiorentini and Berardi 1981). Similarly, learning to determine the sign of the offset in a vertically oriented vernier stimulus does not improve the performance for a horizontally oriented vernier stimulus (Poggio et al. 1992). This limited transfer to different stimulus parameters
suggests that the learning is due to changes in early stages of the sensory pathway, where stimuli characterized by very different parameters are represented by different neurons. As the properties of the neurons in these early stages are relatively well known, especially in the visual cortex (Orban 1984), we can attempt to use this information to study possible neural mechanisms of perceptual learning in these systems. In this work we investigate neural network models of 2AFC discrimination of a pair of stimuli characterized by angles θ and θ + δθ. The parameter θ, which takes values from −π to π, represents, for instance, the direction of motion of a contour of a visual image. We assume that the performance is limited by the neuronal noise, i.e., the ambiguity induced by the stochastic responses of the neurons to the stimuli. Thus, the level of performance will depend on the ratio between the stimulus separation δθ and the internal neuronal noise. For concreteness, we will assume Poisson statistics for the neuronal noise. The first issue we address is which network architecture is capable of performing the task. To assess the quality of the network performance we will compare it with the performance of a discriminator based on a maximum-likelihood (ML) decision, which will be described in Section 2. The performance of the simplest network model, i.e., a single-layer perceptron that performs a threshold-linear operation on its inputs, has been studied in Seung and Sompolinsky (1993). In that work it has been shown that a perceptron can reach the ML performance for a single stimulus parameter θ. However, the perceptron cannot yield the ML performance over a range of angles. A perceptron that makes the optimal decision for one angle yields suboptimal decisions for other angles. Moreover, we will show that any perceptron will yield the wrong answer more than 50% of the time, in some range of angles, regardless of the level of noise.
In the language of learning theory, the angle discrimination task over the whole range of angles is not realizable by a single-layer perceptron (see, e.g., Hertz et al. 1991; Sompolinsky and Barkai 1993). For this reason, and unless one assumes some mechanism of rapid modification of the synaptic weights, it is necessary to adopt a more complex network architecture for modeling this perceptual task. In this work we will consider two network architectures: a feedforward network with one layer of hidden units and a gating network. Both architectures can perform the task for the whole range of angles in the case of small noise. We have studied numerically the optimal average performance of these networks in the presence of noise. These results and their comparison with the ML performance will be presented in Section 3. The main focus of this work is on issues related to the learning of this task by the networks, particularly the phenomenon of perceptual transfer. We assume that the networks are trained to perform the discrimination task using Hebbian learning rules with examples of pairs of stimuli that have a fixed angle θ. The initial state of the networks is such that it yields a reasonably good uniform performance over the whole angular range. In
G. Mato and H. Sompolinsky
this case, perceptual transfer is defined as the change in the performance of the system due to the training, for values of θ that are different from the training angle. The models of perceptual learning in the two networks and their qualitatively different behavior with regard to perceptual transfer will be presented in Section 4. We find that the multilayer perceptron displays negative perceptual transfer, i.e., a worsening of the performance for angles different from the one that has been trained. In contrast, we find that the gating network does not display this phenomenon. In the last section we discuss the results and possible extensions of the models.

2 Maximum Likelihood Discrimination
We consider the task of discrimination of angles in two dimensions in the 2AFC paradigm. The stimuli are visual images that are characterized by an angle θ, −π ≤ θ ≤ π. This angle could represent, for instance, the direction of motion of the image. Two stimuli, with angles θ and θ + δθ, respectively, are presented, one after the other, in one of two possible orders. The task of the system is to find out the order of presentation, e.g., the output should be +1 if the first stimulus was θ + δθ and −1 for the reverse order. We assume that the visual input is represented by the responses of N noisy, angle-tuned neurons. These responses are denoted by a vector r of integers r_j, j = 1, …, N, where r_j denotes the number of spikes emitted by the jth neuron during a fixed period of time following the stimulus onset. The responses of the neurons to each stimulus are assumed to be independent random processes, with the probability distribution

P(r | θ) = ∏_{j=1}^{N} P(r_j | θ)   (2.1)
The maximum-likelihood (ML) procedure for discriminating between the two alternatives consists of evaluating P₁ = P(r | θ + δθ)P(r′ | θ) and P₂ = P(r | θ)P(r′ | θ + δθ), where r and r′ are the numbers of spikes emitted by the neurons during the presentation of the first and second stimuli, respectively. The ML decision is +1 if P₁ > P₂ and −1 if P₁ < P₂. In the limit of large population size, N, the probability of mistake of this rule is given by
ε = H(d′/√2)   (2.2)

where H(x) = (2π)^{−1/2} ∫_x^∞ e^{−t²/2} dt. The parameter d′ is the discriminability of the stimuli, and is equal to

d′ = δθ √J   (2.3)
where J is the Fisher information

J = ⟨ (∂ ln P(r | θ)/∂θ)² ⟩   (2.4)
where the brackets ⟨· · ·⟩ denote the average with respect to the probability distribution of equation 2.1. The Fisher information measures the total amount of information about the stimulus θ that is contained in the noisy response vector r. It can be shown that in the limit of large N, the ML procedure is optimal in the sense that it is unbiased and minimizes the square of the decision error (Seung and Sompolinsky 1993). We will consider responses r_i that consist of discrete events. They are assumed to be variables described by the Poisson distribution

P(r_i = k | θ) = e^{−h_i(θ)} h_i(θ)^k / k!   (2.5)

where we have denoted the mean (and the variance) of r_i by ⟨r_i⟩ = ⟨(δr_i)²⟩ = h_i(θ). The function h_i(θ) will be called the tuning curve of the ith input neuron. The maximum of h_i(θ), denoted by θ_i, will be called the preferred angle (PA) of the ith neuron. We assume that all the neurons have the same tuning curve but with different preferred angles, i.e., h_i(θ) = h(θ − θ_i), and also that the PAs are distributed uniformly, so that θ_j = 2πj/N − π, (j = 1, …, N). The difference between the maximum and minimum values of h(θ) will be denoted by n. In the following we will use the normalized tuning curve f(θ), defined by

f(θ) = h(θ)/n   (2.6)

so that the difference between the maximum and the minimum values of f(θ) is 1. The factor n is a measure of the magnitude of the mean response. The discriminability for the Poisson case in the large N limit is given by

d′ = A [ (1/2π) ∫_{−π}^{π} f′(φ)²/f(φ) dφ ]^{1/2}   (2.7)

and

A = δθ (nN)^{1/2}   (2.8)

The discriminability d′ represents the signal-to-noise ratio of the system. The quantity A is the factor in the signal-to-noise ratio that does not depend on the form of the tuning curve. The result, equation 2.2, is valid when d′ is of order 1, i.e., when δθ is of order 1/√(nN).
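As a numerical illustration of these quantities, one can evaluate the Fisher information of a Poisson population directly and convert it to d′ and an error probability. The periodic-bump tuning curve below is our own choice, since the specific curve of equation 2.9 is not legible in this excerpt, and the 2AFC error is taken as H(d′/√2):

```python
import numpy as np
from math import erf

def H(x):
    """Gaussian tail function H(x) = ∫_x^∞ e^{-t²/2} dt / √(2π)."""
    return 0.5 * (1.0 - erf(x / np.sqrt(2.0)))

# Population of N Poisson neurons, preferred angles tiled uniformly over (-π, π].
N, n, a, f_min = 50, 50.0, 1.0, 0.01
theta_pref = 2.0 * np.pi * np.arange(1, N + 1) / N - np.pi

def f(phi):
    # normalized tuning curve: an illustrative periodic bump of width ~a
    return f_min + (1.0 - f_min) * np.exp((np.cos(phi) - 1.0) / a**2)

def fprime(phi):
    return -(np.sin(phi) / a**2) * (1.0 - f_min) * np.exp((np.cos(phi) - 1.0) / a**2)

def fisher_info(theta):
    """J = Σ_i h_i'(θ)² / h_i(θ) for independent Poisson neurons, h_i = n f(θ - θ_i)."""
    d = theta - theta_pref
    return n * np.sum(fprime(d) ** 2 / f(d))

dtheta = np.deg2rad(3.0)                     # 3° stimulus separation
dprime = dtheta * np.sqrt(fisher_info(0.0))  # discriminability d' = δθ √J
error = H(dprime / np.sqrt(2.0))             # assumed 2AFC error formula
print(dprime, error)
```

Doubling δθ doubles d′ and, since H is monotonically decreasing, lowers the error, matching the statement that the error vanishes when the signal-to-noise ratio is large.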
Figure 1: Normalized tuning curve f(θ), equation 2.9, with a = 1 (rad) and f_min = 0.01.
An example which we will use frequently in this paper is the following input tuning curve (shown in Fig. 1)

(2.9)

where a is the width of the tuning curve and f_min is the rate of spikes in the background divided by n. For this tuning curve, and assuming f_min << 1, the discriminability is
(2.10)

2.1 Representation and Noise. In general, the error of a discriminator can arise from two sources: the limitation in the representation and the noise. The problem of representation refers to cases where the architecture of the system is limited in such a way that it cannot implement perfectly the desired decision rule, even in the limit where the signal-to-noise ratio of the responses is large. Systems that can perform the correct
decision in the limit of small noise will make errors in the presence of noise due to the unavoidable confusion caused by the fluctuations in the responses. In such systems the error vanishes if the signal-to-noise ratio is large. As demonstrated by equation 2.2, the ML discriminator is limited only by the noise; its error vanishes when the relative strength of the noise vanishes. This occurs when the two stimuli are well separated, i.e., δθ is large compared to 1/√(nN), or when the magnitude of the mean response, n, is large. The latter can be achieved to some degree by increasing the duration of the stimulus presentation, up to the time window over which the stimulus is integrated by the input neurons. Finally, we note that as the PAs of the neurons are isotropically distributed between −π and π, the ML error is independent of the angle θ of the stimuli.

3 Neural Network Models

3.1 Single-Layer Perceptron. ML is the optimal solution but it is not obvious how to implement it in the framework of neural networks. The first question to be considered is the representation of the stimuli. Devos and Orban (1990) assume that the angles of the first and second stimuli activate different input units, and that the stimulus representations used by these two input layers are different. We will use a more symmetric representation of the two stimuli. The output of all the input units will be taken to be the difference of the responses r and r′, generated by the two stimuli when they are presented individually. This means that the input neurons perform a temporal gradient of their inputs. This operation can also be viewed as utilizing an appropriate short-term memory mechanism. In a previous work (Seung and Sompolinsky 1993) the performance of the simplest feedforward network, i.e., the single-layer perceptron, has been studied. The perceptron performs a weighted sum of the outputs of the input units and thresholds it. This can be written as

σ = sign(h)   (3.1)
where h is the internal field of the perceptron,

h = Σ_{j=1}^{N} w_j (r_j − r_j′)   (3.2)
and w_j are the weights of the jth input unit. The probability of error that the perceptron will make on a pair of stimuli with an angle θ is given by

ε(θ) = ⟨Θ(−σ₀h)⟩   (3.3)

where σ₀ denotes the correct output, i.e., σ₀ = +1 if the first stimulus was θ + δθ and σ₀ = −1 otherwise. The function Θ(x) is the step function,
i.e., Θ(x) = 0 for x < 0 and Θ(x) = 1 for x > 0. The angular brackets ⟨· · ·⟩ denote an average with respect to the probability distribution of equation 2.5. To evaluate the performance of the perceptron in the limit of large N, we note that for large N, the field h generated by pairs of stimuli with a fixed angle θ has a gaussian distribution with mean value

⟨h(θ)⟩ = n δθ σ₀ Σ_{j=1}^{N} w_j f_j′(θ)   (3.4)

and variance

⟨δh²(θ)⟩ = 2n Σ_{j=1}^{N} w_j² f_j(θ)   (3.5)
Performing the average of equation 3.3, using the gaussian distribution of h, yields equation 2.2 but with a discriminability that is given by

d′(θ) = A Σ_j w_j f_j′(θ) / [ N Σ_j w_j² f_j(θ) ]^{1/2}   (3.6)

where A is given as before by equation 2.8. The set of weights that minimizes the probability of error in discriminating stimuli with angle θ is found by maximizing equation 3.6, yielding

w_j(θ) = f_j′(θ)/f_j(θ)   (3.7)

Substituting these weights in equation 3.6 we find that the discriminability, and hence also the average error, of the optimal perceptron is the same as in ML. However, the performance of the perceptron cannot be optimal for more than a single stimulus angle. In particular, the discriminability of a perceptron that is optimized for an angle θ with respect to stimuli with an angle θ′ is given by

d′(θ′; θ) = A Σ_j w_j(θ) f_j′(θ′) / [ N Σ_j w_j(θ)² f_j(θ′) ]^{1/2}   (3.8)
where w_j(θ) are given by equation 3.7. Using this expression and equation 2.2, the average error of this perceptron can be evaluated. In Figure 2 we plot this error as a function of the stimulus angle θ′. The results show that not only is the performance for θ′ ≠ θ suboptimal, but there is a range of angles for which the probability of error is larger than 0.5, namely, the system is doing worse than random. Improving the signal-to-noise ratio, for instance by increasing δθ or increasing n, leads to an even larger probability of error. In fact, for any fixed set of weights, there is a range of angles for which the perceptron behaves worse than random, for all levels of noise. This can be seen from the fact that the integral of equation 3.4 over the whole range of θ is zero, implying that the output of the perceptron will necessarily differ from σ₀ for some range of θ. To realize this task we need to consider more complex networks.
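The worse-than-random argument can be checked numerically: with the weights of equation 3.7 tuned to θ = 0, the mean field, which is proportional to Σ_j w_j f_j′(θ), integrates to zero over the full circle, so it must take the wrong sign somewhere. In this sketch the periodic tuning curve is our own illustrative choice:

```python
import numpy as np

N, a, f_min = 50, 1.0, 0.05
theta_pref = 2.0 * np.pi * np.arange(1, N + 1) / N - np.pi

def f(phi):
    # periodic bump tuning curve (illustrative choice)
    return f_min + (1.0 - f_min) * np.exp((np.cos(phi) - 1.0) / a**2)

def fprime(phi):
    return -(np.sin(phi) / a**2) * (1.0 - f_min) * np.exp((np.cos(phi) - 1.0) / a**2)

# optimal weights for the training angle θ = 0 (equation 3.7)
w = fprime(-theta_pref) / f(-theta_pref)

# mean internal field (up to the positive factor n δθ) vs. the stimulus angle
thetas = np.linspace(-np.pi, np.pi, 721)
mean_field = np.array([np.sum(w * fprime(t - theta_pref)) for t in thetas])

# positive (correct) at the trained angle, but negative (wrong) in some range
print(mean_field[360] > 0, mean_field.min() < 0)  # → True True
```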
Figure 2: Error probability vs. stimulus angle θ for the perceptron optimized for angle 0. N = 50, a = 1 (rad), f_min = 0.01, δθ = 3°, and n = 50.

3.2 Two-Layer Perceptron. We consider here a feedforward network consisting of an input layer with N units, a single hidden layer with M units, and a single output unit (see Fig. 3). The input units have the same tuning properties as in the previous section, namely their output is given by r_j − r_j′. Each hidden unit is a sigmoidal perceptron,

s_i = g(h_i)   (3.9)
where h_i is the internal field of the ith hidden unit,

h_i = Σ_{j=1}^{N} w_{ij} (r_j − r_j′)   (3.10)

and w_{ij} is the connection between input unit j and hidden unit i. The sigmoid function g(h) will be chosen for convenience as g(h) = tanh(h). The output of the system will be assumed to be

σ = sign( Σ_{i=1}^{M} s_i )   (3.11)
Figure 3: Architecture of the two-layer perceptron.

Note that we consider here a special two-layer network, in which all the weights from the hidden layer to the output unit are equal. The advantage of this restricted architecture is that it is easier to interpret its operation, since the system's decision consists of a majority vote of the hidden perceptrons, similar to the committee machine architecture (Sompolinsky and Barkai 1993). The two-layer perceptron described above can realize the angle discrimination task in the limit of small noise because it combines the signals from M perceptrons. In a region where one perceptron has the wrong output there will be others with the correct one that will cancel its effect. To see how the network can operate in the limit of small noise, it is sufficient to consider the mean values of the internal fields of the hidden units, equation 3.10, generated by a pair of stimuli with angle θ. For large N, they are equal to

⟨h_i(θ)⟩ = n δθ σ₀ Σ_{j=1}^{N} w_i(θ_j) f′(θ − θ_j)   (3.12)

where the weights between the input units and the hidden units have been expressed as w_{ij} = w_i(θ_j). Thus, input units with w_i′(θ) < 0 provide a contribution with the correct sign, whereas those with w_i′(θ) > 0 provide a wrong signal. For a given hidden unit, the total contributions will yield the correct signal for some θ and incorrect ones for other values, as we have shown in the case of a single perceptron. However, here it is sufficient that for all θ,

σ₀ Σ_{i=1}^{M} g(⟨h_i(θ)⟩) > 0   (3.13)
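A sufficiency condition of this kind can be checked numerically in the mean-field (low-noise) limit, here for hidden units with uniformly translated saw-tooth weight profiles, the construction discussed below. The tuning curve and the overall sign convention (fixed so that negative-slope weight regions vote for the correct answer, as stated above) are our own illustrative assumptions:

```python
import numpy as np

N, M, a = 100, 21, 1.0
theta_in = 2.0 * np.pi * np.arange(N) / N - np.pi      # input-unit PAs
theta_hid = 2.0 * np.pi * np.arange(M) / M - np.pi     # hidden-unit offsets

def wrap(phi):
    return (phi + np.pi) % (2.0 * np.pi) - np.pi

def W(phi):
    # saw-tooth weight profile: negative slope everywhere, one jump per period
    return -wrap(phi)

def f(phi):
    # periodic bump tuning curve (illustrative choice)
    return np.exp((np.cos(phi) - 1.0) / a**2)

def fprime(phi):
    return -(np.sin(phi) / a**2) * f(phi)

def network_output(theta, sigma0=+1.0, gain=1.0):
    """Majority vote of the hidden units in the mean-field limit (Poisson
    fluctuations dropped); sign fixed so negative-slope regions vote correctly."""
    fields = np.array([
        -sigma0 * np.sum(W(theta_in - t_i) * fprime(theta - theta_in))
        for t_i in theta_hid
    ])
    return np.sign(np.sum(np.tanh(gain * fields)))

thetas = np.linspace(-np.pi, np.pi, 180, endpoint=False)
print(all(network_output(t) == +1.0 for t in thetas))  # correct vote at every angle
```

Each hidden unit is still wrong in a window of angles, but the saturated majority vote over the M translated copies recovers the correct decision for all θ.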
Figure 4: Example of a weight pattern for the two-layer perceptron that solves the discrimination task in the limit of low noise. Shown are the weights between the input units and two of the hidden units, vs. the PA of the input units. Each weight pattern is a piecewise linear function of the PA. The weights of different hidden units are translated versions of each other (see equation 3.14).

This can be achieved if each weight function w_i(φ) is such that its derivative is negative for most of the range of φ. In addition, the different functions w_i(φ) are arranged so that in the region where one of them has a positive derivative most of the others have a negative one, so that they can compensate the wrong signal. A simple example is provided by choosing uniformly displaced weight patterns for the hidden units,

w_i(φ) = W(φ − θ_i)   (3.14)

where θ_i = 2πi/M and the function W(φ) is a saw-tooth function (see Fig. 4). For this weight pattern, equation 3.12 yields

(3.15)
where f̄ = ∫ f(φ) dφ/2π. This result assumes that f is not too wide, that M is sufficiently large, and that the hidden units are saturated. The minimal number of hidden units that is needed depends on the particular form of f. Of course there are many other solutions for performing the task in the limit of weak noise. In particular, different hidden units may have different profiles of input weights, and the input weight profile may have more than one narrow region of positive derivative. It is important to point out the role of the nonlinearity of the hidden units. The regions of the weights with w_i′(φ) > 0 will give a wrong output. The absolute value of the field from these regions will be larger than that from the regions with w_i′(φ) < 0, because the integral of the internal field from −π to π must be zero. But this effect is suppressed by the saturating nonlinearity of the hidden units, as was shown in the above example.

3.3 Optimal Performance of the Two-Layer Perceptron. Here we consider the performance of the two-layer perceptron in the presence of noise. The optimal network will be defined as the one that minimizes ε̄, the probability of error averaged over all angles,
ε̄ = (1/2π) ∫_{−π}^{π} ε(θ) dθ   (3.16)

where ε(θ) is given by equation 3.3. To find the optimal network, we use the following on-line gradient descent algorithm. In each iteration an angle θ is chosen at random with uniform distribution in [−π, π] and a pair of stimuli r and r′ is generated at random according to equation 2.5. The weights are updated according to the stochastic gradient descent rule

w_{ij}^{new} = w_{ij} − η ∂E/∂w_{ij}   (3.17)

where w_{ij}^{new} are the updated weights. The energy E is a quadratic cost function for each example,

E = (1/2)(σ₀ − σ̃)²   (3.18)

where σ₀ is the correct output for the current example and, in order for the gradient of E to be well defined, we use in E a sigmoidal version of the network output, i.e., σ̃ = tanh(g Σ_i s_i). Here g is a gain parameter that is taken to be relatively large so that the final output neuron is close to saturation. This update rule is repeated for each new example, and the average error is monitored. The algorithm is stopped when the observed average error saturates. We would like to emphasize that the supervised rule, equation 3.17, was used to find the optimal two-layer perceptron for the discrimination task, but it will not be used to model the actual process of perceptual learning. The algorithm for perceptual learning will be introduced in Section 4.
Figure 5: Error probability for the optimal two-layer network with N = 50, M = 11, as a function of θ, for δθ equal to 1°, 3°, and 18° (top to bottom).
In Figure 5 we show the performance of a network with N = 50 input units and M = 11 hidden units, obtained using the above minimization procedure, with η = 0.01 and g = 1. The number of training examples was P = 50,000, and they were recycled 500 times. The tuning curves have a = 1 (rad), f_min = 0.01, δθ = 3°, and n = 50. The figure displays the probability of error for stimuli with angle θ, as a function of θ, for three test values of δθ. The performance is better than random and improves with increasing the test value of δθ. In the limit of noiseless inputs the error goes to zero. The performance is relatively uniform in θ, except for large δθ, where the relative nonuniformity is enhanced by the fact that the mean error is extremely small. The performance of the two-layer perceptron was insensitive to the details of the minimization algorithm, such as changing the sigmoidal form of the output or changing the training parameters, η, δθ, or n. However, it does depend on the number of hidden units. In Figure 6 we show the error probability averaged over all the angles as a function of 1/√M, with all other parameters held fixed, including the input size,
Figure 6: Average error as a function of 1/√M for N = 50 and n = 50. (a) The two-layer network. The weights have been calculated by iterating equation 3.17 with P = 50,000 stimuli, generated by the Poisson distribution, equation 2.5, with uniformly sampled θ, and δθ = 3°. The training set was recycled 500 times. The step size is η = 0.01. The initial values of the weights are chosen from a gaussian distribution. The average error is measured by averaging the network's error over a set of randomly sampled test stimuli with the same distribution as the training set. The result was further averaged over 5 realizations of the training algorithm; each corresponds to different samples of initial weights and stimuli. The line is a linear fit, whose intercept at the origin equals 0.031. The error of ML is 0.018 (dotted line). (b) The gating network. Each hidden unit is an optimal perceptron. The parameters of the gating system are described in the text.
N = 50. The extrapolation to M → ∞ yields a minimum error of about 0.03. It is interesting to note that this asymptotic value is larger than the ML error, which for the parameters given above yields (via equations 2.7 and 2.2) ε_ML ≈ 0.018. Nevertheless, we have checked that the network obeys the same scaling as the error of ML, in the sense that it depends on the parameters n, N, and δθ only through the signal-to-noise ratio.
Figure 7a: Optimal weights between the input units and the first hidden unit for a two-layer perceptron with N = 50, M = 11, n = 50, and a = 1 (rad). The weights are plotted as a function of the PA of the input units.

In Figure 7 we plot the weights between the input unit with PA θ and three of the hidden units. The weights corresponding to different hidden units are not the same, even though they receive the same input during the learning process. The reason for this is that if the weights for all the hidden units were the same, the system would be equivalent to a perceptron, which is not a minimum of the cost function (equation 3.18). However, to find the solution of Figure 7, the weights must be initialized with different values for different hidden units, because otherwise the system would always be constrained to the subset of identical weights for different hidden units. We can also observe from Figure 7 that the different sets of weights have a similar shape. This shape is characterized by a negative derivative in two wide ranges of angles, and a steep positive derivative in two narrow intervals. This profile is qualitatively similar to the solutions we have discussed in the previous section for the zero noise case. Nevertheless, it is interesting to note that the symmetry between the weights of different hidden units is not exact. To check whether this asymmetry is a
Figure 7b: Optimal weights between the input units and the second hidden unit for a two-layer perceptron with N = 50, M = 11, i f = 50, and a = 1 (rad). The weights are plotted as a function of the PA of the input units. consequence of the small number of hidden units we have measured the asymmetry by evaluating the variation between the hidden units of the largest negative derivative interval of their weights. The standard deviation of these fluctuations was found not to decrease significantly with x limit the symmetry increasing M, suggesting that even in the M between the hidden units is broken. The width of the input tuning curve, n, has an important effect on the structure of the weights. Using l7 = 1.1 bad) instead of a = 1 we find that some of the hidden units have only one positive derivative regime, as shown in Figure 8. As we increase a the proportion of hidden units with this profile increases and for n = 1.5 all units have oniy one positive derivative regime, similar qualitatively to the zero-noise example of Figure 4. Decreasing a, to values significantly less than 1 yields units with three or more positive derivative regimes (not shown). Summarizing, the weight profiles are a compromise between two factors: increasing the number of positive regions increases the absolute value of the derivative of the weights in the regions where the signal has the correct sign
Perceptual Learning of Angle Discrimination
Figure 7c: Optimal weights between the input units and the third hidden unit for a two-layer perceptron with N = 50, M = 11, n = 50, and a = 1 (rad). The weights are plotted as a function of the PA of the input units.

(keeping the absolute size of the weights constant). On the other hand, it increases the regions that contribute the wrong signal. As each of the narrow positive derivative regions contributes a wrong signal from an angular region of size a (see equation 3.12), increasing a tends to reduce their number.

3.4 Gating Network. We have seen above that whereas discrimination of stimuli around a single angle is relatively simple and can be performed well by a single-layer perceptron, discrimination over a wide range of angles is a complicated task. The two-layer perceptron solves the problem in a distributed manner. One of the main characteristics of that system is the fact that the tuning of the hidden units is broad. For practically all angles the discrimination is performed by the summed outputs of all the hidden units. This distributed mode of operation has a significant consequence for the perceptual transfer properties of the system, as will be discussed in the following section. Here we present an
Figure 8: Optimal weights between the input units and one of the hidden units for a = 1.2 (rad). The other parameters are as in Figure 7.
alternative architecture for performing the discrimination task. The gating network consists of two types of units: an array of estimators and an array of local discriminators. The estimators identify roughly the angle θ of the stimuli, and gate the discrimination network, i.e., decide which of the local discriminators will be assigned the discrimination task. Thus, the network simplifies the discrimination task by splitting it into discriminations about narrow angular regimes, which can be solved relatively easily. A similar architecture has recently attracted considerable interest (Jordan and Jacobs 1994). The architecture of the gating network is shown in Figure 9a. It consists of N input units and M hidden units with the same properties as in the two-layer perceptron above. However, here the M hidden units are local discriminators. They are assigned M angles, θ_i = 2πi/M, i = 1, …, M, which denote the range of operation of each discriminator.

Figure 9: (a) Architecture of the gating network. (b) Architecture of the gating system.

To implement the local discrimination, the output of the network is not given by equation 3.11 but by
σ = sign( Σ_{i=1}^M c_i S_i )    (3.19)

The M non-negative numbers c_i are the outputs of the gating network. The gating system is shown in Figure 9b. It consists of an array of M perceptrons that perform a weighted sum of the N inputs, r_j. Specifically, we choose

c_i = exp(x_i) / Σ_{k=1}^M exp(x_k)    (3.20)
where the internal fields x_i are given by

x_i = β ( Σ_{j=1}^N J_ij r_j − t_i )    (3.21)

where J_ij are the weights from the jth input to the ith gating unit, t_i are thresholds, and β is a gain factor. We now have to determine the values of the discriminators' weights w_ij, the gating weights J_ij, and the thresholds t_i appropriate for solving the angle discrimination task. Following our analysis above, we choose for the ith local discriminator the weights that are optimal for a perceptron discriminating around an angle θ_i, i.e.,

w_ij = f′(θ_i − φ_j) / f(θ_i − φ_j)    (3.22)

(see equation 3.7), where φ_j denotes the PA of the jth input unit. The gating system's parameters should be such that for a stimulus angle near θ_i, c_i will be much bigger than c_j, j ≠ i, so that S_i will dominate the decision made by σ (equation 3.19). One way to achieve this is to demand that x_i be proportional to the log-likelihood that the inputs r were generated by a stimulus θ_i, which by Bayes' theorem reduces to

x_i = β ln P(r | θ_i) + C    (3.23)
where C is an arbitrary constant. In general, this result may not be achieved by a linear sum, such as equation 3.21, and more complex gating units than single-layer perceptrons will be needed. However, in the special case of the Poisson distribution, equation 3.23 can be achieved in our architecture by choosing

J_ij = ln f(θ_i − φ_j),    t_i = Σ_{j=1}^N f(θ_i − φ_j)    (3.24)

Comparing with equation 2.5, we see that our choice is equivalent to

c_i ∝ [P(r | θ_i)]^β    (3.25)
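As a numerical sanity check (not from the paper), one can verify that with the choice of equation 3.24 the softmax coefficients of equations 3.20–3.21 coincide with the normalized Poisson likelihoods of equation 3.25. The tuning curve f used below is an assumed circular-Gaussian bump; the identity itself only uses Poisson statistics.

```python
import numpy as np
from math import lgamma

def f(psi, n=50.0, a=1.0):
    # Assumed bell-shaped tuning curve of amplitude n and width a (rad).
    return n * np.exp((np.cos(psi) - 1.0) / a**2)

M, N, beta = 11, 50, 1.0
theta = 2 * np.pi * np.arange(M) / M            # discriminator angles theta_i
phi = 2 * np.pi * np.arange(N) / N              # PAs of the input units
F = f(theta[:, None] - phi[None, :])            # mean rates f(theta_i - phi_j)

J, t = np.log(F), F.sum(axis=1)                 # gating parameters, equation 3.24
r = np.random.default_rng(1).poisson(f(0.3 - phi))  # Poisson responses, stimulus at 0.3 rad

x = beta * (J @ r - t)                          # internal fields, equation 3.21
c = np.exp(x - x.max()); c /= c.sum()           # softmax coefficients, equation 3.20

# Direct Poisson log-likelihoods ln P(r | theta_i); the r_j! term is theta-independent,
# so it drops out of the softmax and c_i is proportional to P(r | theta_i)^beta (eq. 3.25).
logL = (r * np.log(F) - F).sum(axis=1) - sum(lgamma(k + 1) for k in r)
p = np.exp(beta * (logL - logL.max())); p /= p.sum()
assert np.allclose(c, p)
```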
The sharpness of the gating is determined by the gain parameter β. If β = 1, c_i is the probability that input r has been generated by the angle θ_i.
In the limit of large N the probability distribution has a width O(1/√N), and one c_i will be much larger than all the others, namely the one that corresponds to the angle θ_i with minimal distance from the input angle θ. On the other hand, if β = O(1/nN), the gating will be broad. All local discriminators with angles θ_i that differ from the input angle θ by an amount smaller than a will contribute to the decision. Thus, assuming large M, the width of the gating changes from O(2π/M) when β = O(1) (sharp gating) to O(a) when β = O(1/nN) (broad gating). In the presence of noise, the level of performance of the gating network depends on parameters such as β, M, and a. In fact, from the above argument it follows that in the Poisson case, if we choose β = 1 and take the large M limit, the network performs ML estimation of θ, which is followed by ML discrimination. We thus expect that the network performance will approach the ML performance level in the limit of large M. This is indeed confirmed by our simulation results for the average error of this network, as shown in Figure 6b. Finally, it should be pointed out that unlike the two-layer perceptron, the gating network uses as inputs not only the differences in the responses, r_j − r′_j, but also the individual responses r_j. These two sets are used by two different portions of the gating network. The differences are used in the local discriminators, whereas the individual responses r_j are used by the units that compute the coefficients c_i. As we assume that all pairs of stimuli have small angular separations δθ, the precise choice of inputs to the gating units is not important. Instead of r_j, we could use r′_j or (r_j + r′_j)/2.
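Putting equations 3.19–3.22 and 3.24 together, the full gating decision can be sketched as below. The tuning curve f (a circular-Gaussian bump of amplitude n and width a) and the form S_i = sign(w_i · (r − r′)) for the local discriminator outputs are assumptions for illustration; the paper specifies only the weight ratio f′/f and the gated sum for the output.

```python
import numpy as np

def f(psi, n=50.0, a=1.0):
    # Assumed bell-shaped tuning curve of amplitude n and width a (rad).
    return n * np.exp((np.cos(psi) - 1.0) / a**2)

def fprime(psi, n=50.0, a=1.0):
    return -np.sin(psi) / a**2 * f(psi, n, a)

def gating_decision(r, rp, M=11, N=50, beta=1.0):
    """Decision of the gating network for responses r, rp to the two stimuli."""
    theta = 2 * np.pi * np.arange(M) / M          # discriminator angles theta_i
    phi = 2 * np.pi * np.arange(N) / N            # PAs of the input units
    psi = theta[:, None] - phi[None, :]
    w = fprime(psi) / f(psi)                      # optimal local weights, eq. 3.22
    S = np.sign(w @ (r - rp))                     # local discriminator outputs (assumed form)
    J, t = np.log(f(psi)), f(psi).sum(axis=1)     # gating parameters, eq. 3.24
    x = beta * (J @ r - t)                        # internal fields, eq. 3.21
    c = np.exp(x - x.max()); c /= c.sum()         # gating coefficients, eq. 3.20
    return np.sign(np.sum(c * S))                 # gated decision, eq. 3.19
```

With β = O(1) the coefficients c concentrate on the discriminator nearest the stimulus angle, so the decision is effectively made by that local discriminator alone.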
4 Perceptual Learning
In this section we address the problem of perceptual learning for the systems described above. We consider a network that has a reasonable but suboptimal performance for all angles, and improve its ability to discriminate between stimuli in a narrow range about a particular angle θ. We then measure the perceptual transfer of learning by evaluating the system's probability of error for angles that differ from the trained angle θ.

4.1 The Two-Layer Perceptron
4.1.1 Initial State. We first implement the above program in the two-layer network. The initial state of the network is chosen by the gradient descent rule of equation 3.17, with stimuli that have a relatively large separation, i.e., δθ = 9°, and input angles distributed uniformly between −π and π. The number of learning iterations and the size of the learning step, η, are chosen so that at the end the system has a very small average discrimination error ε (equation 3.16) when δθ = 9° (ε = 0.03), but the error for δθ = 2° is high, ε = 0.35. As the system has been trained for all the angles, the performance is roughly uniform. The weights of this initial condition are a "noisy" version of the optimal weights of Figure 7.

4.1.2 Learning Algorithm. We now train the system to improve its performance using training examples with δθ = 2° and θ = 0°. For this phase of training, which models the process of perceptual learning in a
psychophysical experiment, we do not use the algorithm of equation 3.17. The reason is that this update rule is a supervised rule, since it depends on the correct output signal, which is assumed to be provided with each input (see equation 3.18). Since perceptual learning is known to occur even without an external error signal (Fiorentini and Berardi 1981; Ball and Sekuler 1987), we use in this phase an unsupervised learning algorithm. Specifically, after randomly generating a set of inputs r and r′ according to the distribution of equation 2.5, all the weights are incremented using the following unsupervised Hebbian rule:
w′_ij = w_ij + γ (r_j − r′_j) S_i − η w_ij r_j    (4.1)
The first term is proportional to the product of the outputs of the presynaptic jth input unit and the ith hidden unit. The last term is a weight decay term, with a decay constant that depends on the input r_j. Here again one can replace r_j by r′_j or (r_j + r′_j)/2. The reason for choosing this input-dependent weight decay is to ensure that if we run this algorithm for a long time (with examples drawn with the same angle, say θ = 0°), it will converge to the optimal weights for this angle, w_ij ∝ f′_j(0°)/f_j(0°), provided γ and η are sufficiently small. This can be checked by equating the left-hand side of equation 4.1 with w_ij and replacing the inputs by their average values. Note that in this asymptotic state the weights w_ij are independent of i, namely all the hidden units converge to the same perceptron, which is optimized for the training angle. It should be noted that our model uses supervised training for the initial state, and unsupervised learning for the perceptual learning stage. Since the initial state is presumably achieved by learning with large signal-to-noise ratio (in our model, large δθ), our scenario is consistent with the idea that when the psychophysical task is "easy" the system has an internal error signal (Weiss et al. 1993).

4.1.3 Transfer Curve. The performance of the network after training with δθ = 2°, as a function of the angle θ (the transfer curve), is shown in Figure 10. The probability of error increases as the distance from the trained angle 0° increases, reflecting the stimulus specificity of the learning. However, up to a distance of approximately a from the origin, the error is still smaller than the initial baseline. This implies that there is partial transfer of learning, the range of which is determined by the width of the input tuning curve. However, for larger distances from the origin, the error continues to grow and becomes larger than 0.5, indicating a performance that is worse than random.
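The training loop of equation 4.1, with Poisson input generation in the spirit of equation 2.5, can be sketched as follows. The tuning curve f, the hidden-unit form S_i = sign(w_i · (r − r′)), and the learning rates are assumptions for illustration (the printed values of γ and η are partly illegible in the source).

```python
import numpy as np

def f(psi, n=50.0, a=1.0):
    # Assumed tuning curve for generating the Poisson responses of eq. 2.5.
    return n * np.exp((np.cos(psi) - 1.0) / a**2)

def hebbian_step(w, r, rp, gamma=5e-5, eta=1.5e-5):
    """One unsupervised update of equation 4.1 for all M x N weights at once."""
    S = np.sign(w @ (r - rp))                        # hidden-unit outputs (assumed form)
    # Hebbian term gamma*(r_j - r'_j)*S_i plus input-dependent decay -eta*w_ij*r_j
    return w + gamma * np.outer(S, r - rp) - eta * w * r

rng = np.random.default_rng(0)
M, N, dtheta = 11, 50, np.deg2rad(2.0)               # training separation delta-theta = 2 deg
phi = 2 * np.pi * np.arange(N) / N                   # PAs of the input units
w = rng.normal(scale=0.1, size=(M, N))               # "noisy" initial weights (placeholder)

for _ in range(100):                                  # repeated presentations around theta = 0
    r = rng.poisson(f(0.0 - phi)).astype(float)
    rp = rng.poisson(f(dtheta - phi)).astype(float)
    w = hebbian_step(w, r, rp)
```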
Furthermore, if we test the performance on stimuli with a larger separation, the systematic negative bias increases, as shown in Figure 10. The appearance of systematic error after the learning stage stems from the fact that during learning with a single stimulus angle, the whole system converges fast to the optimal perceptron for the training angle and
Figure 10: Error probability vs. θ for the two-layer network (same parameters as in Figure 5) after training the system around θ = 0 with the learning rule of equation 4.1. Dashed line: the initial performance for δθ = 2°. Solid line: performance after training, error measured with δθ = 2°. Dot-dashed line: performance after training, error measured with δθ = 9°. The parameters for the learning algorithm are δθ = 2°, P = 5,000, γ = 5 × 10^{−…}, and η = 1.5 × 10^{−5}. Error evaluated by averaging over 50 realizations of input responses.

the ability to perform discrimination for very different angles is completely lost. One way to avoid this catastrophe is to assume that there is a constant active internal "refreshing" mechanism that generates feedback error signals that prevent the development of a systematic negative bias. To model such a scenario, we have added to the unsupervised learning (equation 4.1) with a single angle a low rate of supervised updates (equation 3.17), with examples generated by input angles uniformly distributed between −π and π. Consistent with our previous remarks, the supervised signals are generated with "easy" stimuli, i.e., with the same relatively large values of δθ that were used for reaching the initial state (in our case δθ = 9°). These infrequent updates will have negligible effect on the improvement for angles close to 0°, as the errors there are
small anyway, but they will have a strong effect at angles far away from 0°, where the unsupervised learning tends to generate systematic error. As a result, this low-rate supervised uniform learning will prevent the full recruitment of all hidden units to the neighborhood of 0°. Instead, several hidden unit weights will remain relatively unchanged from their initial state. By tuning the relative rates of stimulus-specific learning and supervised uniformly distributed updates, we have found parameters that successfully avoid the systematic negative bias. An example is shown in Figure 11. Nevertheless, even in this case the level of error for intermediate angles is larger than it was before training.

Figure 11: Error probability vs. θ for the two-layer network (N = 50, M = 11, a = 1 (rad), and n = 50) using a mixture of unsupervised learning and low rates of supervised updates. For details see text. Dashed line: initial performance. Solid line: performance after training, error measured with δθ = 2°. Dot-dashed line: after training, error measured with δθ = 9°. The other parameters are as in Figure 7. Average over 50 realizations.

To conclude, we find that perceptual learning in the two-layer network results in an improvement of performance in the neighborhood of the trained angle, a phenomenon that we term positive perceptual transfer. The range of positive transfer is set by the tuning width of the input units, a. For angles far away from the trained angle the performance is the same as its initial level. On the other hand, at an intermediate range of angles there is a negative perceptual transfer, meaning that the recruitment of units to improve the performance in the trained regime results in a worsening of the performance compared with its initial level.

4.2 The Gating Network. Our model for perceptual learning in the gating network assumes that the parameters of the gating units are fixed at their correct values, given by equations 3.20 and 3.24, and that the perceptual learning process affects only the input weights of the discriminators. This makes it possible to avoid the convergence of all the hidden units to the same optimal perceptron, because the gating variables c_i can also be used to contain the effect of the learning process. We thus use the following unsupervised Hebbian rule
w′_ij = w_ij + c_i [ γ (r_j − r′_j) S_i − η w_ij r_j ]    (4.2)
with examples from stimuli with θ = 0°. As in the previous case, the initial values of w_ij are determined by a supervised learning rule, similar to equation 3.17, with examples generated uniformly over the whole angular range. The cost function is similar to equation 3.18, but with σ ≡ tanh(β Σ_i c_i S_i). As we have mentioned in the previous section, in the limit of sharp gating, for any stimulus parameter θ there will be one c_i that is much larger than all the others. This is the one that corresponds to the angle θ_i with minimal distance from θ. As the derivative of the cost function with respect to w_ij contains a factor c_i, the weights of one of the hidden units will be updated by a much larger factor than all the others. The updating would eventually converge to the optimal perceptron for the angle θ_i. When different angles are presented during the training process, different hidden units are chosen. If the algorithm is applied for a long time, all the units would converge to the optimal perceptron for their corresponding angle θ_i, and the performance would be optimal. To reach the desired initial condition (in which the performance is uniform but not optimal) we stop the supervised learning algorithm at a relatively early stage, so that the system does not reach the optimal performance. The weights in the initial condition are a mixture of the optimal perceptron for each angle and noise coming from the initial condition. In general, the outcome of the model depends on the network parameters. However, its transfer properties are easily understood in the limit of large N and sharp gating, i.e., β = O(1). In this case, essentially only the local discriminator with θ_i ≈ 0° will be affected by the learning stimulus angle θ. When a stimulus with |θ| > π/M is presented, the output will be determined by a hidden unit that has not been changed during the learning process. Consequently, the error for that angle will
Figure 12: Error probability vs. θ for the gating network (N = 50, M = 11, a = 1 (rad), n = 50, and β = 1) and the learning rule of equation 4.2. Dashed line: initial performance for δθ = 2°. Solid line: performance after training, error measured with δθ = 2°. The parameters for the learning algorithm are δθ = 2°, P = 5,000, γ = 5 × 10^{−…}, and η = 2.5 × 10^{−…}. Average over 50 realizations.

be the same as before training. Thus, in this limit the range of any perceptual transfer is 2π/M. If M is large compared to 2π/a, then negative transfer will be avoided. An example of a transfer curve for the above model of perceptual learning in the gating network is shown in Figure 12. The results exhibit only positive transfer, in agreement with the above considerations. In general, suppression of negative transfer can also be achieved in the more realistic regime of broad gating [β = O(1/nN)], provided that the angular tuning width of c_i is less than about a/2. On the other hand, if the gating width is greater than or equal to a, then negative transfer will be seen. In this work we have assumed for simplicity that the parameters of the gating units are fixed, whereas the local discriminators can change. We have verified that similar results are obtained if we allow learning also in the gating parameters.
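The gated update of equation 4.2 differs from equation 4.1 only in that each hidden unit's change is scaled by its gating coefficient c_i, which confines learning to the discriminators tuned near the trained angle. A minimal sketch, with an assumed form S_i = sign(w_i · (r − r′)) for the hidden-unit outputs and placeholder learning rates:

```python
import numpy as np

def gated_hebbian_step(w, r, rp, c, gamma=5e-5, eta=2.5e-5):
    """One update of equation 4.2: the Hebbian rule of equation 4.1 gated by c_i.

    w : (M, N) discriminator weights;  c : (M,) gating coefficients (eq. 3.20).
    """
    S = np.sign(w @ (r - rp))                        # local discriminator outputs (assumed)
    # Each unit's Hebbian-plus-decay update is multiplied by its own c_i.
    return w + c[:, None] * (gamma * np.outer(S, r - rp) - eta * w * r)
```

With sharp gating (β = O(1)), c is nearly one-hot, so only the discriminator nearest the stimulus angle is updated appreciably; units with c_i ≈ 0 keep their weights, which is what suppresses negative transfer.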
5 Discussion

We have addressed the problem of solving an angle discrimination task using simple neural network models. We have focused on limitations in performing the task that are induced by the noise in the neuronal representations of the stimuli. Under general plausible assumptions about the noise, the degree of difficulty of performing the task decreases upon increasing two experimentally controlled parameters: the angular separation between the pair of stimuli, δθ, and the amplitude of the response, n. The latter can be varied by, e.g., varying the duration of the stimulus, or its contrast. In addition, the signal-to-noise ratio increases with the number of input neurons, N, that represent the stimuli, neglecting correlations between the fluctuations in their responses. Our formal analysis utilized the limit N → ∞. In practice, our analysis applies to N > 30, as has been verified by our numerical simulations. In this work we have focused on the 2AFC discrimination paradigm. Our first goal was to find simple feedforward networks that can perform the task for all angles. We have shown that although the single-layer perceptron can perform the task when the stimuli are concentrated around one angle, the task is unrealizable by a single-layer perceptron even in the limit of large signal-to-noise. The discrimination problem can be thought of as classifying the N-dimensional inputs generated by the stimuli in the two possible temporal orders. Our result means that the input vectors of stimuli from all angles are not linearly separable (see, e.g., Hertz et al. 1991), even if the scatter induced by the noise is neglected. In principle, one way out of this problem is to assume that there is a fast learning mechanism by which, following the presentation of a stimulus, the perceptron weights can rapidly adapt to the angular neighborhood of the current stimulus.
For example, in a previous work (Seung and Sompolinsky 1993), it has been implicitly assumed that a fast adaptive mechanism exists that provides information about the rough range of angles of the stimulus, thereby avoiding the appearance of negative d′ of equation 3.6. This amounts to using the absolute value of d′ in the evaluation of the perceptron error probability (equation 2.2). This approach prevents the appearance of systematic error, but yields a nonmonotonic transfer curve (see Figure 3 of Seung and Sompolinsky 1993). Furthermore, it can be shown that it will still predict negative transfer in some range of angles. This behavior is exhibited by the two-layer linear network, based on the population vector, studied in Seung and Sompolinsky (1993). In the present work we chose not to resort to ad hoc assumptions of fast adaptive mechanisms. Instead, we assume more complex network architectures that are capable of implementing the task without fast adaptation. We have shown that a multilayer perceptron network with one hidden layer can achieve reasonably good uniform performance. Interestingly, extrapolation of our results indicates that even in the limit when the number of hidden units, M, is large, the average error does not reach the
level achieved by the ML discriminator. At present, we are not aware of any theoretical result regarding the capability of two-layer perceptrons to approximate ML discrimination. However, it is quite possible that our result is due to the restricted range of M used in the extrapolation, or that it reflects a limitation inherent in the gradient descent algorithm, equation 3.17, we have used to find the optimal weights. In addition, we have used the restricted architecture of a committee machine (Sompolinsky and Barkai 1993), where all the output weights have the same fixed value. However, we have found similar results also in simulations of a fully modifiable two-layer network with a backpropagation algorithm (Hertz et al. 1991). The second feedforward architecture we have studied is that of a gating network, which consists of an array of M local discriminators, each optimized to discriminate in a narrow range of angles, and an array of M gating units that select for each stimulus the appropriate discriminator by estimating the angular range of the stimulus. We have shown that this network can approach the ML average performance in the limit of large M, but this result relied on our construction, equations 3.22 and 3.24, which is appropriate for the Poisson distribution. We do not know whether the gating network, with the simple architecture of Figure 9, is capable of achieving the ML performance for other distributions. Irrespective of the quantitative difference in their performance, the two-layer perceptron and the gating network differ qualitatively in their mode of operation. The two-layer network solves the problem in a distributed manner by arranging the weights of the different hidden units in such a way that an error made by one of them will be compensated by the others. The optimal arrangement of weight patterns that efficiently achieves this error correction depends on a.
For input tuning widths that are not too broad, a ≤ 1 (rad), the weight pattern has two maxima in a cycle (see Figure 7). This would show up as two maxima in the tuning curves of the hidden units. In contrast, the gating network operates by combining estimation with local discrimination. Hence the weight patterns of the discriminators are similar to that of a perceptron optimized for a single angle, and they are therefore less sensitive to the value of a. In fact, the discriminators are predicted to have a tuning curve with a single maximum irrespective of a. The performance of the two-layer network is similar to the ML discriminator in that, for small δθ, both require only the temporal difference in the input responses to the two stimuli, i.e., r − r′. Consequently, their performance depends on the parameters δθ, n, and N only through the combination δθ√(nN). In contrast, to perform the estimation, the gating network uses the response vector r in addition to the difference vector r − r′. Consequently, its performance may improve if n or N increases even if δθ is changed simultaneously, so that δθ√(nN) is unchanged. This prediction can be checked experimentally, by changing n and N. The above consideration holds in the case of large N, as was assumed throughout
this work. Angle estimation by small neuronal populations has recently been studied (Salinas and Abbott 1994). The second goal was to study how the proposed networks learn the task. In particular, we examine how learning in one angular neighborhood affects performance in other regimes. A central problem in this work is the question of negative transfer. This is defined as a decrease in the performance for angles different from the trained one, relative to the performance before training. Our paradigm for perceptual learning consists of choosing a network with an initial set of weights that yields good uniform performance for easy tasks, namely high signal-to-noise ratio, and then training it with inputs drawn from stimuli in one narrow angular range. For the training algorithm we choose simple unsupervised Hebbian rules (equations 4.1 and 4.2). The most striking feature in the resultant perceptual transfer curve of the two-layer perceptron is the appearance of negative transfer, namely a range of angles for which the performance after learning was worse than before. This occurs for angles that are separated from the training angle by approximately a. The reason for this behavior is that the learning process affects almost all the hidden units equally, and tends to change their weights so that all of them come to resemble a single-layer perceptron. This perceptron works fine around the trained angle but is inadequate for dissimilar ones. This effect was quite robust to variations in the learning algorithm, including using a supervised learning algorithm (again with a single angle). In contrast, negative transfer can be avoided in the gating network. This is because the same gating system that operates during the discrimination task can limit the effect of the training to those hidden units that are tuned to the trained angle, thereby leaving intact the performance for different values of θ.
In general, suppression of negative transfer in the gating network requires that the gating width be smaller than about a/2. This implies the existence of units with relatively sharp angular tuning. Thus, neurons with sharp tuning could be candidates for such a gating system. In addition, the sharp gating implies that the range of (positive) perceptual transfer is narrower than a. This should be contrasted with the two-layer perceptron, where the extent of perceptual transfer is roughly a. It would be interesting to study systematically the general interplay between the parameters a, M, and β. Such a study may shed further light on the general properties of gating networks (Jordan and Jacobs 1994). It would be interesting to test the possible existence of negative transfer in psychophysical experiments. In fact, in Ball and Sekuler (1987), Fahle and Edelman (1993), and Weiss et al. (1993) there is some evidence for this phenomenon, although it is observed for only some of the subjects that take part in the experiments. Further experimental studies could clarify the situation. In the present work, we assumed that the input representation of
angles is uniform, so that the system has an underlying rotational symmetry, which is broken only by the external stimulus. Thus, we have ignored the "oblique effect," which denotes the experimental finding that discrimination is better for vertical and horizontal directions than for oblique directions (Appelle 1972; Heeley and Timney 1988). It would be interesting to study the effect of this phenomenon on our model and results. Another class of models that has been introduced to study perceptual learning is based on local basis functions (Poggio et al. 1992; Weiss et al. 1993). These models also display a limited transfer to parameters very different from the trained one. At present it is not clear whether these models will display negative transfer in a certain regime of parameters. This may depend on the assumptions about the learning rule incorporated in the models. It would be interesting to study in more detail the perceptual transfer in these models and compare them with the results of the neural network models studied here. Finally, we note that in this work we focused on the discrimination of angle variables, which have the salient feature of periodicity. It would be interesting to extend our study to discrimination tasks involving other stimulus parameters, such as spatial frequency or velocity. The properties of the system in the general cases are expected to depend on the form of the underlying input tuning curves. In particular, we expect behavior similar to the present case if all the input tuning curves have a nonmonotonic bell shape. This may not be the case for systems with tuning curves that are monotonic functions of the stimulus parameter. Extending our study to general discrimination tasks will presumably involve the incorporation of a mixture of different types of tuning curves, as was done in the study of discrimination of stereo disparity (Lehky and Sejnowski 1990).

References

Appelle, S. 1972.
Perception and discrimination as a function of stimulus orientation: The "oblique effect" in man and animals. Psychol. Bull. 78, 266-278.

Ball, K., and Sekuler, R. 1987. Direction-specific improvement in motion discrimination. Vision Res. 27, 953-965.

Devos, M., and Orban, G. A. 1990. Modeling orientation discrimination at multiple reference orientations with a neural network. Neural Comp. 2, 152-161.
Fahle, M., and Edelman, S. 1993. Long-term learning in vernier acuity: Effects of stimulus orientation, range and of feedback. Vision Res. 33, 397-412.

Fiorentini, A., and Berardi, N. 1981. Learning in grating waveform discrimination: Specificity for orientation and spatial frequency. Vision Res. 21, 1119-1158.
Heeley, D. W., and Timney, B. 1988. Meridional anisotropies of orientation discrimination for sine wave gratings. Vision Res. 28, 337-344.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.

Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Comp. 6, 181-214.

Lehky, S. R., and Sejnowski, T. J. 1990. Neural model of stereoacuity and depth interpolation based on a distributed representation of stereo disparity. J. Neurosci. 10, 2281-2299.

Orban, G. A. 1984. Neuronal Operations in the Visual Cortex. Springer-Verlag, Berlin.

Poggio, T., Fahle, M., and Edelman, S. 1992. Fast perceptual learning in visual hyperacuity. Science 256, 1018-1021.

Salinas, E., and Abbott, L. F. 1994. Vector reconstruction from firing rates. J. Comput. Neurosci. 1, 89-107.

Seung, H. S., and Sompolinsky, H. 1993. Decoding of distributed neural codes. Proc. Natl. Acad. Sci. U.S.A. 90, 10749-10753.

Sompolinsky, H., and Barkai, N. 1993. Theory of learning from examples. Proc. IJCNN 93, Nagoya, Japan. Tutorial Volume, pp. 221-240.

Vogels, R., Spileers, W., and Orban, G. A. 1989. The response variability of striate cortical neurons in the behaving monkey. Exp. Brain Res. 77, 432-436.

Walk, R. D. 1978. Perceptual learning. In Handbook of Perception, E. C. Carterette and M. P. Friedman, eds., Vol. IX, pp. 257-298. Academic Press, New York.

Weiss, Y., Edelman, S., and Fahle, M. 1993. Models of perceptual learning in vernier hyperacuity. Neural Comput. 5, 695-718.
Received March 6, 1995; accepted July 18, 1995
This article has been cited by: 2. Misha Tsodyks, Charles Gilbert. 2004. Neural networks and perceptual learning. Nature 431:7010, 775-781. [CrossRef] 3. Jason M. Gold, Allison B. Sekuler, Partrick J. Bennett. 2004. Characterizing perceptual learning with external noise. Cognitive Science 28:2, 167-207. [CrossRef] 4. Laurent Itti, Christof Koch, Jochen Braun. 2000. Revisiting spatial vision: toward a unifying model. Journal of the Optical Society of America A 17:11, 1899. [CrossRef] 5. Alexandre Pouget , Kechen Zhang , Sophie Deneve , Peter E. Latham . 1998. Statistically Efficient Estimation Using Population CodingStatistically Efficient Estimation Using Population Coding. Neural Computation 10:2, 373-401. [Abstract] [PDF] [PDF Plus]
Communicated by Sidney Lehky
Directional Filling-in

Karl Frederick Arrington
The filling-in theory of brightness perception has gained much attention recently owing to the success of vision models. However, the theory and its instantiations have suffered from incorrectly dealing with transitive brightness relations. This paper describes an advance in the filling-in theory that overcomes the problem. The advance is incorporated into the BCS/FCS neural network model, which allows it, for the first time, to account for all of Arend's test stimuli for assessing brightness perception models. The theory also suggests a new teleology for parallel ON- and OFF-channels.
1 Introduction
Light intensity reflected from a surface changes dramatically with change in illumination, but the ratio of intensities (contrast) reflected from adjacent locations remains essentially constant. The visual system extracts the contrast ratio from the distribution of light hitting the retina by local differencing mechanisms of two types: on-center/off-surround detectors that respond maximally to a light spot surrounded by a dark annulus, and off-center/on-surround detectors that respond maximally to a dark spot surrounded by a lighter annulus. These two distinct populations appear at retinal ganglion cells that project to the visual cortex. Given that the information sent from the retina to the brain is primarily about local luminance and color contrasts rather than about extended areas, why do we experience object surfaces, rather than mere edges? One explanation is that information from the edges "flows" across the areas that correspond to uniform surfaces, filling them in with features such as color and brightness. Numerous examples of filling-in phenomena appear in the clinical literature: from retinal scotomas (Gerrits and Timmerman 1969), and from experimental work using stabilized images (Krauskopf 1963; Gerrits et al. 1966; Yarbus 1967). There is also a growing literature on the filling-in of texture information from human psychophysics (Ramachandran and Gregory 1991; Ramachandran et al. 1992).
Neural Computation 8, 300-318 (1996)
© 1996 Massachusetts Institute of Technology
2 Models
Gerrits and Vendrik (1970) developed a qualitative model of the filling-in phenomenon by specifying a filling-in process that works in parallel with a filling-in barrier mechanism. According to their filling-in theory, the ON- and the OFF-responses, which peak on opposite sides of a contour edge, fill in over areas that correspond to uniform regions of the stimulus. Mixing of the antagonistic activities is prevented by a boundary, or barrier, that is created at the locations where contrast is high (edges). Grossberg and his colleagues (Grossberg 1983; Cohen and Grossberg 1984; Grossberg and Todorović 1988) have mathematically specified a neural network model of filling-in called the boundary contour system/feature contour system (BCS/FCS) model that instantiates the filling-in theory of Gerrits and Vendrik (1970). The BCS and FCS systems work in parallel: the FCS discounts variable illumination and the BCS generates an emergent boundary segmentation of a scene. The signals from these two systems interact to create visible percepts by filling in surface features within segmentation boundaries. Studies of the temporal dynamics of the BCS/FCS model under visual masking conditions (Arrington 1994a) strongly support the Stoper and Mansfield (1978) conjecture that area-suppression masking is mediated by a sluggish high-level filling-in system that follows the fast low-level system mediating contour-suppression masking. Arend (1983) presented a set of test stimuli for assessing brightness perception models. Though the BCS/FCS model has proven successful in predicting the brightness percepts of a variety of stimulus distributions, its ability to account for Arend's complete set has never been demonstrated. This failure occurs because the theory and the model have never dealt adequately with transitive luminance stimuli, in other words, stimuli that have successive contrasts in the same direction: for example, a staircase of luminance steps, as in Figure 1.
This paper describes an advance to the filling-in theory, called directional filling-in (DFI). According to DFI, local contrasts build upon the foundation of surrounding brightness levels, rather than being isolated by surrounding boundary signals. The DFI theory is implemented by modifying the BCS/FCS model to include a directional filling-in gate (DFIG) that is explained later. Model performance is evaluated using a variety of stimuli. For each stimulus, the brightness predictions from the Grossberg and Todorović (1988) version of the BCS/FCS model (henceforth referred to as the GT88 model) and from the BCS/FCS with directional filling-in gates (henceforth referred to as the DFIG model) are compared. The GT88 model was chosen for comparison because it is arguably the best known example of traditional filling-in theory and because it has served as a common starting point for a number of derivative models (for example, see Neumann (1993), discussed later). It will be shown that the DFIG model accounts for all stimuli in Arend's set for evaluating brightness models, including those with transitive relations, such as multiple cusps and steps, and cusp-separated pedestals, for which the GT88 model does not account, while retaining the ability to account for the Tolhurst effect (Tolhurst 1972).

Figure 1: Illustration of transitive luminance steps. This pyramid stimulus illustrates the type of successive increments (or decrements) in luminance that have been problematic for filling-in theory.

3 The Theory of Directional Filling-in
A schematic comparison of the traditional Gerrits-Vendrik-Cohen-Grossberg filling-in theory to the DFI theory is shown in Figure 2. Notice that the FCS response depends only on local stimulus contrasts. Consequently, brightness predictions, which are manifest as filled-in activities, depend only on the brightness (darkness) signals contained within a region that is partitioned by the associated boundary. This brightness prediction scheme effectively isolates the input contrast responses in one part of the visual field from those in another part of the field, forming "watertight," noninteracting compartments. Successive luminance steps will tend to appear the same brightness because each isolated brightness and darkness response is identical (see Fig. 2). To overcome the isolation, DFI specifies that each local brightness step builds upon the foundation of previous brightness levels. In a complementary fashion, the darkness also builds upon itself in the activation levels of the OFF-channel filling-in layer. This is accomplished by injecting the lower brightness (darkness) levels up into areas of greater brightness (darkness). Figure 2b illustrates how these signals build upon one another in parallel in the ON- and OFF-filling-in layers. One possible neural implementation of the DFI theory is illustrated in Figure 3. In this model, the upward flow is instantiated by lateral synaptic connections that are facilitated by simple cells sensitive to the appropriate direction of contrast. Figure 3 (top) shows a luminance stimulus contrast; just below are the FCS ON- and OFF-responses to local contrast. Next are the simple cells' responses to oriented contrasts. Opposite direction-of-contrast simple cells with the same orientation are added to create a complex cell response that is the BCS boundary to filling-in. In the GT88 model (and presumably in the Gerrits and Vendrik theory, though this is never specified), the boundaries to filling-in are insensitive to direction-of-contrast information, whereas in DFI, the direction-of-contrast information is retained and utilized. Cells S_i are filling-in layer cells, whose lateral connections allow diffusive filling-in of feature information. This diffusive filling-in is restricted by boundary-modulated gates, G. Given the stimulus pattern at the top of the figure, a high resistance gate signal (black vertical bar) would form between filling-in layer cells S_i and S_{i+1}. Since the stimulus is a step up to the right, the directional gates, U (indicated by white terminal buttons), are active, which facilitates ON-channel activation of cell S_{i+1} by cell S_i via the rightward directed axon projection (white arrow), and OFF-channel activation of cell S_i by cell S_{i+1} via the leftward directed axon projection (black arrow).
This type of system can function properly only if both ON- and OFF-responses operate in the same manner, and suggests an important new role for this parallel system, which is discussed further in Section 5.1. The DFIG mechanism complements the diffusion mechanism that is restricted at the boundaries. It is of interest just how beneficial the addition of this local DFIG mechanism can be to generating accurate global brightness predictions. 4 Comparison of Model Equations
First the GT88 model is fully explained, then the DFIG augmentation is elaborated. Both neural network models consist of a series of feedforward layers beginning with an input layer, where the luminance stimuli are presented, followed by a series of neural processing layers. The last neural layer, O, is the output of the model, which shows the predicted brightness perception. The activation of the neurons is determined by differential equations, as described below. The models do not differ substantially in the FCS layer (Section 4.1) or the BCS layer
[Figure 2 panels: (a) Isolated Filling-in and (b) Directional Filling-in, each showing rows for Input, Boundaries, ON filling-in, OFF filling-in, and Brightness Prediction (incorrect vs. correct).]
Figure 2: Schematic comparison of traditional filling-in theory (a) to directional filling-in (DFI) theory (b) using transitive luminance steps. The input (top row) is the same to both models, as are the ON- and OFF-responses and the boundary responses. The main difference appears at the filling-in stage. In the traditional theory, contrast information is partitioned by the boundary signals, so equal ON- and OFF-signals will cancel to produce a net eigengrau brightness percept. In the DFI theory, activity is injected across boundary partitions in the direction of increase. That is, where brightness increases, brightness signals are injected across boundaries to form a brightness floor in the next region, against which the next brightness signal can deflect. In a complementary fashion, successive darkness signals build upon one another. The result is that DFI produces a more accurate brightness prediction.
Figure 3: A physiologically plausible neural model using directional filling-in gates. Cells S_i are filling-in layer cells; lateral connections allow diffusive filling-in of feature information. This diffusive filling-in is restricted by boundary-modulated gates, G. Given the stimulus pattern at the top of the figure, a high resistance gate signal would form between filling-in layer cells S_i and S_{i+1}. Since the stimulus is a step up to the right, the directional filling-in gates, U (indicated by white terminal buttons), allow brightness activation to flow rightward in the ON-filling-in layer and darkness activation to flow leftward in the OFF-filling-in layer.
(Section 4.2); compare Figure 2a and Figure 2b. As far as possible, the DFIG equations have been kept identical to those for the GT88 model. The important difference between the GT88 model and the DFIG model appears at the filling-in layer (Sections 4.3 and 4.4).

4.1 Feature Contour System. The FCS specifies how a light stimulus to the eye is sampled by ON- and OFF-channel retinal ganglion cells through a center/surround receptive field anatomy. The network equations used to model the retinal ganglion cells, x_ij^(c), are shown in equations 4.1 and 4.2. The superscript (c) indicates the channel, that is, whether it is an ON-center cell or an OFF-center cell. Parameters P_x, D_x, and H_x are the passive decay rate, depolarization limit, and hyperpolarization limit of the neuron, respectively. For the on-center/off-surround cells the variables I_ij^(center) and I_ij^(surround) are the total excitatory and total inhibitory inputs to the neuron, respectively, such that

d/dt x_ij^(ON) = -P_x x_ij^(ON) + (D_x - x_ij^(ON)) I_ij^(center) - (H_x + x_ij^(ON)) I_ij^(surround)   (4.1)

These are reversed for the off-center/on-surround cell:

d/dt x_ij^(OFF) = -P_x x_ij^(OFF) + (D_x - x_ij^(OFF)) I_ij^(surround) - (H_x + x_ij^(OFF)) I_ij^(center)   (4.2)

These total inputs are specified as

I_ij^(c) = W_c Σ_pq Θ_pqij^(c) I_pq   (4.3)

where W_c is the weighting coefficient and

Θ_pqij^(c) = exp{-λ_c^(-2) (log 2) [(p - i)^2 + (q - j)^2]}   (4.4)

is the gaussian distribution that specifies the center and surround receptive fields, which compose the difference-of-gaussians (DOG) receptive field. Parameter λ_c in equation 4.4 specifies the spatial bandwidth of the gaussian receptive field. The equilibrium response of equation 4.1 is

x_ij^(ON) = [D_x I_ij^(c) - H_x I_ij^(s)] / [P_x + I_ij^(c) + I_ij^(s)]   (4.5)

where c indicates center and s indicates surround. The FCS output is the half-wave rectified cell potential

X_ij^(c) = max(x_ij^(c), 0)   (4.6)

That is, the cell fires at a rate proportional to the depolarization level, but is silent when the cell is hyperpolarized.
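The FCS front end of equations 4.3-4.6 can be sketched in a few lines. The following Python is an illustrative reconstruction, not the paper's code: the one-dimensional kernels, their radius, and the choice D = H (so that uniform fields give zero response, in the spirit of the balanced normalization (D W_c - H W_s) = 0 described in Section 4.5) are all assumptions made for the sketch.

```python
import numpy as np

def gaussian_kernel(lam, radius=20):
    """Eq. 4.4 in one dimension, normalized so the kernel sums to one."""
    d = np.arange(-radius, radius + 1)
    k = np.exp(-(lam ** -2.0) * np.log(2.0) * d ** 2)
    return k / k.sum()

def smooth(I, kernel):
    """Convolve with edge padding to avoid spurious responses at the borders."""
    r = len(kernel) // 2
    return np.convolve(np.pad(I, r, mode="edge"), kernel, mode="valid")

def fcs_response(I, lam_c=1.0, lam_s=8.0, P=1.0, D=1.0, H=1.0):
    """Equilibrium shunting response (eq. 4.5) plus half-wave rectification (eq. 4.6)."""
    center = smooth(I, gaussian_kernel(lam_c))
    surround = smooth(I, gaussian_kernel(lam_s))
    x = (D * center - H * surround) / (P + center + surround)
    # ON output is the rectified depolarization; the OFF output is taken here as
    # the rectified hyperpolarization (the single-cell shortcut discussed in Sec. 4.3)
    return np.maximum(x, 0.0), np.maximum(-x, 0.0)

stimulus = np.concatenate([np.full(30, 10.0), np.full(30, 20.0)])  # one luminance step
X_on, X_off = fcs_response(stimulus)
# X_on peaks just to the bright side of the step, X_off just to the dark side,
# and both vanish over the uniform interiors
```

As the paper emphasizes, the responses are confined to the immediate neighborhood of the contrast; the uniform interiors produce no FCS signal at all, which is why a filling-in stage is needed downstream.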
4.2 The Boundary Contour System. The BCS generates an emergent boundary segmentation of the scene. First cortical simple cells detect colinear contrasts, then complex cells combine the same-orientation but opposite direction-of-contrast responses from the simple cells. At each spatial location, activations from complex cells of all orientations are combined to form a total boundary signal that is passed through a compressive nonlinear (sigmoid) transfer function to produce the final boundary signal. The first stage of the boundary system, equation 4.7, specifies a model "simple cell" that responds to oriented activations across the field of X_ij. The activities y_ijk of the oriented contrast-sensitive cells centered at location (i, j) with orientation k obey an additive equation. Equation 4.8 specifies the oriented receptive field of orientation k, which is created using a difference-of-offset-gaussians (DOOG), as follows. The gaussian kernel that forms the negative part of the oriented contrast detector is spatially offset from the location of the detecting cell by the vector (-m_k, -n_k), where

m_k = sin(2πk/K)   (4.10)

and

n_k = cos(2πk/K)   (4.11)

and where K is the total number of differently oriented contrasts. The potentials from the set of orientation-sensitive cells, y_ijk, are half-wave rectified to obtain

Y_ijk = max(y_ijk, 0)   (4.12)

The rectified potentials of the two "simple cells" with the same orientation, but with opposite directions of contrast, are linearly combined to form a "complex cell," b_ijk, that is sensitive to orientation, but insensitive to direction of contrast,

b_ijk = Y_ijk + Y_ij[k+(K/2)]   (4.13)
The output from these cells is threshold rectified,

B_ijk = max(b_ijk - L, 0)   (4.14)

where parameter L specifies how much contrast is required before a boundary signal is created. A total BCS signal, B_ij (without subscript k), is created by summing the responses of all the oriented boundary signals, B_ijk (with subscript k), at location (i, j):

B_ij = Σ_k B_ijk   (4.15)
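This boundary pipeline (rectified oriented differences, opposite-polarity pairing, thresholding, and summation over orientations, equations 4.12-4.15) can be illustrated with a deliberately simplified sketch: the DOOG kernels are reduced to single-pixel offset differences and K = 4 orientations are used, so this is an illustration of the pipeline's structure, not the paper's implementation.

```python
import numpy as np

def bcs_boundaries(X, K=4, L=0.1):
    """Simple cells as offset differences, rectify (4.12), pair opposite
    polarities into complex cells (4.13), threshold (4.14), sum over k (4.15)."""
    y = np.zeros((K,) + X.shape)
    for k in range(K):
        m = int(round(np.sin(2 * np.pi * k / K)))   # eq. 4.10
        n = int(round(np.cos(2 * np.pi * k / K)))   # eq. 4.11
        # excitatory sample at the cell minus an inhibitory sample offset by (-m, -n);
        # np.roll wraps at the image borders, creating artifacts only at the edges
        y[k] = X - np.roll(X, shift=(m, n), axis=(0, 1))
    Y = np.maximum(y, 0.0)                           # eq. 4.12
    b = Y[:K // 2] + Y[K // 2:]                      # eq. 4.13: opposite polarities
    Bk = np.maximum(b - L, 0.0)                      # eq. 4.14
    return Bk.sum(axis=0)                            # eq. 4.15

img = np.zeros((8, 8))
img[:, 4:] = 1.0                                     # vertical luminance edge
B = bcs_boundaries(img)
# boundary activity concentrates along the edge column and is zero over the
# uniform interior
```

Because both directions of contrast feed the same complex cell, the resulting B field marks where an edge is, but not which side is brighter; that direction-of-contrast information is exactly what the DFIG model later retains.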
The final BCS signal is insensitive not only to direction of contrast, but also to contrast orientation. In some simulations published in Grossberg and Todorović (1988) the boundary signal was transformed,

B̃_ijk = S(B_ijk)   (4.16)

through the compressive nonlinearity

S(x) = k_1 x^2 / (k_2 + x^2)   (4.17)
(Grossberg and Todorović 1988, pp. 262, 277). This type of transfer function is used to render the boundary signal more uniform in size and to more effectively reduce the diffusion between boundary compartments. All GT88 simulations here use B̃_ijk to extend the useful contrast range.

4.3 The GT88 Model Filling-in Process. Finally, the FCS and BCS provide parallel input to the filling-in layer. The ON- and OFF-activations flow into each other and cancel except where flow is impeded by high resistance boundary signals. The FCS signals, X_ij^(ON) and X_ij^(OFF), which are active only at locations immediately adjacent to stimulus contrasts, are fed into the filling-in layer, where the activity freely spreads across neighbor cells. This diffusive filling-in process is impeded by high resistance gating signals, G_pqij, between locations (i, j) and (p, q) that are activated by BCS boundaries. The equation for the filling-in layer potential, S_ij^(c), is

d/dt S_ij^(c) = -P_S S_ij^(c) + X_ij^(c) + F_ij^(c)   (4.18)

where P_S is the passive decay rate constant, the X_ij^(c) term is the direct input from the FCS, and the F_ij^(c) term is the lateral diffusion (filling-in) term

F_ij^(c) = Σ_{(p,q)∈N_ij} (S_pq^(c) - S_ij^(c)) G_pqij   (4.19)

The term (S_pq - S_ij) in equation 4.19 is a discrete approximation to the Laplacian diffusion operator, since the set N_ij of locations comprises only
the lattice of nearest neighbors of (i, j):

N_ij = {(i, j-1), (i-1, j), (i+1, j), (i, j+1)}   (4.20)

The diffusion gating coefficients, G_pqij, that regulate the lateral spread of activation are the same for each channel (c) and depend on the spatially adjacent BCS signals, B_pq and B_ij, as follows:

G_pqij = δ / (1 + ε(B_pq + B_ij))   (4.21)
Parameter δ controls the rate of diffusion. A large value will allow rapid diffusion from the edges across uniform areas, which results in a smoother appearance. A smaller value will cause input activation to accumulate where it is input, near the boundaries. Parameter ε controls the diffusion across boundaries. A large value allows little diffusion across boundaries.

The GT88 model used only X_ij^(ON). By making the excitatory center more heavily weighted than the surround (i.e., unbalanced), the x_ij^(ON) contained a large positive dc response; that is, the response to uniform areas was well above zero. Consequently, there was little rectification (equation 4.6) of the hyperpolarization associated with the dark side of a contrast. This allowed the hyperpolarizations to be used instead of OFF-responses. As long as the input contrasts are not too large, this corresponds adequately and avoids computing parallel ON- and OFF-signals in subsequent stages, as well as the need to recombine them. For conformity with previously published work, as well as for fair comparison of models, the GT88 simulations use the original model. The DFIG model performs better with a balanced DOG; nevertheless, by the same argument, the DFIG simulations use

X_ij^(OFF) = max(-x_ij^(ON), 0)   (4.22)

Grossberg (1987b) makes it clear that the BCS/FCS theory calls for parallel ON- and OFF-filling-in channels. In general, the brightness percept output of the model, O_ij, is assumed to be additive such that

O_ij = S_ij^(ON) - S_ij^(OFF)   (4.23)

Nevertheless, the GT88 model simulations used only S^(ON). As will be seen next, the DFIG model requires that the ON- and OFF-filling-in channels be calculated separately.

4.4 Directional Filling-in Gate Equations. The DFIG model is theoretically equivalent to the GT88 model, except for the addition of the directional filling term, J_ij^(c), in the filling-in equation, which now becomes

d/dt S_ij^(c) = -P_S S_ij^(c) + X_ij^(c) + F_ij^(c) + J_ij^(c)   (4.24)

where

J_ij^(c) = Σ_{(p,q)∈N_ij} S_pq U_pqij^(c)   (4.25)

The directional gate U_pqij^(c), used in term J_ij^(c), depends on the channel (c), whereas the gate G_pqij, used in term F_ij^(c) (see equation 4.19), does not. The directional gate that allows activation to flow upward to areas of increasing brightness or increasing darkness is modeled as

U_pqij^(c) = h(B_pq B_ij - θ_B) g(X_ij^(c) - X_pq^(c) - θ_X)   (4.26)

To fix ideas, equation 4.26 says: activate directional flow where there exists a sufficient boundary, h(B_pq B_ij - θ_B), and where the feature signal shows an increase of activation (brightness or darkness), g(X_ij^(c) - X_pq^(c) - θ_X). The directional filling-in term, S_pq U_pqij^(c), injects activation proportional to the lower side, S_pq, according to some function, g, of the feature signal magnitude. Note that in the case where g(x) is always zero, equation 4.24 reduces to equation 4.18. In the simulations presented here, the simplest directional function is chosen,

g(x) = k_f if x > 0, and g(x) = 0 otherwise   (4.27)
and the boundary function, h, is the unit step function at zero. The boundary gates and the directional gates are complementary. Where there is no border present, G_pqij is significant and large, which allows passive diffusion through term F_ij^(c), but U_pqij^(c) is insignificant; on the other hand, where a border is present, the directional gate, U_pqij^(c), is significant and allows flow through term J_ij^(c), but G_pqij is insignificant and resistance is high.

4.5 Simulation Methods. The DFIG simulations presented here are designed to evaluate the DFI theory and to build intuition about it. To facilitate these goals, the simulations employ an ideal implementation of the DFIG projection field that is a single filling-in cell wide. (In general this would not be the case, as is elaborated in the discussion.) Consequently, the simulations require very sharp and precise boundary signals. Specifically, the additive difference-of-offset-gaussians (DOOG) equation in the DFIG simulations is
(4.28)

where each kernel, Γ, is {1, 1}. This signal is then thresholded,

B̂_ijk = max(b_ijk - L, 0)   (4.29)

as in equation 4.14, and transformed as in equation 4.16 to become

B̃_ijk = S(B̂_ijk)   (4.30)
A sharper diffusion gate, G, between two filling-in cells is obtained by multiplying, rather than adding, the adjacent boundary cells, such that

G_pqij = δ / (1 + ε B_pq B_ij)   (4.31)

The expanded one-dimensional DFIG equation is

d/dt S_i = -P_S S_i + X_i + (S_{i-1} - S_i) G_{i,i-1} + (S_{i+1} - S_i) G_{i,i+1} + S_{i-1} U_{i,i-1} + S_{i+1} U_{i,i+1}   (4.32)

The steady-state solutions can be found by solving the linear system

M S = X   (4.33)

where M is a banded system matrix whose rows, in the one-dimensional case, are of the form

-S_{i-1} (G_{i,i-1} + U_{i,i-1}) + S_i (P_S + G_{i,i-1} + G_{i,i+1}) - S_{i+1} (G_{i,i+1} + U_{i,i+1}) = X_i   (4.34)
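To build intuition for the steady-state system M S = X, here is a hypothetical one-dimensional sketch. The gate values, input placement, and the two-step stimulus are invented for illustration (they are not the paper's parameters), and only the rightward, ON-channel directional gates are modeled; the matrix rows, however, follow the steady-state form of the filling-in equation.

```python
import numpy as np

def solve_filling_in(X, G, U, P=1.0):
    """Build a banded matrix M (rows as in eq. 4.34) and solve M S = X.
    G[i] gates diffusion between cells i and i+1; U[i] is a directional gate
    injecting cell i's activity into cell i+1 (rightward, ON-channel case)."""
    n = len(X)
    M = np.zeros((n, n))
    for i in range(n):
        M[i, i] = P
        if i > 0:
            M[i, i] += G[i - 1]
            M[i, i - 1] = -(G[i - 1] + U[i - 1])  # left neighbor diffuses and injects
        if i < n - 1:
            M[i, i] += G[i]
            M[i, i + 1] = -G[i]                   # right neighbor diffuses only
    return np.linalg.solve(M, X)

n = 30
X = np.zeros(n)
X[10] = X[20] = 1.0                # ON contrast inputs at two same-direction steps
G = np.full(n - 1, 500.0)
G[[9, 19]] = 0.0                   # diffusion blocked at the two boundaries
U_none = np.zeros(n - 1)           # GT88: no directional gates
U_dfig = np.zeros(n - 1)
U_dfig[[9, 19]] = 1.0              # DFIG: inject the lower side across each boundary

S_gt88 = solve_filling_in(X, G, U_none)
S_dfig = solve_filling_in(X, G, U_dfig)
# GT88 gives the two upper regions identical levels (the staircase flattens);
# DFIG raises the third region above the second, preserving the staircase
```

The contrast between the two solutions is exactly the transitivity failure discussed in Section 3: with isolated compartments the second and third steps fill in to the same level, while the directional injection term carries the second region's level forward as a floor for the third.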
The DFIG kernels were numerically normalized (such that the sum of the kernel elements equals unity) and then multiplied by 100. The remaining parameters in the equation were chosen such that spatially uniform stimuli yield a zero response; specifically, (D W_c - H W_s) = 0. Normalization assures a completely balanced center and surround, which helps guarantee symmetric ON- and OFF-responses. In all simulations, the spatial bandwidths of the excitatory and inhibitory kernels of the receptive fields are the same in both the GT88 model and the DFIG model. To help isolate and illustrate the key concept of the DFI theory, the filling-in layer diffusion parameter, δ, was increased so as to produce more uniform brightness levels within bounded areas. This also helps reduce perceptual illusions from the line levels in the simulation results. All parameters are listed in the Appendix; they were the same in all simulations in this paper. 5 Comparison of Brightness Prediction
Brightness predictions of the GT88 model and of the DFIG model are compared using a variety of luminance stimuli. The one-dimensional stimuli should be understood as profiles cut through the two-dimensional brightness displays that historically have been created by rotating disks (Cornsweet 1970). Both the theory and the model can easily be extended to two dimensions.
5.1 Parallel DFI Channels: The Pyramid. The staircase pyramid luminance stimulus, shown in Figure 4a, is particularly useful in illustrating the behavior and the power of DFI. The luminance steps consist of equal ratio increments, so the contrast for each step is the same and the FCS response to each step is identical, as shown in Figure 4b. The BCS signal is shown in Figure 4c. Traditional isolated filling-in occurs everywhere except at boundary locations, as indicated by the diffusion gate activity shown in Figure 4d. The rightward and leftward DFIG activities for the ON-channel are shown in Figure 4e and Figure 4f, respectively; the OFF-channel DFIG activities are the same except that their locations correspond to locations of increasing darkness. Notice the systematic compression in the ON- and OFF-filling-in layers, which occurs with a succession of steps in the same direction. Each brightness increase in the ON-channel, Figure 4g, is increasingly small (signal compression), but the same effect is occurring in the OFF-channel, Figure 4h, where each increase in darkness is successively smaller. When these two nonlinear channels are additively combined, the systematic biases tend to cancel! This suggests a new theoretical reason for the existence (teleology) of parallel ON- and OFF-channels. The result is a more accurate, linear brightness prediction by the DFIG model, shown in Figure 4j. The GT88 model predicts only flat brightness steps, Figure 4i.
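The cancellation argument can be checked with a toy calculation that is independent of the model itself: two compressive staircases, one building brightness from the left and one building darkness from the right, are combined additively. The compression factor and step count below are illustrative choices, not values from the paper.

```python
import numpy as np

def compressive_staircase(steps, compress):
    """Each successive increment is `compress` times the previous one."""
    levels, inc = [0.0], 1.0
    for _ in range(steps):
        levels.append(levels[-1] + inc)
        inc *= compress
    return np.array(levels)

on = compressive_staircase(4, 0.8)         # ON channel: brightness builds up
off = compressive_staircase(4, 0.8)[::-1]  # OFF channel: darkness builds from the far end
percept = on - off                         # additive combination of the two channels

on_steps = np.diff(on)                     # 1, 0.8, 0.64, 0.512: strongly compressive
percept_steps = np.diff(percept)
# the combined percept's step sizes are nearly equal: the compressive biases of
# the two channels largely cancel, as argued above
```

The residual nonlinearity does not vanish exactly in this toy version, but the spread of the combined step sizes is an order of magnitude smaller than that of either channel alone, which is the sense in which the parallel channels linearize the staircase percept.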
5.2 A Battery of Test Stimuli. Arend (1983) presents a set of luminance stimuli developed by O'Brien (1958) and Cornsweet as a test set for assessing brightness perception models. These stimuli are contained in Figure 5 together with the Tolhurst stimulus, Figure 5g, and a bull's eye stimulus, Figure 5l. The luminance stimuli and the associated human psychophysical brightness percepts are shown in the first two columns; the last two columns show the brightness predictions of the GT88 and DFIG models, respectively. Each of the Arend (1983) brightness predictions in the DFIG set and in the GT88 set was scaled as a group so the abscissa and ordinate values are the same for each. The top eight rows, Figure 5a-h, show correct brightness predictions by both models. The last four rows, Figure 5i-l, show cases where only the DFIG model makes acceptable brightness predictions. Figure 5l shows a saw-tooth stimulus that produces a bull's eye brightness percept (Arend et al. 1971). Here, the gain of the GT88 brightness prediction has been amplified to illustrate an interesting brightness inversion that can occur in the center ring. This inversion was considered a success by Grossberg and Todorović, who argued that when using small patterns on large backgrounds, their informal psychophysical observations were in the same direction as the simulation results. However, Arend could not find such effects when he used larger stimuli.¹

¹See Grossberg and Todorović (1988), pp. 261-262, for a discussion of this.
[Figure 4 panels: (a) Stimulus; (b) FCS signal; (c) BCS signal; (d) Diffusion; (e) Right ON-DFIG; (f) Left ON-DFIG; (g) ON Fill; (h) OFF Fill; (i) GT88-model; (j) DFIG-model.]
Figure 4: Experiment using a staircase luminance stimulus (pyramid). The stimulus consists of equal ratio luminance increments. The rightward and leftward DFIG activities are shown for the ON-channel. Notice how the compressive nonlinear activations in the ON- and OFF-filling-in layers cancel to produce a linear brightness increase percept in the DFIG model.

6 Discussion
The results show that DFI provides accurate perceptual brightness predictions for a wide variety of luminance stimuli, particularly where transitive luminance distributions exist. Moreover, this is accomplished at a single spatial scale! The model requires, and thus provides rationale for, separate ON- and OFF-channels. Next, the distinction between the DFI theory and possible DFIG model implementations is elaborated. Finally, a comparison between DFI and alternative approaches to the problem of brightness prediction is discussed. 6.1 Implementation Issues. This study was designed to test the theory of DFI. To facilitate a focused study of the DFIG behavior across transitive luminance distributions, the current implementation used DFIG receptive and projective fields limited to a single cell directly adjacent
Figure 5: A battery of test stimuli. Various luminance stimuli and the associated human psychophysical brightness percepts are shown in the first two columns. The last two columns show the brightness predictions of the GT88 model and the DFIG model, respectively. The DFIG model performs well for all stimuli, whereas the GT88 model performs well only for (a) through (h).
to the boundary. In general, the DFIG projection field should coincide with the spatial scale of the FCS, which would allow thick boundaries and slower diffusion to be reinstated, which in turn would allow the DFI model to count the Mach band effects and the Chevreul illusion effects as successes, just as the GT88 model did. The DFI theory is not wedded to the particular mechanism described here. The DFIG is only one of a number of local mechanisms that can affect global brightness perception. One alternative mechanism could employ facilitation of the FCS cells that project to the filling-in layer, such that the gain of the FCS signal is proportional to the activation of
the adjacent filling-in region. By using the same FCS projection field, the DFI and FCS spatial scales are guaranteed to be the same.

6.2 Comparison with Alternative Approaches. Previous attempts to deal with the transitivity problem have involved symbolic rule-based systems such as MIRAGE (Watt and Morgan 1985) and MIDAAS (Kingdom and Moulden 1992). The DFIG model can in some ways be considered a neural implementation of the symbolic brightness rule for steps. There are several other possible solutions to the problem of transitivity. One solution may be to combine contrast information obtained at multiple spatial scales. It is clear that multiple spatial channels operate in parallel in the visual system (Wilson et al. 1990), and they have been qualitatively discussed by Grossberg (1987a). When Arend critiqued an early version of the BCS/FCS (Grossberg 1983) for its inability to handle all of the cases in the test set, Grossberg used a multiple spatial scale justification as a rejoinder; however, simulations have yet to appear. It should be noted that the Kingdom and Moulden (1992) model used multiple spatial scales, but symbolic brightness step rules were still required to handle the transitive cases. Another solution may be to allow some direct-intensity information, as well as the contrast-intensity information. Some researchers believe that the retinal ganglion cells may transmit at least a little direct-intensity information in addition to the contrast-intensity information that most strongly affects them. This type of absolute intensity information has been used in resistive grid models to solve a related problem: the Land (1986) algorithm operates under a gray-world assumption, so one obtains grayness from large uniform fields (e.g., the sky) and a sudden appearance of color when objects appear (e.g., a few birds fly over).
Within the framework of the retinex model, Moore and colleagues have employed a homomorphic-filter transfer function that is itself a function of local "edginess" in the stimulus, so that contrast information is used when it is available; otherwise direct information is used (Moore et al. 1991a,b). Direct intensity information has been incorporated into a variant of the BCS/FCS developed by Neumann (1993), who points out that the shunting equations (equations 4.1 and 4.2) allow for a scaled low-pass filter encoding of stimulus luminance distributions, as well as providing saturation levels for the DOG contrast response. Neumann argues that the ON- and OFF-pair provide multiplexed contrast (polarity) and luminance information. His model has demonstrated some successes with actual luminance steps; however, it does not appear that this or any direct intensity model can ever account for brightness illusions such as the brightness staircase perception from multiple cusps. It is still far from certain that sufficient direct intensity information is actually transmitted to the brain, and retinal stabilization experiments argue against its significance.
Karl Frederick Arrington
7 Conclusion
It has been demonstrated that the directional filling-in (DFI) extension to traditional filling-in theory provides more accurate predictions of perceptual brightness from a variety of luminance stimuli, particularly with the class of stimuli that has successive luminance steps or cusps in the same direction. The entire set of brightness experiments described by Arend et al. (1971) for assessing brightness models is simulated here for the first time using the BCS/FCS model with the DFIG augmentation; moreover, it is accomplished within a single spatial scale. Finally, the DFI theory provides a new teleology for the parallel ON- and OFF-channels. DFI is of course not limited to brightness perception; it should apply equally well to any feature that is perceived to fill in, including color, depth, and texture.

8 Appendix: Parameters
The parameters in parentheses refer to the equations in Grossberg and Todorović (1988). The GT88 parameters used in all simulations shown in this paper are (A) P1 = 1; (B) D1 = 90; (C) C = 4; (D) H1 = 100,000; (E) E = 0.5; (L) L = 100; (α) X(center) = 1; (β) X(surround) = 8; (M) Ps = 10; δ = 10; ε = 5; η = 1; k1 = 60; k2 = 1; θ = 5.
The DFIG parameters used in all simulations shown in this paper are (A) P1 = 0.1; (B) D1 = 2.5; (C) C = 1.0; (D) H1 = 1.0; (E) E = 2.5; (L) L = 0.001; (α) X(center) = 1; (β) X(surround) = 8; (M) Ps = 1; δ = 500,000; ε = 500,000; η = 1; k1 = 1; k2 = 0.0001; θ = 1; kf = 10; Ou,y = 0.0; HUB = 0.02. In Figure 5, the width of the input layer and the neural fields was 150 processing units (cells), except for Figure 5l, which used a width of 118. The stimulus in Figure 5l was constructed to illustrate an interesting brightness inversion that can occur in the GT88 model in certain input parameter ranges. Since the brightness percepts in Figures 5i and 5j are the same and represent half of a rotating disk, the percept for Figure 5l was obtained by flipping one and concatenating it to the other.

Acknowledgments

I would like to thank Stephen Grossberg, Michael Cohen, Ennio Mingolla, and Richard Held for their support and encouragement. The work described in this paper was supported by a grant from the Office of Naval
Research, ONR N00014-91-J-4100, while in the Cognitive and Neural Systems Department Ph.D. program at Boston University (Arrington 1993), and through a fellowship from the McDonnell-Pew Center for Cognitive Neuroscience at MIT (Arrington 1994b).
References

Arend, L. 1983. "Filling-in" between edges. Behav. Brain Sci. 6, 657-658.
Arend, L., Buehler, J. N., and Lockhead, G. R. 1971. Difference information in brightness perception. Percept. Psychophys. 9(3B), 367-370.
Arrington, K. F. 1993. Neural network model of color and brightness perception and binocular rivalry. Unpublished Ph.D. Thesis, Cognitive and Neural Systems Department, Boston University, Boston, MA.
Arrington, K. F. 1994a. The temporal dynamics of brightness filling-in. Vision Res. 34(24), 3371-3387.
Arrington, K. F. 1994b. Visual feature-flow using directional ionic-gates. ARVO Abstr. Invest. Ophthalmol. Visual Sci. 35(4, suppl.), 2005.
Cohen, M. A., and Grossberg, S. 1984. Neural dynamics of brightness perception: Features, boundaries, diffusion, and resonance. Percept. Psychophys. 36(5), 428-456.
Cornsweet, T. N. 1970. Visual Perception. Harcourt Brace Jovanovich, New York.
De Weerd, P., Gattas, R., Desimone, R., and Ungerleider, L. G. 1993. Center-surround interactions in areas V2/V3: A possible mechanism for filling-in? Neurosci. Abstr. 19.
Gerrits, H. J. M., and Timmerman, G. J. M. E. N. 1969. The filling-in process in patients with retinal scotomata. Vision Res. 9, 439-442.
Gerrits, H. J. M., and Vendrik, A. J. H. 1970. Simultaneous contrast, filling-in process and information processing in man's visual system. Exp. Brain Res. 11, 411-430.
Gerrits, H. J. M., de Haan, B., and Vendrik, A. J. H. 1966. Experiments with retinal stabilized images. Relations between the observations and neural data. Vision Res. 6, 427-440.
Grossberg, S. 1983. The quantized geometry of visual space: The coherent computation of depth, form, and lightness. Behav. Brain Sci. 6, 625-692.
Grossberg, S. 1987a. Cortical dynamics of three-dimensional form, color and brightness perception: I. Monocular theory. Percept. Psychophys. 41(2), 87-116. Pagination references are to the reprinted version in Grossberg, S. (ed.) 1988. Neural Networks and Natural Intelligence, Chap. 1, pp. 1-54. MIT Press, Cambridge, MA.
Grossberg, S. 1987b. Cortical dynamics of three-dimensional form, color and brightness perception: II. Binocular theory. Percept. Psychophys. 41(2), 117-158. Pagination references are to the reprinted version in Grossberg, S. (ed.) 1988. Neural Networks and Natural Intelligence, Chap. 2, pp. 55-126. MIT Press, Cambridge, MA.
Grossberg, S., and Todorović, D. 1988. Neural dynamics of 1-D and 2-D brightness perception: A unified model of classical and recent phenomena. Percept. Psychophys. 43, 241-277.
Kingdom, F., and Moulden, B. 1992. A multi-channel approach to brightness coding. Vision Res. 32(8), 1565-1582.
Krauskopf, J. 1963. Effect of retinal image stabilization on the appearance of heterochromatic targets. J. Opt. Soc. Am. 53(6), 741-744.
Land, E. H. 1986. Recent advances in retinex theory. Vision Res. 26(1), 7-21.
Moore, A., Allman, J., and Goodman, R. M. 1991a. A real-time neural system for color constancy. IEEE Trans. Neural Networks 2(2), 237-247.
Moore, A., Fox, G., Allman, J., and Goodman, R. 1991b. A VLSI neural network for color constancy. In Advances in Neural Information Processing Systems 3, D. S. Touretzky and R. Lippmann, eds., IEEE Conference Proceedings, Nov. 26-29, 1990. Morgan Kaufmann, San Mateo, CA.
Neumann, H. 1993. Toward a computational architecture for unified visual contrast and brightness perception: I. Theory and model. Proc. World Congress on Neural Networks (WCNN'93), July 11-15, (I) 84-91.
O'Brien, V. 1958. Contour perception, illusion and reality. J. Opt. Soc. Am. 48, 112-119.
Ramachandran, V. S., and Gregory, R. L. 1991. Perceptual filling in of artificially induced scotomas in human vision. Nature (London) 350, 699-702.
Ramachandran, V. S., Gregory, R. L., and Aiken, W. 1992. Perceptual fading of visual texture border. Vision Res. 33(5/6), 717-721.
Stoper, A. E., and Mansfield, J. G. 1978. Metacontrast and paracontrast suppression of a contourless area. Vision Res. 18, 1669-1674.
Tolhurst, D. J. 1972. On the possible existence of edge detector neurons in the human visual system. Vision Res. 12, 797-804.
Watt, R. J., and Morgan, M. J. 1985. A theory of the primitive spatial code in human vision. Vision Res. 25(11), 1661-1674.
Wilson, H. R., Levi, D., Maffei, L., Rovamo, J., and DeValois, R. 1990. The perception of form. In Visual Perception: The Neurophysiological Foundations, L. Spillmann and J. S. Werner, eds., Chap. 10, pp. 231-272. Academic Press, New York.
Yarbus, A. L. 1967. Eye Movements and Vision. Plenum Press, New York.
Received November 18, 1994; accepted July 18, 1995.
Communicated by Bard Ermentrout
Binary-Oscillator Networks: Bridging a Gap between Experimental and Abstract Modeling of Neural Networks

Wei-Ping Wang
Department of Mathematics, University of North Carolina, Chapel Hill, NC 27599 USA
This paper proposes a simplified oscillator model, called the binary-oscillator, and develops a class of neural network models having binary-oscillators as basic units. The binary-oscillator has a binary dynamic variable v = ±1 modeling the "membrane potential" of a neuron, and due to the presence of a "slow current" (as in a classical relaxation-oscillator) it can oscillate between two states. The purpose of the simplification is to enable abstract algorithmic study of the dynamics of oscillator networks. A binary-oscillator network is formally analogous to a system of stochastic binary spins (atomic magnets) in statistical mechanics.

1 Introduction

Some recent findings in the visual cortex suggest that synchronized oscillatory activity among populations of neurons may be responsible for global coding and local-feature linking of a visual scene (cf. Eckhorn et al. 1988; Gray et al. 1989). A more profound issue of compositionality raised by Bienenstock and Geman (1995) also suggests that synchronized oscillatory activity may provide a solution to problems like representing a scene containing a red triangle and a blue square. To gain insight into such synchronized oscillatory activity, it is natural to study the dynamics of large networks of mathematical oscillators. A mathematical oscillator is a dynamic system (of continuous time) that has a unique attractor, and this attractor is a periodic orbit. The most popular oscillator goes under the name of van der Pol (1926), of which various variations have been successfully used as mathematical models of a large class of neurons. In what follows, we shall refer to oscillators that can be defined by vector fields (differential equations) on a two-dimensional Euclidean space as "van der Pol-type oscillators." At the level of neural networks, the van der Pol-type oscillators have advantages and disadvantages.
If the network contains only a few neurons (a dozen or so, for instance), so that numerical integration of the corresponding differential equations takes a reasonable amount of time, then a van der Pol-type oscillator network can match the corresponding real biological system to a surprisingly high degree (cf. Rowat and Selverston

Neural Computation 8, 319-339 (1996)
© 1996 Massachusetts Institute of Technology
1993). Therefore, such an oscillator network can serve as a mathematical model of a real biological neural system. This is already a great simplification, compared to the complexity of the real biological situation (cf. Nicholls et al. 1992). The disadvantage of van der Pol-type oscillators is their inconvenience for abstract algorithmic study, especially for large networks. If we take an oscillator network to be a computational tool, then our main concern would be its general functional role rather than a detailed match to a real biological system. A number of investigators have proposed simplified models or simplifying methods under various circumstances, among which the most significant are Bélair and Holmes (1984) and Kopell (1988), which we summarize as follows: Bélair and Holmes (1984) studied a pair of coupled piece-wise linear van der Pol oscillators in the relaxation limit. Applying a general scheme developed by Smale (1972) and Takens (1976), they reduced the dynamics to a drift on a geometric constraint plus jump conditions at critical submanifolds. This model can be generalized to networks consisting of more than two oscillators. The advantage of this model is the piece-wise linear structure, which allows explicit analytic study. The disadvantage comes from the fact that the geometric constraints depend on the coupling weights (also external inputs, if added), so the more global constraints always destroy subconstraints. This is inconvenient in the study of large networks. Kopell (1988) proposed studying oscillator networks by phase equations, which can be obtained from an averaging scheme. When the basic units are van der Pol-type oscillators, this scheme reduces the system's dimension by half. This theory is appropriate for the study of weakly coupled, or weakly forced, oscillators.
However, if we are considering the effect of a large class of external inputs (with a large variety of complexity, and not necessarily very small in amplitude; e.g., a set of visual images) on the oscillator network, i.e., if the network is actively computing something in reaction to a large class of external inputs, then the averaging method is inadequate. In fact, in such a situation not only the phase relation is important, but also the wave form of each oscillator (cf. Wang 1994). The idea of the present paper is to bring together the useful dynamics of van der Pol-type oscillators and the simplicity of the binary units as used in Hopfield (1982) and Hinton and Sejnowski (1983, 1986; see also Hertz et al. 1991 for a more complete list of references). As a result, we propose the following "binary-oscillator":

v := sgn(v - q + λ)   (1.1)

dq/dt = -(q - σ_s v)   (1.2)

(see Section 2, Subsection 3.3, and Section 4), where -∞ < λ < ∞ and σ_s > 1 are parameters, and "sgn" is the sign function. The dynamic
variable v takes only two possible values, 1 and -1, and is meant to model the "membrane potential" of a neuron, whereas the variable q models a slow current. When λ = 0, we obtain the basic oscillator that is analogous to the van der Pol oscillator. There is a formal link between the binary-oscillator and the stochastic binary unit (an atomic magnet) in the Boltzmann machine neural network model (cf. Hinton and Sejnowski 1983, 1986; see also Hertz et al. 1991 for a more complete list of references): Let T = σ_s - 1, and let (v(t), q(t)) be an arbitrary solution of the system 1.1-1.2. Then after a transient time, we have

v(t) = -1 if λ ≤ -T;  v(t) = ±1 interchangeably if -T < λ < T;  v(t) = 1 if λ ≥ T   (1.3)
(see Subsection 3.3 and Section 4), which means that if λ ≤ -T, then the "membrane potential" v(t) will be observed at the hyperpolarizing state v = -1; if -T < λ < T, then v(t) will be observed taking the values 1 and -1 interchangeably in a periodic fashion; and if λ ≥ T, then v(t) will be observed at the depolarizing state v = 1. Only for -T < λ < T is the system 1.1-1.2 a real oscillator, and the oscillation for λ near -T (respectively T) distributes more time to the state v = -1 (respectively v = 1) than to v = 1 (respectively v = -1). If we formally identify λ with an external magnetic field, then the binary "membrane potential" v(t) is formally analogous to a stochastic binary spin, and T is formally analogous to the temperature. As T → 0, 1.3 turns into a sharp threshold relation between v and λ; and this is comparable with the zero-temperature limit of a stochastic binary spin. It is noteworthy that the binary-oscillator is different from the piece-wise linear van der Pol oscillator εẍ + φ(x)ẋ + x = 0, in which φ(x) = 1 if |x| > 1, and φ(x) = -1 if |x| < 1. This oscillator was studied by Levinson (1949), Levi (1981), and Bélair and Holmes (1984) under different circumstances. The binary-valued function φ(x) makes the differential equation piece-wise linear, but it is not a dynamic variable. The dynamic variables are x and ẋ, and the corresponding phase space is {(x, ẋ)} = R × R, the two-dimensional Euclidean space. None of the dynamic variables (even after any kind of coordinate transformation) in the piece-wise linear van der Pol oscillator is binary valued. In contrast, the binary-oscillator 1.1-1.2 has a binary dynamic variable v = ±1, and its phase space is {(v, q)} = {1, -1} × R, which is a one-dimensional manifold consisting of two copies of real lines. Several variations of the basic binary-oscillator are given in Section 3, to incorporate biological characteristics. General binary-oscillator networks are described in Section 4.
Let B(λ) denote the dynamic system 1.1-1.2, let (v_i, q_i) (i = 1, 2, ..., n) denote the physical state of the ith unit in a binary-oscillator network, and let B_i(λ) denote the system 1.1-1.2 with (v, q) = (v_i, q_i). Then the network activity observed at the ith unit is represented by

B_i(Σ_j w_ij v_j + e_i)   (1.4)

where w_ij is the coupling weight, and e_i is the external input. Equations 1.4 plus 1.3 make a binary-oscillator network formally analogous to a system of stochastic binary spins in statistical mechanics (of course, they are conceptually different). In a binary-oscillator network, the phase space {(v, q)} = {1, -1} × R for each unit is preserved by network couplings and external inputs. This avoids the disadvantage of the geometric constraints of Bélair and Holmes (1984) mentioned earlier. In Section 5, we illustrate the dynamics of binary-oscillator networks by a simple two-cell reciprocal-inhibitory model. The geometric method used in the analysis provides an intuition on how a general initial state settles into the course of out-of-phase oscillation. The results of this example show that the binary-oscillator network dynamics, at least in the case of a two-cell model, is consistent with the existing experimental and analytical results obtained from other oscillator models. A conceivable goal of the study of binary-oscillator networks is to model and analyze certain uniformity in the structure of the brain. Mumford (1991, 1992) provides a global view of such uniformity, and suggests that some simple general principles of organization must be at work. In a subsequent paper (Wang 1994) an attempt in this direction is made. The paper (Wang 1994) studies a network formed by a two-dimensional array of binary-oscillators, and presents a mathematical theory of hierarchical representation of perceived objects.

2 The Basic Binary-Oscillator
Our basic binary-oscillator is defined by the dynamic rule:
v := sgn(v - q)   (2.1)

dq/dt = -(q - σ_s v)   (2.2)
where the notation ":=" is adopted from Hertz et al. (1991) to mean updating, the real number σ_s > 1 (slow current gain) is a fixed parameter, and sgn(·) is the sign function:

sgn(x) = 1 if x ≥ 0, and sgn(x) = -1 if x < 0.
The phase space of 2.1-2.2 is {(v, q)} = {1, -1} × R, which is a one-dimensional manifold consisting of two copies of real lines. The binary variable v = ±1 models qualitatively the membrane potential of a neuron,
and q models a slow current. For each initial state (v⁰, q⁰), we first update the "membrane potential" v according to the updating rule 2.1 and then run the q dynamics according to 2.2 until v needs to be updated again. As an example, let us consider the case (v⁰, q⁰) = (1, 4.22). Since v⁰ - q⁰ = 1 - 4.22 < 0, the updated membrane potential is v = -1 according to 2.1. Now σ_s v = -σ_s, and q(t) - (-σ_s) = [q⁰ - (-σ_s)]e^{-t} according to 2.2. The q dynamics reduces the q value toward -σ_s. Since σ_s ≥ 2, the variable q = q(t) will meet the value -1 before it reaches -σ_s. At the time when q = -1, the membrane potential is again updated, and we have v = 1. From then on, the variable q will bounce back and forth periodically between -1 and 1. A global phase portrait is shown in Figure 1.

Figure 1: Phase portrait of the binary-oscillator dynamics 2.1-2.2.

Since the initial states (1, q⁰) with q⁰ ≥ 1 and (-1, q⁰) with q⁰ ≤ -1 are immediately updated to (-1, q⁰) and (1, q⁰), respectively, we can make the identification

(1, q⁰)_{q⁰ ≥ 1} = (-1, q⁰),   (-1, q⁰)_{q⁰ ≤ -1} = (1, q⁰)   (2.3)
This turns the phase portrait in Figure 1 into a portrait on a branched manifold, as shown in Figure 2. Apparently the disjoint union {(v, q) | v = 1, -1 ≤ q ≤ 1} ∪ {(v, q) | v = -1, -1 ≤ q ≤ 1} is a periodic attractor (in fact, the unique attractor) of the dynamics 2.1-2.2. There is a subtlety in the dynamics 2.1-2.2 when the initial data is (v⁰, q⁰) = (1, 1). Since v⁰ - q⁰ = 0 and sgn(0) = 1, the membrane potential v is not changed before the q dynamics is turned on. However, once q = q(t) starts moving toward σ_s, instantly we have v - q < 0, which forces v to be updated to -1. This is equivalent to the instantaneous updating
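The update-then-relax procedure above can be simulated exactly, because between flips 2.2 has the closed-form solution q(t) = σ_s v + (q⁰ - σ_s v)e^{-t}. The following event-driven sketch is our own illustration, not from the paper; the function name and bookkeeping are assumptions.

```python
import math

def simulate_oscillator(v0, q0, sigma=3.0, n_flips=6):
    """Event-driven simulation of the basic binary-oscillator
    v := sgn(v - q), dq/dt = -(q - sigma*v)   (eqs. 2.1-2.2).
    Returns (time, v, q) at each flip of the membrane potential."""
    # initial update of v by rule 2.1, using the sgn* convention of eq. 2.4
    v = -1 if (v0 == 1 and v0 - q0 <= 0) or (v0 == -1 and v0 - q0 < 0) else 1
    q, t, trace = float(q0), 0.0, []
    for _ in range(n_flips):
        # q relaxes toward sigma*v and hits the flip boundary q = v after
        # dt = ln((sigma*v - q) / (v*(sigma - 1))), from the closed form above
        dt = math.log((sigma * v - q) / (v * (sigma - 1)))
        t += dt
        q = float(v)   # q has reached the boundary
        v = -v         # the membrane potential is reversed
        trace.append((t, v, q))
    return trace

trace = simulate_oscillator(1, 4.22, sigma=3.0)
# after the first flip, consecutive flip times differ by the constant
# half-period ln((sigma + 1)/(sigma - 1))
```

With (v⁰, q⁰) = (1, 4.22), the worked example of the text, the first flip occurs after a transient and the motion then settles onto the attractor bouncing between q = -1 and q = 1.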
Figure 2: The reduced phase portrait of the dynamics 2.1-2.2. The q-axis is branched into two identical copies between -1 and 1.
(v⁰, q⁰) = (1, 1) ↦ (-1, 1). If the reader is annoyed by this subtlety, one can replace the function sgn(·) by sgn*(·; v), which is defined by

sgn*(x; v) = -1 if v = -1 and x < 0, or v = 1 and x ≤ 0;
sgn*(x; v) = 1 if v = -1 and x ≥ 0, or v = 1 and x > 0.   (2.4)
The updating rules v := sgn(v - q) and v := sgn*(v - q; v) are equivalent when combined with 2.2. The remarkable similarity of our binary-oscillator to the van der Pol oscillator is worth mentioning. The latter has become popular in neural network modeling since the works of FitzHugh (1961) and Nagumo et al. (1962); an illuminating implementation of such modeling can be found in Rowat and Selverston (1993). The van der Pol oscillator is defined (in one form) by differential equations of relaxation type with the cubic nonlinearity

f(v) = -v³/3 + v

(0 < τ_f ≪ τ_s ≪ 1). Figure 3 shows the phase portrait of the van der Pol oscillator dynamics. Within the limit cycle, a substantial amount of time is spent in the upper and lower portions, and the transition between them is nearly instantaneous. A detailed analysis of the van der Pol equations can be found in Hirsch and Smale (1974). The similarity to the van der Pol oscillator, which models the dynamics of an electrical circuit, makes our binary-oscillator-based neural networks convenient for hardware implementation.
Remark. It is noteworthy that the binary-oscillator is different from the piece-wise linear van der Pol oscillator εẍ + φ(x)ẋ + x = 0, in which φ(x) = 1 if |x| > 1, and φ(x) = -1 if |x| < 1. This oscillator was studied by Levinson (1949), Levi (1981), and Bélair and Holmes (1984) under different circumstances. The binary-valued function φ(x) makes the differential equation piece-wise linear, but it is not a dynamic variable.
Figure 3: The phase portrait of the van der Pol oscillator.

The dynamic variables are x and ẋ, and the corresponding phase space is {(x, ẋ)} = R × R, the two-dimensional Euclidean space. None of the dynamic variables (even after any kind of coordinate transformation) in the piece-wise linear van der Pol oscillator is binary valued. In contrast, the binary-oscillator 2.1-2.2 has a binary dynamic variable v = ±1, and its phase space is {(v, q)} = {1, -1} × R, which is a one-dimensional manifold consisting of two copies of real lines.

3 Variations of the Basic Binary-Oscillator
Parameters can be added and adjusted in our basic oscillator 2.1-2.2 to incorporate biological characteristics.

3.1 Frequency Adjustment. Adding a coefficient τ_s > 0 to the term dq/dt in equation 2.2 adjusts the frequency: a smaller (larger) τ_s value results in a higher (lower) frequency. The frequency-adjustable binary-oscillator is thus given by

v := sgn(v - q)   (3.1)

τ_s dq/dt = -(q - σ_s v)   (3.2)

Figure 4 shows the recordings of v versus time t corresponding to the three parameter values τ_s = 0.5, τ_s < 0.5, and τ_s > 0.5.
Figure 4: Frequency adjustment. Recordings of v versus time t. (a) τ_s = 0.5, (b) τ_s < 0.5, (c) τ_s > 0.5.
The frequency of the oscillator 3.1-3.2 can be explicitly calculated. The variable q bounces back and forth between -1 and 1. Let ν and P denote, respectively, the frequency and the minimal period. Then from 3.2, it follows that

1 - σ_s = (-1 - σ_s) e^{-P/(2τ_s)}   (3.3)

which implies that

P = 2τ_s ln[(σ_s + 1)/(σ_s - 1)],   ν = 1/P   (3.4)
which implies that (3.4) There are other ways to adjust the frequency. One may keep the equation 2.1 and adjust the parameter os in 2.2. Or one may keep the equation 2.2 and add a coefficient p > 0 to the variable q in 2.1, i.e., ZI := sgn(v - pq).
(t
(3.5) -5)
Since the adjustment 3.1-3.2 yields a linear relation between the minimal period P and the time constant rs as shown in 3.4, it seems to be the most convenient one.
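The period formula 3.4 can be checked against a crude forward-Euler integration of 3.1-3.2. This is our own sketch; the function name, step size, and tolerance are arbitrary choices, and eq. 3.2 is taken in the form τ_s dq/dt = -(q - σ_s v).

```python
import math

def measure_period(tau_s, sigma, dt=1e-5):
    """Integrate 3.1-3.2 by forward Euler, starting on the attractor at a
    flip point, and return the time between alternate flips (one period)."""
    v, q, t, flips = 1.0, -1.0, 0.0, []
    while len(flips) < 3:
        q += dt * (-(q - sigma * v)) / tau_s   # tau_s dq/dt = -(q - sigma v)
        t += dt
        if (v == 1.0 and q >= 1.0) or (v == -1.0 and q <= -1.0):
            v = -v
            flips.append(t)
    return flips[2] - flips[0]

tau_s, sigma = 0.5, 2.0
P_formula = 2 * tau_s * math.log((sigma + 1) / (sigma - 1))
P_measured = measure_period(tau_s, sigma)
# P_measured agrees with P_formula up to the integration error
```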
Figure 5: "On" and "off" proportions adjustment. Recording of v versus time t, where τ_s⁺ > τ_s⁻.

3.2 "On" and "Off" Proportions Adjustment. If v = 1, we say that the system 2.1-2.2 is at the "on" state; otherwise it is said to be at the "off" state. Let the time constant in 3.2 depend on the state,

τ_s(v) dq/dt = -(q - σ_s v),   τ_s(1) = τ_s⁺ > 0,   τ_s(-1) = τ_s⁻ > 0   (3.6)

and let (v, q) = (v(t), q(t)) be an arbitrary solution. Let P⁺ and P⁻ denote, respectively, the time spent at the "on" and "off" states during each minimal period. Then

P⁺ = τ_s⁺ ln[(σ_s + 1)/(σ_s - 1)],   P⁻ = τ_s⁻ ln[(σ_s + 1)/(σ_s - 1)]   (3.7)

The derivation of the above formulas is similar to that of 3.4. If τ_s⁺ > τ_s⁻, the "on" state is dominant; if τ_s⁻ > τ_s⁺, the "off" state is dominant. Figure 5 shows such biased dominance. As in the case of frequency adjustment, there are other ways to adjust the "on" and "off" proportions. But the adjustment described above has the advantage that P⁺ and P⁻ are linearly related to τ_s⁺ and τ_s⁻, respectively, as shown in 3.7.

3.3 Polarized Oscillations and Quiescent Neurons with Oscillation Potential. Now let us embed the basic binary-oscillator 2.1-2.2 into a one-parameter family of dynamic systems

v := sgn(v - q + λ)   (3.8)

dq/dt = -(q - σ_s v)   (3.9)
where λ is the parameter and σ_s > 1 is fixed. When λ = 0, we obtain the basic binary-oscillator 2.1-2.2. As λ varies from -∞ to ∞, the system 3.8-3.9 shows different dynamic behaviors that can be used to model certain biological properties. If 0 < |λ| < σ_s - 1, the system 3.8-3.9 still oscillates, but the oscillation (of q) is polarized, i.e., centered at λ instead of 0. Figure 6a depicts such a polarized oscillation. The polarized oscillation distributes different amounts of time to the states v = 1 and v = -1 during each period. Let P_λ denote the minimal period of the polarized oscillation, and let P_λ⁺ and P_λ⁻ denote, respectively, the time distributed to the states v = 1 and v = -1 during each time interval [t, t + P_λ]. Then

P_λ⁺ = ln[(σ_s + 1 - λ)/(σ_s - 1 - λ)],   P_λ⁻ = ln[(σ_s + 1 + λ)/(σ_s - 1 + λ)]   (3.10)
If -(σ_s + 1) < λ ≤ -(σ_s - 1), the system 3.8-3.9 is quiescent, but exhibits postinhibitory rebound (cf. Rowat and Selverston 1993). (v, q) = (-1, -σ_s) is a fixed point attractor. Every initial state (v⁰, q⁰) eventually settles down to this attractor. A postinhibitory rebound is observed if v⁰ = -1 and q⁰ ≤ λ - 1: As time t increases from zero, we first have v(0+) = 1 and q(0+) = q⁰; then, with v(t) = 1, the value q(t) increases through -σ_s and reaches λ + 1; after this, with v(t) = -1, the value q(t) bounces back to -σ_s. Figure 6b depicts such a postinhibitory rebound. If σ_s - 1 ≤ λ < σ_s + 1, the system 3.8-3.9 is quiescent, and exhibits postburst hyperpolarization (cf. Rowat and Selverston 1993). (v, q) = (1, σ_s) is a fixed point attractor. Every initial state (v⁰, q⁰) eventually settles down to this attractor. A postburst hyperpolarization will be observed if v⁰ = 1 and q⁰ ≥ λ + 1: As time t increases from zero, we first have v(0+) = -1 and q(0+) = q⁰; then, with v(t) = -1, the value q(t) decreases through σ_s and reaches λ - 1; after this, with v(t) = 1, the value q(t) bounces back to σ_s. Figure 6c depicts such a postburst hyperpolarization. If |λ| ≥ σ_s + 1, the system 3.8-3.9 is super quiescent. Every initial state (v⁰, q⁰) straightforwardly settles down to the fixed point attractor (-1, -σ_s) or (1, σ_s) depending, respectively, on λ < 0 or λ > 0. Figure 6d shows such a situation. The quiescent system (3.8-3.9) with |λ| ≥ σ_s - 1 has oscillation potential. If a suitable external input is added to the system, e.g., input = -λ, then the system becomes an oscillator again:

sgn(v - q + λ + input) = sgn(v - q)   (3.11)
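The case analysis of this subsection can be summarized in a small classifier. This is our own sketch; the function and its return strings are illustrative, while the boundary inequalities follow the text.

```python
def regime(lam, sigma):
    """Qualitative behavior of system 3.8-3.9 as a function of lambda
    (Subsection 3.3), for a fixed slow-current gain sigma > 1."""
    if abs(lam) < sigma - 1:
        return "oscillates"          # polarized unless lam == 0
    if -(sigma + 1) < lam <= -(sigma - 1):
        return "quiescent, postinhibitory rebound"
    if sigma - 1 <= lam < sigma + 1:
        return "quiescent, postburst hyperpolarization"
    return "super quiescent"         # |lam| >= sigma + 1

# e.g., with sigma = 3: lambda = 0 oscillates, lambda = -2.5 rebounds,
# lambda = 2.5 hyperpolarizes, lambda = 5 is super quiescent
```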
4 Networks of Binary-Oscillators
From Section 2 and Subsection 3.3, we see that the basic binary-oscillator 2.1-2.2 can be embedded into a one-parameter family of dynamic systems

v := sgn(v - q + λ),   λ = parameter   (4.1)
Figure 6: Left: phase portraits. Right: time recordings of v starting from t = 0. (a) 0 < |λ| < σ_s - 1. Polarized oscillation. The figure shows the case 0 < λ < σ_s - 1, and the oscillation distributes more time to the state v = 1 than to v = -1. If -(σ_s - 1) < λ < 0, then the oscillation will distribute more time to the state v = -1 than to v = 1. (b) -(σ_s + 1) < λ ≤ -(σ_s - 1). The system 3.8-3.9 is quiescent, but exhibits postinhibitory rebound. (c) σ_s - 1 ≤ λ < σ_s + 1. The system 3.8-3.9 is quiescent, but exhibits postburst hyperpolarization. (d) |λ| ≥ σ_s + 1. Super quiescent. The figure shows the case λ > σ_s + 1. If λ < -(σ_s + 1), then we have v(t) = -1 for all t > 0.
dq/dt = -(q - σ_s v)   (4.2)

Let (v, q) = (v(t), q(t)) be an arbitrary solution of 4.1-4.2, and let T = σ_s - 1. If we formally identify λ with an external magnetic field, then v(t) is
formally analogous to a stochastic binary spin (an atomic magnet) in a generalized Ising model of statistical mechanics (cf. Hinton and Sejnowski 1983, 1986 and Hertz et al. 1991, pp. 25-29), and T is formally analogous to the temperature. Such analogies are based on the following diagram:

v(t) = -1 if λ ≤ -T;  v(t) = ±1 interchangeably if -T < λ < T;  v(t) = 1 if λ > T   (4.3)
(see Subsection 3.3), which means that after a transient time, if λ ≤ -T, then the "membrane potential" v(t) will be observed at the hyperpolarizing state v = -1; if -T < λ < T, then v(t) will be observed taking the values 1 and -1 interchangeably in a periodic fashion; and if λ > T, then v(t) will be observed at the depolarizing state v = 1. Only for -T < λ < T is the system 4.1-4.2 a real oscillator, and the oscillation for λ near -T (respectively T) distributes more time to the state v = -1 (respectively v = 1) than to v = 1 (respectively v = -1). See 3.10 for the precise time distribution to the states v = ±1. As T → 0, 4.3 turns into a sharp threshold relation between v and λ; and this is comparable with the zero-temperature limit of a stochastic binary spin. Due to the analogy between the binary-oscillator 4.1-4.2 and the stochastic binary spin, we try to develop a binary-oscillator network model that is formally parallel to a generalized Ising model of statistical mechanics. Let B(λ) denote the dynamic system 4.1-4.2, let (v_i, q_i) (i = 1, 2, ..., n) denote the physical state of the ith unit in a binary-oscillator network, and let B_i(λ) denote the system 4.1-4.2 with (v, q) = (v_i, q_i). Then the network activity observed at the ith unit is represented by

B_i(Σ_j w_ij v_j + e_i)   (4.4)

where w_ij is the coupling weight, and e_i is the external input. Equations 4.4 plus 4.3 make a binary-oscillator network formally parallel to a system of stochastic binary spins in statistical mechanics (of course, they are conceptually different) (cf. Hinton and Sejnowski 1983, 1986 and Hertz et al. 1991). The expression 4.4 can be written out explicitly as

v_i := sgn(v_i - q_i + Σ_j w_ij v_j + e_i)   (4.5)

dq_i/dt = -(q_i - σ_s v_i)   (4.6)

where i = 1, 2, ..., n, and n is the number of neurons in the network. To run the network dynamics, we first update the membrane potentials v_i according to 4.5, and then integrate the slow currents q_i according to 4.6 until at least one of the v_i needs to be updated again.
Remark 4.1. It is convenient to assume the conditions σ_s ≥ 2 and

Σ_j |w_ij| + |e_i| < 1,   i = 1, 2, ..., n   (4.7)

This keeps the system from being trapped into endless updating by 4.5.
Remark 4.2. If we want to remove the condition 4.7 to consider general connection weights and external inputs, then a (nearly negligible) updating time should be assigned to the system, so that the slow dynamics 4.6 does not stop while the v_i are being updated. This ensures that the dynamics will not stop at finite time.

Remark 4.3. Each time we update the membrane potentials v_i, a special order must be followed. We should first search out all the v_i that need to be reversed, and have them reversed at the same time. This first action may cause some of the remaining v_i to need reversal. Again, search them all out, and have them reversed at the same time. This process can be kept ongoing, and will last for at most n steps. The condition 4.7 prevents the updating from being circular.

Remark 4.4. As in the case of the basic binary-oscillator, the initial condition v_i⁰ = 1 plus v_i⁰ - q_i⁰ + Σ_j w_ij v_j⁰ + e_i = 0 is subtle. To avoid this subtlety we can replace the function sgn(·) by sgn*(·; v_i), which is defined by 2.4. For theoretical study, these two functions work equivalently with the dynamics 4.5-4.6. For computer simulation, such a replacement is strongly recommended.
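The update order of Remark 4.3 can be sketched as follows. The code is our own; the names (v, q, w, e) and the sgn*-style tie-break are illustrative assumptions.

```python
def update_potentials(v, q, w, e):
    """One synchronized update of rule 4.5 in the order of Remark 4.3:
    repeatedly find every unit whose drive opposes its current sign
    (sgn* convention of eq. 2.4) and reverse them all at once; under
    condition 4.7 this stabilizes within at most n passes."""
    n = len(v)
    for _ in range(n):
        drive = [v[i] - q[i]
                 + sum(w[i][j] * v[j] for j in range(n)) + e[i]
                 for i in range(n)]
        flip = [i for i in range(n)
                if (v[i] == 1 and drive[i] <= 0)
                or (v[i] == -1 and drive[i] >= 0)]
        if not flip:
            break
        for i in flip:
            v[i] = -v[i]
    return v

# two mutually inhibiting units: only the weakly driven one reverses
state = update_potentials([1, 1], [1.0, -0.5],
                          [[0.0, -0.5], [-0.5, 0.0]], [0.0, 0.0])
```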
Remark 4.5. Slow synapses can also be incorporated in the binary-oscillator network. Let I_ik denote the input from the kth neuron to the ith neuron conducted by a slow synapse with strength u_ik. Then the network dynamics with both fast and slow synapses is defined by equations 4.8-4.10, where τ is a time constant. Note: In general, the index k in 4.8 and 4.10 takes values only from a subset of {1, 2, …, n}. This allows the absence of slow synapses between some pre- and postsynaptic cells.
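As an illustration, the fast updates of 4.5 with the simultaneous-reversal order of Remark 4.3, plus a forward-Euler step for the slow currents, can be sketched as follows. This is only a sketch, not the paper's code: the function names are ours, the form of 4.5 is the reconstruction above, and the slow dynamics is taken in the per-neuron form dq_i/dt = −(q_i − σ_s v_i) used in section 5.

```python
import numpy as np

def update_potentials(v, q, W, e):
    """Fast dynamics 4.5 with the order of Remark 4.3: repeatedly search out
    every v_i whose sign disagrees with v_i - q_i + sum_k w_ik v_k + e_i and
    reverse them all at the same time; under condition 4.7 this terminates
    after at most n rounds.  (sgn(0) flips nothing here; cf. Remark 4.4.)"""
    for _ in range(len(v) + 1):
        u = v - q + W @ v + e          # argument of sgn(.) in 4.5
        flip = np.sign(u) == -v        # neurons whose sign must be reversed
        if not flip.any():
            break
        v = np.where(flip, -v, v)      # reverse them simultaneously
    return v

def integrate_slow(q, v, sigma_s, dt):
    """One Euler step of the slow currents, dq_i/dt = -(q_i - sigma_s v_i)."""
    return q + dt * (sigma_s * v - q)

# a quick check on a reciprocal pair with no pending reversals
v_demo = update_potentials(np.array([1.0, 1.0]), np.array([0.0, 0.0]),
                           np.array([[0.0, -0.5], [-0.5, 0.0]]), np.zeros(2))
```

In a full simulation one alternates the two calls: settle the fast variables, then advance the slow currents until some v_i needs updating again.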
5 Phase Locking of a Reciprocal Inhibitory Pair of Binary-Oscillators
In this section we discuss a concrete example, showing that the dynamics of a binary-oscillator network is consistent with the existing experimental and theoretical results under similar circumstances, so it makes sense to adopt binary-oscillator networks as abstract models of neural networks. When two identical cells that are endogenous oscillators are connected with reciprocal inhibition of equal strength, the two cells oscillate exactly out of phase with each other. For the corresponding biological background the reader is referred to Rowat and Selverston (1993) and Wang and Rinzel (1992). A reciprocal inhibitory pair of binary-oscillators is given by
v1 := sgn(v1 − q1 − w v2)   (5.1)

v2 := sgn(v2 − q2 − w v1)   (5.2)

dq1/dt = −(q1 − σ_s v1)   (5.3)

dq2/dt = −(q2 − σ_s v2)   (5.4)

where σ_s ≥ 2 and 0 < w < 1 (or σ_s > 1 and 0 < w < σ_s).
Main results. We are going to show that for all 0 < w < 1, the system 5.1-5.4 has exactly two periodic solutions. One of them is an attractor, of which the membrane potentials v1 = v1(t) and v2 = v2(t) oscillate exactly out of phase with each other. The other one is a saddle type repeller, of which the membrane potentials v1 = v1(t) and v2 = v2(t) oscillate exactly in phase with each other. For the attractor, the basin of attraction has full Lebesgue measure, i.e., almost all initial states (v1^0, v2^0, q1^0, q2^0) will be attracted to the attractor. An even more appealing result that we are going to show is that the connection weight w measures how fast an initial state (v1^0, v2^0, q1^0, q2^0) approaches the attractor. Hence, a very weak connection weight results in very slow convergence, i.e., for a long transient period the oscillators behave like free oscillators. Figure 7 shows the out of phase and in phase oscillations. To understand the dynamics 5.1-5.4, it is sufficient to know how the point (q1, q2) = (q1(t), q2(t)) moves in the two-dimensional plane. From 5.3 and 5.4, it follows that
q_i = σ_s v_i + (q_i^0 − σ_s v_i) e^{−t},   i = 1, 2   (5.5)
This implies that each time after the membrane potentials v1 and v2 are updated, the point (q1, q2) will travel along the straight line connecting the points (q1^0, q2^0) and (σ_s v1, σ_s v2) until it hits some boundary where (v1, v2) needs to be updated again. It is easy to see that no matter what the initial state (v1^0, v2^0, q1^0, q2^0) is, the dynamics will soon send the point (q1, q2) = (q1(t), q2(t)) into the square ABCD (see Figure 8). Once (q1, q2) gets into ABCD, it will stay there forever and bounce back and forth in there. Figure 8 shows how a typical point (q1, q2) travels along a polygonal path.
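The travel-and-bounce dynamics can be simulated event by event directly from the closed form 5.5. The sketch below is ours, not the paper's: the parameter values, the flip-settling loop, and the numerical tolerance are assumptions.

```python
import numpy as np

SIGMA_S, W = 2.0, 0.5      # sigma_s >= 2 and the coupling w, 0 < w < 1 (assumed)

def settle(v, q):
    """Flip every v_i whose sign disagrees with v_i - q_i - w v_j (5.1, 5.2);
    repeat until stable. A small tolerance absorbs round-off at a boundary."""
    for _ in range(4):
        flipped = False
        for i in range(2):
            if v[i] * (v[i] - q[i] - W * v[1 - i]) <= 1e-9:
                v[i] = -v[i]
                flipped = True
        if not flipped:
            break
    return v

def next_event(v, q):
    """Advance (q1, q2) along the closed form 5.5,
    q_i(t) = sigma_s v_i + (q_i - sigma_s v_i) e^{-t},
    up to the first time some q_i reaches its bouncing boundary 5.6."""
    times = []
    for i in range(2):
        b = v[i] - W * v[1 - i]                      # boundary q_i = v_i - w v_j
        ratio = (b - SIGMA_S * v[i]) / (q[i] - SIGMA_S * v[i])
        times.append(-np.log(ratio) if 0.0 < ratio < 1.0 else np.inf)
    t = min(times)
    return SIGMA_S * v + (q - SIGMA_S * v) * np.exp(-t), t

# event-driven simulation of the pair from a generic initial state
v, q = np.array([1.0, 1.0]), np.array([0.0, 0.3])
for _ in range(50):
    q, t = next_event(v, q)
    v = settle(v, q)
# after a short transient the pair is phase locked out of phase, so v1 = -v2
```

Consistent with the main results, the asymmetry of the initial state decays geometrically and the two membrane potentials end up reversing simultaneously.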
Figure 7: (a) Out of phase oscillation. (b) In phase oscillation.

Let us now focus on the polygonal paths in the square ABCD. The bouncing boundaries are given by 5.1 and 5.2. These are the horizontal and vertical lines
q_i = v_i − w v_j = ±1 ± w,   i = 1, 2,   j = 1, 2   (5.6)
In the square ABCD, each polygonal path produced by the dynamics 5.1-5.4 consists of a sequence of oriented line segments ℓ_1, ℓ_2, ℓ_3, …. Each ℓ_k starts at one of the bouncing boundaries and ends at another. The dynamic system 5.1-5.4 on the space {(v1, v2, q1, q2)} induces a dynamic system on the space of ℓ's, the admissible oriented line segments. Only the direction and the ending point of each ℓ are relevant to the orbit structure of the dynamics, so we shall have any two admissible oriented line segments with the same direction and the same ending point identified as one, and name the resulting equivalence classes "bouncers." Each bouncer has a well-defined direction and ending point. The induced dynamic system on the set of bouncers completely determines the orbit structure of the system 5.1-5.4. The effective bouncing boundaries (where bouncing really occurs) are the closed line segments
DA, AB, BC, CD   (5.7)

D″A′, A′B″, B″C′, C′D″   (5.8)
Wei-Ping Wang
Figure 8: The square ABCD, the effective bouncing boundaries, and a polygonal path produced by the dynamic system 5.1-5.4.

in Figure 8. These are segments on the lines given by 5.6. We call the boundaries in the group 5.7 outer boundaries, and those in the group 5.8 inner boundaries. If a bouncer ends at an inner boundary, but not at A′ or C′, then it is called an inner bouncer. Similarly, if a bouncer ends at an outer boundary, but not at A or C, then it is called an outer bouncer. Under the induced dynamics, each inner bouncer is followed by an outer bouncer, but each outer bouncer may be followed by either an inner bouncer or an outer bouncer. This means that if O denotes the set of outer bouncers, then the induced dynamics defines a first return map R: O → O.
The return map R contains all the information about the orbit structure of the induced dynamics on the set of bouncers (keep in mind that an
inner bouncer is always followed by an outer bouncer), and hence the orbit structure of the original dynamics 5.1-5.4. We now proceed to calculate the return map R. Let us first have the bouncers labeled, i.e., assign coordinates to the bouncers. All outer bouncers ending at the line segments CD or DA are directed at the same point (σ_s, −σ_s), so these bouncers carry the natural topology of the set CD ∪ DA, i.e., these bouncers can be naturally identified with their end points, which belong to the set CD ∪ DA. We can easily choose a coordinate for the set CD ∪ DA: name it x, let D be the origin, and put CD on the negative part of the x-axis and DA on the positive part. We assume that x is measured at the same scale as q1 and q2. For example, x = 2w corresponds to the point D″ and x = −2w corresponds to D′. The x-interval (−2 − 2w, 2 + 2w) labels the outer bouncers ending at the set CD ∪ DA. Similarly, we assign the y coordinate to the outer bouncers ending at the set AB ∪ BC: B is the origin, and AB is put on the negative part of the y-axis and BC on the positive part. The bouncing diagram

(CD ∪ DA) \ {A, C′} → ⋯ → AB ∪ BC   (5.9)

shows that the return map R takes the outer bouncers ending at the set CD ∪ DA into outer bouncers ending at AB ∪ BC. Similar bouncing diagrams can be constructed to show that
R|_{CD∪DA}: (−2 − 2w, 2 + 2w) → (−2 − 2w, 2 + 2w)   (5.10)

R|_{AB∪BC}: (−2 − 2w, 2 + 2w) → (−2 − 2w, 2 + 2w)   (5.11)
The maps 5.10 and 5.11 are identical if we have the x and y identified by their numerical values. Hence to calculate the return map R, it is sufficient to calculate 5.10. It is easy to see that the map 5.10 is an odd function, i.e., R|_{CD∪DA}(−x) = −R|_{CD∪DA}(x). So we only need to calculate R|_{CD∪DA}(x) for x ≥ 0, which can easily be done (see the Remark near the end of the section) by following the bouncing sequence 5.9. The result is a piecewise linear odd map R(x), given by 5.12, with a break point at x = 2w, where c1 = σ_s − 1 − w, c2 = σ_s − 1 + w, c3 = σ_s + 1 − w, c4 = σ_s + 1 + w. Figure 9 shows the graph of y = R(x). A straightforward calculation shows

|dR(x)/dx| ≤ 1 − 4w/(c2c3),   x ∈ [0, 2 + 2w) \ {2w}   (5.13)
Figure 9: The graph of y = R(x).
Recall that the return map R is defined on the set of outer bouncers. Now it is more convenient to have the outer bouncers labeled by x paired with those labeled by y: an x-bouncer and a y-bouncer are paired if their x and y coordinates have the same numerical value. Since the maps 5.10 and 5.11 are identical if we have the x and y identified by their numerical values, the return map R preserves pairs, i.e., it maps pairs to pairs. Hence R induces a map R̄ defined on the set of pairs. The set of pairs can be identified with the interval (−2 − 2w, 2 + 2w), and R̄(x) = R|_{CD∪DA}(x) for all x ∈ (−2 − 2w, 2 + 2w). A nice thing about R̄ is that it maps the interval (−2 − 2w, 2 + 2w) into itself. From 5.13 we obtain
|R̄(x) − R̄(y)| ≤ [1 − 4w/(c2c3)] |x − y|   (5.14)

which implies that the map R̄ is a global contraction. By the contraction mapping theorem, we see that R̄ has a unique fixed point x = 0, which is an attractor. The attractor x = 0 for the map R̄ represents a pair of outer bouncers: the one ending at the point B and the one ending at D. This pair of bouncers constitutes a unique periodic attractor for the return map R, which in turn corresponds to a unique periodic attractor for the dynamic system 5.1-5.4, of which the membrane potentials v1 = v1(t) and v2 = v2(t) oscillate exactly out of phase with each other.
The quantity 4w/(c2c3) in 5.14 measures how fast an initial state (v1^0, v2^0, q1^0, q2^0) approaches the periodic attractor. This implies that a very weak connection weight results in very slow convergence, i.e., for a long transient period the oscillators behave like free oscillators. The interval (−2w, 2w) in Figure 9 corresponds to the corners B′BB″ or D′DD″. Once a bouncer gets into these corners, it will stay there forever, bouncing back and forth. Figure 9 shows that the induced dynamics first sends the bouncers into these corners, and then lets them settle down to the periodic attractor. When a bouncer bounces back and forth in these corners, the corresponding v1 = v1(t) and v2 = v2(t) are out of phase with each other. This means that on the way an initial state (v1^0, v2^0, q1^0, q2^0) approaches the periodic attractor, the dynamics first has v1 and v2 phase locked and then has the frequency settle down to the equilibrium frequency. Finally, let us look at the points A, C, A′, and C′, which have been excluded from our discussion. Bouncers ending at A and C constitute a periodic orbit for the induced dynamics, and bouncers ending at A′ or C′ are attracted to this periodic orbit. All of the other bouncers are pushed away from this A-C orbit, and attracted to the B-D attractor. It is in this sense that we say that the A-C orbit is a saddle type repeller. The A-C orbit corresponds to a periodic orbit of the dynamics 5.1-5.4 of which the membrane potentials v1 = v1(t) and v2 = v2(t) oscillate exactly in phase with each other.
Remark. For those readers who are interested in more details on the derivation of 5.12, it is instructive to try the map DD″ → AB. In the two-dimensional (q1, q2)-plane, points on the segment DD″ are represented by (q1, q2) = (1 + w, −1 − w + x), 0 ≤ x ≤ 2w, and those on AB by (q1, q2) = (−1 − w − y, 1 + w), −2 − 2w ≤ y ≤ 0.
References

Belair, J., and Holmes, P. 1984. On linearly coupled relaxation oscillations. Quart. Appl. Math. 42, 193-219.
Bienenstock, E., and Geman, S. 1995. Compositionality in neural systems. In The Handbook of Brain Theory and Neural Networks, M. A. Arbib, ed., pp. 223-226. MIT Press, Cambridge, MA.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121-130.
FitzHugh, R. 1961. Impulses and physiological states in theoretical models of nerve membrane. Biophys. J. 1, 445-466.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA.
Hinton, G. E., and Sejnowski, T. J. 1983. Optimal perceptual inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Washington 1983), pp. 448-453. IEEE, New York.
Hinton, G. E., and Sejnowski, T. J. 1986. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, eds., Vol. 1, Chap. 7. MIT Press, Cambridge, MA.
Hirsch, M. W. 1987. Convergence in neural nets. Proc. 1987 Int. Conf. Neural Networks, San Diego, II, 115-125.
Hirsch, M. W., and Smale, S. 1974. Differential Equations, Dynamical Systems, and Linear Algebra. Academic Press, New York.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Kopell, N. 1988. Toward a theory of modeling central pattern generators. In Neural Control of Rhythmic Movements in Vertebrates, A. H. Cohen, S. Rossignol, and S. Grillner, eds., pp. 369-413. John Wiley, New York.
Levi, M. 1981. Qualitative analysis of the periodically forced relaxation oscillations. Memoirs AMS 32 (244), 1-117. American Mathematical Society, Providence, RI.
Levinson, N. 1949. A second order differential equation with singular solutions. Ann. Math. 50, 127-153.
Mumford, D. 1991. On the computational architecture of the neocortex. I. The role of the thalamo-cortical loop. Biol. Cybern. 65, 135-145.
Mumford, D. 1992. On the computational architecture of the neocortex. II. The role of cortico-cortical loops. Biol. Cybern. 66, 241-251.
Nagumo, J., Arimoto, S., and Yoshizawa, S. 1962. An active pulse transmission line simulating nerve axon. Proc. IRE 50, 2061-2070.
Nicholls, J. G., Martin, A. R., and Wallace, B. G. 1992. From Neuron to Brain, 3rd ed. Sinauer Associates, Sunderland, MA.
Rand, R. H., and Holmes, P. J. 1980. Bifurcation of periodic motions in two weakly coupled van der Pol oscillators. Int. J. Non-Linear Mechanics 15, 387-399.
Rowat, P. F., and Selverston, A. I. 1993. Modeling the gastric mill central pattern
generator of the lobster with a relaxation-oscillator network. J. Neurophysiol. 70(3), 1030-1053.
Smale, S. 1972. On the mathematical foundations of electrical circuit theory. J. Differential Geometry 7, 193-210.
Takens, F. 1976. Constrained equations: A study of implicit differential equations and their discontinuous solutions. In Structural Stability, the Theory of Catastrophes, and Applications in the Sciences, P. Hilton, ed., Lecture Notes in Math. Vol. 525, pp. 147-243. Springer, Heidelberg.
van der Pol, B. 1926. On "relaxation-oscillations." Philos. Mag. 2, 978-992.
Wang, W.-P. 1994. Dynamics of isotropic excitatory neural systems: An exploration of global coding, internal logic and compositionality of shapes. Preprint.
Wang, X.-J., and Rinzel, J. 1992. Alternating and synchronous rhythms in reciprocally inhibitory model neurons. Neural Comp. 4, 84-97.
Received March 22, 1994; accepted June 26, 1995
Communicated by Steven J. Nowlan
Annealed Competition of Experts for a Segmentation and Classification of Switching Dynamics
We present a method for the unsupervised segmentation of data streams originating from different unknown sources that alternate in time. We use an architecture consisting of competing neural networks. Memory is included to resolve ambiguities of input-output relations. To obtain maximal specialization, the competition is adiabatically increased during training. Our method achieves almost perfect identification and segmentation in the case of switching chaotic dynamics where input manifolds overlap and input-output relations are ambiguous. Only a small dataset is needed for the training procedure. Applications to time series from complex systems demonstrate the potential relevance of our approach for time series analysis and short-term prediction.

1 Introduction
Neural networks provide frameworks for the representation of relations present in data. Especially in the fields of classification and time series prediction, neural networks have made substantial contributions. An important prerequisite for the successful application of such systems, however, is a certain uniformity of the data. In most analyses of data series, stationarity must be assumed, i.e., it must be assumed that the relations remain constant over time. If, on

Temporary address: The Salk Institute, CNL, Box 85800, San Diego, CA 92186-5800.
Neural Computation 8, 340-356 (1996) © 1996 Massachusetts Institute of Technology
the contrary, the data originate from different sources, e.g., because the underlying system switches its dynamics, standard approaches like simple multilayer perceptrons are likely to fail to represent the underlying input-output relations. Such time series can originate from many kinds of systems in physics, biology, and engineering. Phenomena of this kind include, e.g., speech (Rabiner 1988), brain data (Pawelzik 1994), and dynamic systems that switch their attractors (Kaneko 1989). In this paper we present a method for the segmentation of such data streams without prior knowledge about the sources. We consider the case where the different input-output samples [x(t), y(t)] are generated by a number n of unknown functions f_l, l = 1, …, n, which alternate according to l(t), i.e., y(t) = f_{l(t)}[x(t)]. The task then is to determine both the functions f_l and their respective attributions l(t) from a given time series {[x(t), y(t)]}. Since both the functions and the segmentation are considered to be unknown, they have to be determined simultaneously, i.e., the correct segmentation has to be found in an unsupervised manner. The mixtures of experts architecture, as proposed by Jacobs et al. (1991), potentially offers a solution to this problem, since it can represent different functions by the respective experts. There are, however, problems when applying the mixtures of experts architecture to the task of identifying alternating sources. One problem arises when the gating of the experts is based on the input alone, because in general the underlying sources will have overlapping input domains. To solve this problem, we here use an ensemble of expert networks whose competition depends only on their relative performance and not on the input.
This way of introducing the competition relates to clustering and vector quantization (McLachlan and Basford 1988) and is in contrast to the mixtures of experts architecture, which uses an input-dependent gating network (Jacobs et al. 1991). When the sources have overlapping arguments, a further problem arises: the functions may intersect. In this case, there are input-output pairs that are identical for different functions i ≠ j, i.e., there are (x, y) for which y = f_i(x) = f_j(x). As we will show, such intersections induce additional ambiguities, a further problem, which can be resolved only by imposing additional constraints. We present a learning rule performing this disambiguation, which is derived from a simple assumption about memory in the switching process: a low switching rate. This assumption allows one to train the system of experts on very small data sets and does not require any statistics of switching events. In particular, the method can identify switchings in a time series from only a number of data that just suffices to characterize the two respective functions. Our approach does not provide an analysis of the dynamics of the switching itself, which has been addressed in Cacciatore and Nowlan (1994) and Bengio and Frasconi (1994); we discuss the relation of these approaches to our work in Section 5.
K. Pawelzik, J. Kohlmorgen, K.-R. Müller
For clear cut segmentation, each sample [x(t), y(t)] must be assigned to only one expert. This can most easily be achieved by considering only the respective best performing expert. However, when using such hard competition during training, it is likely to get stuck in local minima, which in some cases can be overcome by using sample dependent ad hoc initializations (Kohlmorgen et al. 1994; Müller et al. 1994, 1995). As a more general approach, we here propose to anneal the competition of the networks adiabatically during training (see also Yuille et al. 1994). We will show that with this method the networks successively specialize in a hierarchical manner via a series of phase transitions, an effect that has been analyzed in the context of clustering by Rose et al. (1990). In Section 2, we introduce our approach, and in Section 3 we demonstrate the features of our method with an example of alternating functions over the unit interval that intersect. In that example, the input-output samples are given by the dynamics of chaotic maps and the experts correspond to predictors. This relates our method to common techniques in system identification (Shamma and Athans 1992) and time series prediction (Tong and Lim 1980). In Section 4, we apply our method to benchmark data from the Santa Fe Time Series Prediction Competition (Weigend and Gershenfeld 1994), an application that demonstrates that our approach may substantially improve predictions of time series and opens new perspectives for signal classification, which we finally discuss in Section 5.

2 Unmixing of Experts
Data originating from different sources are subject to ambiguity. If input-output relations are considered, this can have at least two interdependent reasons. First, the input domains may overlap. However, it is impossible for a single network to map the same inputs to different outputs without using extra information. Second, input and output of different sources can be identical for a subset of the data. In this latter case, information beyond the input-output pairs is required in order to reassign the data to the sources. For illustrating the basic ideas underlying our approach, we discuss the extreme case of completely overlapping input manifolds. An example is given by input-output pairs (x_t, y_t) = [x_t, f_l(x_t)], t = 1, …, T, that at each time step t are a choice l = l(t), l = 1, 2, 3, 4, of one of the four maps f1(x) = 4x(1 − x), x ∈ [0, 1] ("logistic map"), f2(x) = 2x if x ∈ [0, 0.5) and 2(1 − x) if x ∈ [0.5, 1] ("tent map"), f3 = f1 ∘ f1 ("double logistic map"), or f4 = f2 ∘ f2 ("double tent map"); f ∘ f denotes the iteration f[f(x)]. If we set x_{t+1} = y_t, we get a chaotic time series {x_t} with x_{t+1} = f_l[x_t] (see Fig. 1). When these maps are alternately used, a given input x_t alone does not determine the appropriate output y_t, and a representation of the underlying relations therefore must necessarily contain a division
Figure 1: (a) Training data drawn from four chaotic return maps, 300 points for each map. A new map is chosen after every 100 recursions. The first 400 values of the resulting time series are shown in (b).
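For concreteness, a switching time series like the one in Figure 1 can be generated as follows. This is a sketch: the random choice of map, the seed, and the tiny jitter (which keeps finite-precision tent-map orbits away from the fixed point x = 0) are our additions, not the paper's.

```python
import numpy as np

# the four maps on [0, 1] from the text
def f1(x): return 4 * x * (1 - x)                    # logistic map
def f2(x): return 2 * x if x < 0.5 else 2 * (1 - x)  # tent map
def f3(x): return f1(f1(x))                          # double logistic map
def f4(x): return f2(f2(x))                          # double tent map
maps = [f1, f2, f3, f4]

def switching_series(T=1200, block=100, x0=0.3, seed=0):
    """x_{t+1} = f_{l(t)}(x_t), with a new map l drawn at random every
    `block` recursions (cf. Figure 1)."""
    rng = np.random.default_rng(seed)
    x, xs, labels = x0, [], []
    l = 0
    for t in range(T):
        if t % block == 0:
            l = int(rng.integers(4))       # pick the active map for this block
        xs.append(x)
        labels.append(l)
        x = maps[l](x)
        # tiny jitter (our addition): prevents the tent map from collapsing
        # onto x = 0 under binary floating point
        x = min(max(x + 1e-12 * rng.random(), 0.0), 1.0)
    return np.array(xs), np.array(labels)

xs, labels = switching_series()
```

The pairs (x_t, x_{t+1}) then form the training data, and `labels` gives the ground-truth attribution against which a segmentation can be scored.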
into subtasks. For such data sets, a gating network that depends only on the input (Jacobs et al. 1991) must necessarily fail. In our approach, we therefore adapt a set of predictors f_i, i = 1, …, n, weighted only by their relative performance. The optimal choice of function approximators depends on the specific application. Throughout this paper we use radial basis function networks (RBFNs) of the Moody-Darken type (Moody and Darken 1989), because they offer a fast learning method. We train the weights w_i of network i by performing a
gradient descent (2.1) on the squared prediction errors ε_t^i. The weighting coefficient p_t^i corresponds to the relative probability for a contribution of network i, and the p_t^i are constrained to satisfy Σ_i p_t^i = 1. Our approach differs from previous work in the way the p_t^i incorporate memory that is present in the switching process. We start by assuming that the outputs f_i(x_t) are distributed according to gaussians, as in equation 2.3.
We furthermore assume that the system does not switch its state l(t) at every time step, but instead alternates among the different subsystems i ≠ j at low rates r_ij ≤ r. This is a rather weak assumption about the memory of the switching process, which we will use in the following to derive a simple bias in the probabilities p_i. Then the probability that a given subsequence σ_t^Δ = {[x(t − Δ), y(t − Δ)], …, [x(t + Δ), y(t + Δ)]} of the time series is generated by a particular sequence s^Δ = [l(t − Δ), …, l(t + Δ)] of functions f_{l(t)} is given by equation 2.4, where ε^Δ denotes the corresponding sequence of errors. Bayes' rule in this case gives
(2.5)

where the sum runs over all possible sequences s^Δ. This equation can be strongly simplified in the case of a low bound r on the switching rate, when short sequences have a small probability q of containing a switching event, i.e., if Δ ≤ q/r. In this case, we can neglect sequences that contain switchings, i.e., we set p(s^Δ) = 0 if not all components are equal. The remaining n sequences are considered equiprobable according to maximum entropy [i.e., p(s^Δ) = 1/n], and we then obtain from equations 2.4 and 2.5 the probability (2.6) that f_i generated the subsequence σ_t^Δ. Using equation 2.3,
Figure 2: (a) Result for hard competition without prior annealing: Although a proper initialization was intended, one net grabbed two "similar" return maps, f1 and f2. A distinction between these two maps is no longer possible and the prediction error for both maps remains high. (b) Annealing without the inclusion of memory allows the creation of maps that jump from one target map to another along the x-axis. The information that consecutive data points belong with high probability to the same dynamics is not utilized.
this finally provides the estimate for the weighting coefficients:
p_t^i = exp(−β Σ_{t′=t−Δ}^{t+Δ} ε_{t′}^i) / Σ_{j=1}^n exp(−β Σ_{t′=t−Δ}^{t+Δ} ε_{t′}^j)   (2.7)
Note that this result can equivalently, but less intuitively, be derived
from minimizing the free energy F = Σ_t log{Σ_i p(ε_t | i) p(i)} under the above assumptions. For Δ = 0, this reduces to a mixture of gaussians (McLachlan and Basford 1988) and is equivalent to a mixture of experts (Jacobs et al. 1991), however, without a gating network. According to equation 2.7, we can simply use low-pass filtered errors instead of the plain ε_t^i to include memory that originates from a low switching rate. The drastic simplification of memory (probability 0 for sequences of length Δ that include switching) led to the box-type filter, which might be replaced by an exponential¹ to model the switching probabilities more realistically. Yet, without any knowledge about the characteristics of the time series, equation 2.7 seems to be the simplest and at the same time computationally least expensive way to include memory. Heuristically, equation 2.7 is analogous to evolutionary inertia: once a predictor has performed better than its competitors, it also has an advantage for temporally adjacent data points. This helps to regularize data at ambiguities. In the example of the chaotic maps, such ambiguities emerge at the intersections, where additional information is required to decide which branches of the function "belong together." For the purpose of segmentation, it might seem most desirable to choose β large. Indeed, one could consider β = ∞, which corresponds to hard competition (winner-takes-all) and guarantees an unambiguous segmentation (Kohlmorgen et al. 1994; Müller et al. 1994, 1995). We found, however, that using hard competition right from the beginning does not always lead to a sufficient diversification of the predictors. The final result in general depends on the choice of initial parameters, which may lead to local minima in the likelihood F, and a mixing of maps can occur (see Fig. 2a). We solve this initialization problem by adiabatically increasing the degree of competition.
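In code, the weighting of equation 2.7 is a softmin over box-filtered errors. A minimal sketch (the array layout, the toy residuals, and the annealing schedule below are our assumptions, not the paper's):

```python
import numpy as np

def weights(errors, beta, delta):
    """Equation 2.7: p_t^i = exp(-beta * E_t^i) / sum_j exp(-beta * E_t^j),
    where E_t^i = sum_{t'=t-delta}^{t+delta} eps_{t'}^i is a box-filtered
    error that encodes the low switching rate.
    errors: array of shape (n_predictors, T); returns the same shape."""
    kernel = np.ones(2 * delta + 1)
    E = np.array([np.convolve(e, kernel, mode="same") for e in errors])
    E = E - E.min(axis=0)            # shift per time step to stabilize exp
    P = np.exp(-beta * E)
    return P / P.sum(axis=0)

# annealing: recompute the weights while T = 1/beta is lowered step by step
errors = np.array([[0.1] * 50, [1.0] * 50])   # toy residuals for two experts
for beta in (0.1, 1.0, 10.0):
    p = weights(errors, beta, delta=3)        # -> use p in gradient step 2.1
```

At small beta the weights are nearly uniform (shared data); at large beta they approach the hard winner-takes-all assignment.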
For β = 0, the predictors equally share the same data for training. Increasing β enforces the competition, thereby driving the predictors to a specialization on different subsets of the data. Diversification occurs at particular "temperatures" T = 1/β, and the network parameters separate abruptly, resolving the underlying structure to more detail. These phase transitions are indicated by a drop of the mean squared error E = Σ_t Σ_i p_t^i ε_t^i (see Fig. 3a) and have been described within a statistical mechanics formalism (Rose et al. 1990). Note that a careful decrease of T is crucial when fine differences of underlying functions have to be resolved.

3 Applications to Switching Chaos
First, we illustrate our approach with a time series of N = 1200 points from the four chaotic maps f1, …, f4 introduced above (Fig. 1). These maps were alternated every 100 iteration steps. Because these dynamic

¹The latter would yield a weighted low-pass filter in equation 2.7.
systems are ergodic on the support x ∈ [0, 1], they cannot be distinguished on the basis of their arguments alone. Furthermore, the small rate r = 1/100 guarantees a large probability for short sequences of, e.g., length 7 to contain no alternations of the underlying system, which justifies our simple method of taking memory into account by setting Δ = 3 in equation 2.7. Note, however, that this parameter is not crucial. We used 6 radial basis function networks of the Moody-Darken type (Moody and Darken 1989) as predictors and decreased the temperature T = 1/β adiabatically, i.e., the next smaller value of the temperature is taken when the overall error E has saturated. The result is shown in Figure 3. The error decreases most during phase transitions (Fig. 3a), which occur when the different underlying dynamics abruptly become resolved to more detail (Fig. 4). After the relevant structures have been found by the algorithm, no further phase transitions occur, and there is only little further decrease of the error when T approaches zero. At T ≈ 0, we find that four networks (out of six) segmented the time series almost exactly at the switching points, while two drifted off (Fig. 3b), did not contribute, and therefore could be removed without changing the performance E. The method can be applied to time series from high-dimensional chaotic systems simply by replacing the scalar argument x by vectors that are obtained by the method of time delay embedding of the time series (Takens 1981; Liebert et al. 1991) and by a corresponding adaptation of the networks. As an example for a high-dimensional chaotic system, we take the Mackey-Glass delay-differential equation

dx(t)/dt = −0.1 x(t) + 0.2 x(t − t_d) / [1 + x(t − t_d)^10]   (3.1)
originally introduced as a model of blood cell regulation (Mackey and Glass 1977). We generated a time series of N = 400 points where we switched the delay parameter t_d. For the first and last 100 samples (sampling rate τ = 6) we chose t_d = 17, whereas for the second 100 samples we used t_d = 23 and for the third t_d = 30. To increase the difficulty of the problem, 5% noise was added at each integration step, thereby turning the system stochastic (Fig. 5a). For the creation of a training set out of this time series, an embedding dimension m = 6 was used (Casdagli 1989).
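A minimal integration of 3.1 with a switching delay might look as follows. This is a sketch: the Euler step size, the constant initial history, and the noise hook are our choices (the paper added 5% noise at each integration step; we default to none).

```python
import numpy as np

def mackey_glass(delays, n_samples=100, tau=6.0, dt=0.1, noise=0.0, seed=0):
    """Euler integration of dx/dt = -0.1 x(t) + 0.2 x(t - t_d)/(1 + x(t - t_d)^10),
    switching the delay t_d between segments of n_samples points sampled
    every tau time units."""
    rng = np.random.default_rng(seed)
    per_sample = int(round(tau / dt))
    x = [0.9] * (int(round(max(delays) / dt)) + 1)   # constant initial history
    out = []
    for t_d in delays:
        lag = int(round(t_d / dt))
        for _ in range(n_samples):
            for _ in range(per_sample):
                xd = x[-1 - lag]                      # delayed value x(t - t_d)
                dx = -0.1 * x[-1] + 0.2 * xd / (1.0 + xd ** 10)
                x.append(x[-1] + dt * dx + noise * rng.normal() * dt)
            out.append(x[-1])
    return np.array(out)

# t_d = 17 for the first and last segments, 23 and 30 in between (cf. the text)
series = mackey_glass([17, 23, 30, 17])
```

Time delay embedding of `series` with m = 6 then yields the vector-valued training set for the predictors.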
During training, two phase transitions occurred (Fig. 5b), indicating that the system detected the different dynamic systems. The second transition (at T ≈ 0.0007) becomes more prominent when simpler networks are used. However, this leads to suboptimal prediction results and was therefore not applied. The removal of three nets at T ≈ 0 did not increase the error significantly (Fig. 5c), which correctly indicates that three predictors completely describe the source. Segmentation, finally, was perfect (Fig. 5d). The performance (convergence speed, segmentation accuracy) of our approach with the high-dimensional Mackey-Glass data was even
K. Pawelzik, J. Kohlmorgen, K.-R. Müller
Figure 3: (a) Training and test error during the annealing process both indicate phase transitions. (b) The maps learned by the RBFNs at the end of the process. Four nets have specialized on each of the given dynamics, while two nets dropped off and finally did not contribute to the segmentation and the overall error E.
better than for the one-dimensional maps, which indicates that in higher dimensions segmentation and identification can be easier, possibly because of a weaker overlap of manifolds in higher dimensions.

4 Prediction
The assumption of stationarity is problematic in many cases of data analysis. Our approach provides a diagnostic tool as well as a good predictive
Segmentation of Switching Dynamics
Figure 4: Shown are the maps that have been learned by the predictors, (a) before the first and (b)-(d) after each of three phase transitions. The final result, after training has reached hard competition, is shown in Figure 3b.

solution for problems where nonstationarities are present due to random jumps of system parameters. In this section we demonstrate the relevance of our approach for the prediction of time series. Yet, we would like to stress that although we obtain a very good prediction within two switchings, we do not solve the problem of predicting the next point in time at which the system will most probably switch its state. For this, the statistics of switching would have to be included in the model, but this obviously would require a much larger amount of data. We applied our method to the prediction of data set D from the Santa Fe Time Series Competition (Weigend and Gershenfeld 1994). This scalar
data set was generated from a nine-dimensional periodically driven dissipative dynamic system with an asymmetrical four-well potential and a drift on the parameters. We used 6 RBF predictors that predict a data point using 20 preceding points, i.e., the embedding dimension was m = 20. The training set was restricted to the last 2000 points of data set D to keep the computation time tolerable. After training was finished, the prediction of the training data was shared among the predictors. The prediction of the continuation of data set D was done simply by iterating the particular predictor that was responsible for the generation of the latest training data. This predicted continuation was then compared to the true continuation, the test set, which was originally unknown to the participants of the competition. Our method was quite useful for
Figure 5: (a) A noisy Mackey-Glass time series that includes three different dynamics was used for the segmentation task. (b) Adiabatic evolution of the training error for the Mackey-Glass data.

up to 50 time steps (see Fig. 6a). After 50 steps, the system presumably performs a switch to another part of its potential, which by construction cannot be foreseen by our approach, since the switching statistics have not been taken into account. Nevertheless, we tested the ability of this method to predict other parts of the test set with the other predictors and also found good performance up to about 50 time steps (Fig. 6b). Again, we found that the prediction fails when the system apparently jumps into a different state. Although the underlying system in this case was almost stationary, these results demonstrate that divide and conquer is a useful strategy here, because of the high dimensionality of the system and the
Figure 5: (c) Increase of the prediction error when successively removing predictors (after training). Although no further training has been performed, up to three predictors can be removed without a significant increase of error. The three remaining predictors specialized on each of the dynamics present in the data, as indicated by the p_t^k's for each net, shown in (d).
complex form of the potential. A quantitative comparison with the winners of the Santa Fe Competition, Zhang and Hutchinson (Weigend and Gershenfeld 1994, pp. 219-241), demonstrates the power of our method. These authors applied a stationary approach that used 100 hr of training time on a Connection Machine CM-2 with 8192 processors and achieved a prediction error of 0.0665 (RMSE, root mean squared error), which they computed only for the first 25-step predictions, because their prediction broke down after that. Even if we compare our prediction only for this
Figure 6: Prediction (solid line) of the continuation of data set D (dashed line) using the competing predictors approach. The predictors decompose the dynamics of the time series into simpler prediction tasks, so that each predictor is able to predict certain segments of the data [as shown in (a) and (b)]. The accuracy for the first 25-step predictions is 10% better than the result of Zhang and Hutchinson, the winners of the Santa Fe Competition in 1991.

short episode, we find an RMSE of 0.0596, that is, 10% better, and training took just 2.5 hr on a SUN 10/20GX. Another well-known example of nonstationary dynamics in the real world is speech. Recently we also applied our method to predicting the dynamics of plain A/D-converted speech data. We find that the experts trained only on a single sentence already reliably segment this signal, so that the unsupervised segmentation according to the dynamics might be
used for word recognition. However, we did not observe a clear relation of the segmentation to the phonemes, and we suspect that this requires a more careful choice of the experts, e.g., according to models of the vocal tract (for details we refer to Müller et al. 1995).
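The closed-loop prediction used above for the continuation of data set D amounts to iterating one trained predictor on its own outputs. A minimal sketch, assuming an embedding dimension m = 20 as in the text; the linear stand-in predictor `pred` and the function names are purely illustrative (the paper uses trained RBF networks).

```python
import numpy as np

def iterate_predictor(predict, history, n_steps, m=20):
    """Closed-loop (iterated) prediction: each predicted value is fed
    back as input for the next step, as done for data set D."""
    window = list(history[-m:])
    out = []
    for _ in range(n_steps):
        y = predict(np.asarray(window))
        out.append(y)
        window = window[1:] + [y]   # slide the input window forward
    return np.array(out)

# illustrative stand-in for a trained RBF predictor: a fixed linear map
rng = np.random.default_rng(1)
w = rng.normal(scale=0.05, size=20)
pred = lambda x: float(w @ x)
cont = iterate_predictor(pred, rng.normal(size=100), n_steps=50)
```

In the experiments above, the predictor responsible for the latest training data is the one iterated forward; other predictors can be iterated in the same way to cover other segments of the test set.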
5 Summary and Outlook
We presented a new approach for the analysis of time series. It applies to systems where nonstationarities are caused by switching dynamics. The two salient ingredients of our method are memory derived from a low switching rate (used in the mixing coefficients p_t^k) and an adiabatic enforcement of the competition during learning. We illustrated the performance of our approach with time series from alternating chaotic systems. In particular, we demonstrated that our approach is able to resolve ambiguities, which are present in the general case of overlapping input-output relations, with only very few assumptions about the systems generating the data, thereby leading to an unsupervised segmentation. The approach does not estimate the switching process itself, but serves as an analysis tool for the dynamics between switching events. The method is very robust, since it does not require any statistics of switching events. Also note that our ansatz can nevertheless be used to obtain a model for the switching dynamics, once a valid segmentation is found. We should also point out here that the assumption of a low switching rate is essential to get the desired segmentation, at least when overlapping input domains are considered. This is due to the fact that for a given data stream a variety of switching dynamic systems are conceivable as its origin. In our framework, the choice of models is a priori limited by the number of predictors, and the predictors we use allow only for relatively simple and smooth mappings. Nevertheless, it is still possible to fit the data in various ways, and the training process is likely to select a wrong model and to get stuck in local minima of the error function. Constraining the training process to find only those models with a relatively low switching rate solves this problem (of course, only in those cases where the dynamics does indeed switch at low rates). We do this by imposing a low-pass filter on the errors.
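This error filtering and the annealed competition can be sketched as follows. The function is a minimal illustration of the idea, not the paper's exact implementation: it assumes prediction errors stacked as a (T, K) array, a moving-average window of Δ = 3 as in the experiments, and a softmax in the inverse temperature β; the precise weighting is given by the paper's equation 2.7, which is not reproduced here.

```python
import numpy as np

def mixing_coefficients(errors, beta, memory=3):
    """Annealed soft competition with memory: squared prediction errors
    are low-pass filtered by a moving average over the last `memory`
    steps before a softmax with inverse temperature beta.  The first
    memory-1 rows are only partially averaged (edge effect)."""
    e2 = np.asarray(errors) ** 2                  # (T, K) squared errors
    kernel = np.ones(memory) / memory
    smoothed = np.array([np.convolve(e2[:, k], kernel)[:len(e2)]
                         for k in range(e2.shape[1])]).T
    z = -beta * smoothed
    z -= z.max(axis=1, keepdims=True)             # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)       # each row sums to one

# demo: predictor 0 has the smallest errors, so it should win the competition
errs = np.abs(np.random.default_rng(0).normal(size=(50, 3))) * [0.1, 1.0, 1.0]
p = mixing_coefficients(errs, beta=5.0)
```

Annealing then amounts to recomputing these coefficients with a gradually increasing β until the competition becomes hard.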
With this additional constraint, the correct segmentation, together with appropriate models for the underlying sources, is likely to be obtained. This became evident when we compared our approach with the mixtures of controllers architecture (Cacciatore and Nowlan 1994). In this extension of the mixtures of experts (Jacobs et al. 1991), a Markov assumption about the switching characteristics is made in advance. A gating network is to learn the operating mode together with the dynamics of the sources. We tested the mixtures of controllers on the data presented in this paper. We found that this architecture only occasionally came up with the correct segmentation. In most cases it failed to converge to the
correct model and ended up in a heavily switching solution. In contrast to that, our method always yielded the correct result. Another problem arises for the mixture of controllers approach when overlapping input domains are considered. Then input sequences appear that do not allow for a unique determination of the source. For totally overlapping input domains, as in our example, this is always the case. Since the gating network is triggered only by the input data, it receives no information about the operation mode and hence cannot produce a reasonable segmentation. Taking into account that time series of switching dynamics with more or less overlapping input domains are the really challenging tasks,² this appears to be a considerable disadvantage. Two applications demonstrate the power of our approach: prediction of time series and segmentation of speech data (presented in Müller et al. 1995). When our approach is used to predict complex dynamics, the prediction quality can be improved significantly due to the divide-and-conquer strategy inherent in the ensemble of experts. In particular, we can significantly improve the results of the Santa Fe Prediction Competition (Weigend and Gershenfeld 1994) on data set D, which shows that this time series can efficiently be described as a switching dynamics. Our future work will be dedicated to the application of this method to forecasting problems and to the classification of continuously spoken words. A further interest is to estimate the dynamics of the switchings in order to predict not only the interswitch dynamics but also the dynamic changes themselves.
Acknowledgments
K.P. acknowledges support of the DFG (Grant Pa 569/1-1) and K.-R. M. acknowledges financial support by the European Communities S & T fellowship under contract FTJ3-004.

References

Bengio, Y., and Frasconi, P. 1994. Credit assignment through time: Alternatives to backpropagation. NIPS 93. Morgan Kaufmann, San Mateo, CA.
Cacciatore, T. W., and Nowlan, S. J. 1994. Mixtures of controllers for jump linear and non-linear plants. NIPS 93. Morgan Kaufmann, San Mateo, CA.
Casdagli, M. 1989. Nonlinear prediction of chaotic time series. Physica D 35, 335-356.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991. Adaptive mixtures of local experts. Neural Comp. 3, 79-87.

²In one-dimensional time series where the modes of the dynamics produce data in distinct domains, segmentation can be done by merely looking at the data.
Kaneko, K. 1989. Chaotic but regular posi-nega switch among coded attractors by cluster-size variation. Phys. Rev. Lett. 63, 219.
Kohlmorgen, J., Müller, K.-R., and Pawelzik, K. 1994. Competing predictors segment and identify switching dynamics. In Proceedings of the International Conference on Artificial Neural Networks, ICANN 94, pp. 1045 ff. Springer, London.
Liebert, W., Pawelzik, K., and Schuster, H. G. 1991. Optimal embeddings of chaotic attractors from topological considerations. Europhys. Lett. 14, 521.
Mackey, M., and Glass, L. 1977. Oscillation and chaos in a physiological control system. Science 197, 287.
McLachlan, G. J., and Basford, K. E. 1988. Mixture Models. Marcel Dekker, New York.
Moody, J., and Darken, C. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Müller, K.-R., Kohlmorgen, J., and Pawelzik, K. 1994. Segmentation and identification of switching dynamics with competing neural networks. ICONIP '94: Proc. Int. Conf. Neural Inform. Processing, Seoul, pp. 213-218.
Müller, K.-R., Kohlmorgen, J., and Pawelzik, K. 1995. Analysis of switching dynamics with competing neural networks. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E78-A, No. 10.
Pawelzik, K. 1994. Detecting coherence in neuronal data. In Physics of Neural Networks, E. Domany, L. van Hemmen, and K. Schulten, eds. Springer, Berlin.
Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257-286.
Rose, K., Gurewitz, E., and Fox, G. C. 1990. Statistical mechanics and phase transitions in clustering. Phys. Rev. Lett. 65, 945-948.
Shamma, J. S., and Athans, M. 1992. Gain scheduling: Potential hazards and possible remedies. IEEE Control Syst. Mag. 12(3), 101-107.
Takens, F. 1981. Detecting strange attractors in turbulence. In Dynamical Systems and Turbulence, D. Rand and L.-S. Young, eds., Vol. 898, p. 366. Springer Lecture Notes in Mathematics.
Tong, H., and Lim, K. S. 1980. Threshold autoregression, limit cycles and cyclical data. J. R. Stat. Soc. B 42, 245-268.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. 1989. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoustics, Speech and Signal Processing 37, 328-339.
Weigend, A. S., and Gershenfeld, N. A., eds. 1994. Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, Reading, MA.
Yuille, A. L., Stolorz, P., and Utans, J. 1994. Statistical physics, mixtures of distributions, and the EM algorithm. Neural Comp. 6, 334-340.
Received September 22, 1994, accepted June 13, 1995
Communicated by John Hertz
A Recurrent Network Implementation of Time Series Classification

Vassilios Petridis
Athanasios Kehagias
Division of Electronics and Computer Engineering, Department of Electrical Engineering, Aristotle University of Thessaloniki, 540 06 Thessaloniki, Greece
An incremental credit assignment (ICRA) scheme is introduced and applied to time series classification. It has been inspired by Bayes' rule, but the Bayesian connection is not necessary either for its development or for proof of its convergence properties. The ICRA scheme is implemented by a recurrent, hierarchical, modular neural network, which consists of a bank of predictive modules at the lower level and a decision module at the higher level. For each predictive module, a credit function is computed; the module that best predicts the observed time series behavior receives highest credit. We prove that the credit functions converge (with probability one) to correct values. Simulation results are also presented.

1 Introduction
Consider the following problem of time series classification. A time series y_t, t = 1, 2, ... is produced by a source S(θ_k), where θ_k is a parameter taking values in a finite set Θ = {θ_1, ..., θ_K}, and the "true" or "best" value of θ_k is sought. This problem appears in many practical applications, e.g., speech recognition (Rabiner and Schafer 1988) and enzyme classification (Papanicolaou and Medeiros 1990). An extensive list of applications can be found in Hertz et al. (1991). In this paper we present an incremental credit assignment (ICRA) scheme that assigns credit to each source according to its predictive power. This approach yields a hierarchical architecture with a prediction level at the bottom and a decision level at the top. We present a recurrent, hierarchical, modular neural network implementation of this architecture. A bank of local prediction modules are trained, each on data from a particular source S(θ_k). The prediction modules can be implemented by several different kinds of feedforward neural networks: sigmoid, linear, gaussian, polynomial, etc. The decision module is implemented by a recurrent gaussian network that combines the outputs of the prediction modules. The overall structure of the network is presented in Figure 1. We prove that the

Neural Computation 8, 357-372 (1996) © 1996 Massachusetts Institute of Technology
Vassilios Petridis and Athanasios Kehagias
Figure 1: The network architecture. Summation neurons are denoted by Σ, gaussian neurons are denoted by G, and identity neurons are denoted by I. The symbol denotes weights determined by q_t^k. The block denoted DECISION MODULE implements equation 3.4.
credit functions converge with probability one to correct values, namely, to one for the module with maximum predictive power and to zero for the remaining modules. Moreover, ICRA has an easy neural network implementation (using only adders and multipliers). ICRA has been inspired by classification based on the Bayesian posterior probabilities of the candidate sources, but the Bayesian connection is not necessary for developing ICRA or for proving its convergence properties. The idea of combining local models into a large modular network has recently become very popular. It is used for prediction as well as for classification of both static and dynamic (time series) patterns. Early examples of this idea are, for example, Farmer and Sidorowich (1988) and Moody (1989), where a time series prediction problem is solved by partitioning the input space into a number of regions and training a local predictor for each region; in every instance, the local predictor used is explicitly determined by past input values, hence it is not necessary to assign credit to each predictor. A later development is the combination of local models using a weighted sum; the weights can be interpreted as conditional probabilities or as credit functions. This is the approach taken in Jacobs et al. (1991), Jordan and Jacobs (1992), Neal (1991), and Nowlan (1990), where the terms local experts and probability mixtures are
Time Series Classification
used; the term committees appears in Schwarze and Hertz (1992), the term neural ensembles in Perrone and Cooper (1993), and so on. Our point of view is similar to that of the above papers, insofar as we also use local models (predictors) and credit functions. However, ICRA is a recursive scheme for online credit assignment, so that classification at a given time depends on past classifications. This is particularly appropriate for classification of dynamic patterns, such as time series, where the history of the signal must be taken into account. In contrast, the above-mentioned papers use offline credit assignment and are applied either to static problems or "staticized" dynamic problems, where preprocessing is used to transform a time-evolving signal into a static feature vector (FFT or LPC coefficients, etc.). However, static feature vectors may not capture all the dynamic properties of a time series, especially in the case of source switching. On the other hand, while our method assumes that the classes to be used are given in advance, several of the above-mentioned papers present algorithms that discover an expedient partition of the source space. In fact, there are several neural algorithms that combine local models and adaptive partitioning (Ayestaran and Prager 1993; Baxt 1992; Jordan and Jacobs 1994; Kadirkamanathan and Niranjan 1992; Schwarze and Hertz 1992; Shadafan and Niranjan 1994). However, while such algorithms perform adaptive partitioning, they do not perform, as far as we know, adaptive classification, since they do not use classification results recursively. In short, our ICRA algorithm is applicable to problems of time series classification, where past classification results must be used for future classification, and classes are given in advance.

2 Bayesian Time Series Classification
A random variable Z that takes values in Θ = {θ_1, ..., θ_K} is introduced. The time series y_1, y_2, ... is produced by source S(Z). For instance, if Z = θ_1, then the time series y_1, y_2, ... is produced by S(θ_1). At every time t a decision rule produces an estimate of Z, denoted Ẑ_t. For instance, if at time t we believe that the time series y_1, ..., y_t has been produced by θ_1, then Ẑ_t = θ_1. Clearly, Ẑ_t may change with time, as more observations become available. The conditional posterior probability p_t^k for k = 1, 2, ..., K, t = 1, 2, ... is defined by

p_t^k ≜ Prob(Z = θ_k | y_t, ..., y_1)    (2.1)

Also the prior probability p_0^k for k = 1, 2, ..., K is defined by

p_0^k ≜ Prob(Z = θ_k at t = 0)

p_0^k reflects our prior knowledge of the value of Z. In the absence of any prior information we can just assume all models to be equiprobable: p_0^k = 1/K for k = 1, 2, ..., K. p_t^k reflects our belief (after observing data y_1, ..., y_t) that the time series is produced by S(θ_k). We choose Ẑ_t = argmax_{θ_k∈Θ} p_t^k. In other words, at time t we consider that y_1, ..., y_t has been produced by source S(Ẑ_t), where Ẑ_t maximizes the posterior probability. So the classification problem has been reduced to computing p_t^k, t = 1, 2, ..., k = 1, 2, ..., K. This computation (Hilborn and Lainiotis 1969; Lainiotis and Plataniotis 1994) is based on Bayes' rule:

p_t^k = Prob(y_t, Z = θ_k | y_{t-1}, ..., y_1) / Σ_{j=1}^K Prob(y_t, Z = θ_j | y_{t-1}, ..., y_1)    (2.2)
Also Prob(y,.Z=Ha I!/f-.1.....!/i)=Prob(y, l y f - i . . . y l . Z = & ) p : k- , Now equations 2.2 and 2.3 imply the following recursion for k t = 0.1.2.. . .:
=
(2.3)
1.2.. . .
K,
and we need only (for each t and k ) to compute Prob(yt 1 1y-1.. . . . yl. Z = H k ) . This probability depends on the form of the predictor; the predictors have a general parametric form f( .; Q i ) , k = 1. . . . .K: k
yt =fiyf-l.....Yt-N;HA)
(2.5)
Typically, f (,;H k ) would be a feedforward (linear, sigmoid, gaussian, polynomial) neural network trained on data from source S ( H k ) . This predictor approximates y, zcdien the tivie series is prodliceif by S ( H k ) . For k = 1.2. . . . . K the prediction error d , k = 1 . . . . . K , t = 1 . 2 . . . . is defined by eA f
2~
y:
i.
-
!/,
(2.6)
It is nssunirif that 4 is a white, gaussian noise process, with conditional probability of the form
It then follows immediately from equations 2.5, 2.6, and 2.7 that
The probability assumption of equation 2.7 is arbitrary, but works well in practice, as will be seen in Section 5. The parameter 0: is the variance and C(0,) is a normalizing constant. Extensions for vector valued y, and c$ are obvious. The posterior probability p: of source HA), k = 1.2. . . . . K, for time f = I . 2. can be computed by means of the above equations. At time t the time series is classified to the source that maximizes the
posterior probability:

Ẑ_t = argmax_{θ_k∈Θ} p_t^k    (2.9)

The recursion for p_t^k is obtained from equations 2.1, 2.4, 2.5, 2.8, and 2.9.

3 Incremental Credit Assignment Scheme
In this section we introduce an incremental credit assignment (ICRA) scheme to be used for time series classification. ICRA is motivated by the Bayesian scheme, but it is simpler in implementation, requiring only adders and multipliers. In addition, ICRA classifies as well as, and sometimes better than, the Bayesian scheme, as will become obvious in Section 5. Finally, ICRA has desirable convergence properties that can be mathematically proved. Hence ICRA is an attractive alternative to Bayesian classification. To develop ICRA, start by defining

g(e_t^k) ≜ exp(−(e_t^k)² / σ²)    (3.1)

Now consider the following difference equation

q_t^k = q_{t-1}^k + γ [ g(e_t^k) p_{t-1}^k / Σ_{j=1}^K g(e_t^j) p_{t-1}^j − q_{t-1}^k ]    (3.2)

with initial condition (k = 1, 2, ..., K)

q_0^k = p_0^k, so that Σ_{k=1}^K q_0^k = 1    (3.3)

It is clear that if the q_t^k's converge, in equilibrium (q_t^k = q_{t-1}^k) we will have q_t^k ≈ g(e_t^k) p_{t-1}^k / Σ_{j=1}^K g(e_t^j) p_{t-1}^j. Since the p_{t-1}^k's in equation 3.2 are unknown, let us replace them by the q_{t-1}^k's. After some rewriting, equation 3.2 becomes

q_t^k = (1 − γ) q_{t-1}^k + γ g(e_t^k) q_{t-1}^k / Σ_{j=1}^K g(e_t^j) q_{t-1}^j    (3.4)

Equation 3.4 is the important part of the ICRA scheme. Even though we have started with a Bayesian point of view, this can now be abandoned. We consider the q_t^k to be credit functions: the higher q_t^k gets, the more likely S(θ_k) is to be the "true" source. From equation 3.4 we see that the credit functions q_t^k are updated in an incremental manner, similar to a steepest descent procedure. At time t the time series is classified to source S(Ẑ_t), where

Ẑ_t = argmax_{θ_k∈Θ} q_t^k    (3.5)
Of course the use of equation 3.5 requires some justification; namely, we must prove that if the "true" or "best" source is S(θ_m), then q_t^m is greater than q_t^k, k ≠ m. This justification will be provided in the next section. Namely, we will prove that the q_t^k's as given by equation 3.4 are convergent; in particular, the q_t^m associated with the source S(θ_m) of highest predictive power converges to one, while all other q_t^k's converge to zero. Therefore the credit functions q_t^k can be used for classification. In summary, the ICRA scheme is based on equations 3.3, 2.5, 2.6, 3.1, 3.4, and 3.5, which can be implemented by the recurrent, hierarchical, modular network of Figure 1. The bottom, prediction level of the hierarchy consists of a bank of predictive modules, each one implementing a predictor of the form of equation 2.5 for a specific value θ_k. Typically these modules are feedforward neural networks (sigmoid, linear, gaussian, etc.). The top, decision level of the hierarchy consists of a module that implements equation 3.1; this module can be built from gaussian neurons. At this point we should emphasize that within this context the gaussian form g(e_t^k) ceases to be an assumption about the statistical properties of the error and becomes a matter of design regarding the credit assignment scheme. Also, we emphasize that ICRA can be implemented using only adders and multipliers; hence implementation is simpler than that of the Bayesian scheme. Finally, it should be mentioned that implementation of the ICRA scheme requires computation of equation 3.4 for k = 1, 2, ..., K, which obviously scales linearly with K, the number of classes. Hence, time requirements of ICRA are O(K): to handle 100 classes takes only 10 times more than to handle 10 classes if the algorithm is implemented serially. It should also be noted that equation 3.4 is fully parallelizable (see also Fig. 1), resulting in O(1) (constant) execution time for parallel implementation. Memory requirements are also O(K), since only the current q_t^k's need to be retained at every time step.¹
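For concreteness, a single update of the credit functions q_t^k (equation 3.4) and the classification rule (equation 3.5) can be sketched as follows; the step size gamma, the kernel width sigma, and the error magnitudes in the demo are illustrative assumptions, not values from the paper.

```python
import numpy as np

def icra_step(q, errors, gamma=0.1, sigma=1.0):
    """One ICRA credit update (equation 3.4):
    q_k <- (1 - gamma) * q_k + gamma * q_k g(e_k) / sum_j q_j g(e_j),
    with the gaussian kernel g(e) = exp(-e**2 / sigma**2).
    The credits stay nonnegative and keep summing to one (Lemma 1)."""
    q = np.asarray(q, dtype=float)
    g = np.exp(-(np.asarray(errors) ** 2) / sigma ** 2)
    return (1.0 - gamma) * q + gamma * (q * g) / np.sum(q * g)

# demo: predictor 0 has consistently smaller errors, so its credit
# should come to dominate; the error scales here are illustrative
q = np.full(3, 1.0 / 3.0)                 # equiprobable initial credits
rng = np.random.default_rng(0)
for _ in range(200):
    e = np.array([0.1, 0.8, 0.9]) * rng.standard_normal(3)
    q = icra_step(q, e)
best = int(np.argmax(q))                  # classification rule (equation 3.5)
```

Note that the update uses only additions and multiplications (plus the precomputed kernel values), in line with the implementation remarks above.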
4 Convergence
We will now show that equation 3.4 has the following property: if θ_m is the "best" value of θ [i.e., source S(θ_m) best predicts the observed data], then q_t^m converges to 1 and q_t^k converges to 0 for k ≠ m. We start with the following lemma.

Lemma 1. For t = 0, 1, 2, ..., Σ_{k=1}^K q_t^k = 1.

Proof. The proof is by induction. Supposing Σ_{k=1}^K q_{t-1}^k = 1, it will be shown that Σ_{k=1}^K q_t^k = 1 as well. Summing equation 3.4 over k (and using

¹The same time and memory requirements hold for the Bayesian classifier of Section 2.
the induction hypothesis) we obtain

Σ_{k=1}^K q_s^k = (1 − γ) Σ_{k=1}^K q_{s-1}^k + γ Σ_{k=1}^K q_{s-1}^k g(e_s^k) / Σ_{j=1}^K q_{s-1}^j g(e_s^j) = (1 − γ) + γ = 1    (4.1)

Since the proposition is true for t = 0, applying equation 4.1 repeatedly for s = 1, 2, ... proves the Lemma.
Now we can state and prove the following convergence theorem.

Theorem 1. Define a_k = E[g(e_t^k)], k = 1, ..., K. Suppose a_m is the unique maximum of a_1, ..., a_K. If q_0^m > 0, then q_t^m → 1 and q_t^k → 0 for k ≠ m, with probability 1.

Remarks. First, note that g(e_t^k) is a random variable, since it is a function of the error e_t^k. Assuming e_t^k to be stationary, a_k = E[g(e_t^k)], i.e., the expectation of g(e_t^k), is time independent. Since g(e) is a decreasing function of |e|, a large value of a_k implies good predictive performance. In this sense, a_k can be viewed as a prediction quality index and it is natural to consider as optimal the predictor m that has maximum a_m. In the course of the proof it will become clear that any function g(|e|) could be used as long as g(|e|) is a decreasing function of |e|. The theorem can be generalized to the case where there is more than one predictor that achieves maximum a_m; then the total credit of all such predictors will converge to 1. The proof for that case is similar to the one presented here, and is omitted for economy of space. Finally, note that the credit functions q_t^k are random variables, as they depend on y_1, y_2, ..., y_t. Hence, the q_t^k converge in a stochastic sense, in this case with probability one.

Proof. For t = 0, 1, 2, ..., define F_t to be the sigma field generated by q_0^k and {e_s^k}_{s=0}^t, k = 1, ..., K. Define F_∞ = ∪_{t=1}^∞ F_t. Now, q_t^k is F_t measurable, for all k, t.² This is so because q_t^k is a function of e_t^1, ..., e_t^K and of q_{t-1}^1, ..., q_{t-1}^K. But q_{t-1}^1, ..., q_{t-1}^K are, in turn, functions of e_{t-1}^1, ..., e_{t-1}^K and of q_{t-2}^1, ..., q_{t-2}^K, and so on. In short, q_t^k is a function of e_0^1, ..., e_0^K, ..., e_t^1, ..., e_t^K. Hence it is clearly F_t measurable. Also, for k = 1, 2, ..., K, t = 0, 1, 2, ..., define π_t^k = E(q_t^k). In equation 3.4 let k = m and take conditional expectations with respect to F_{t-1}. For all k and t we have E(q_{t-1}^k | F_{t-1}) = q_{t-1}^k and E[g(e_t^k) | F_{t-1}] = E[g(e_t^k)] = a_k.
In other words, g(e_t^k) is independent of F_{t-1}. This is so because we assumed the noise process to be
²A sigma field F generated by random variables u_1, u_2, ... is defined to be the set of all sets of events dependent only on u_1, u_2, .... A random variable v is said to be F measurable if knowledge of u_1, u_2, ... completely determines v; in other words, either v is one of u_1, u_2, ... or it is a function of them: v(u_1, u_2, ...). Note that the total number of u_1, u_2, ... may be finite, countably infinite, or even uncountably infinite. For more details see Billingsley (1986).
Vassilios Petridis and Athanasios Kehagias
white; hence e_t^k is independent of F_{t-1}. From Lemma 1, Σ_{k=1}^K q_{t-1}^k = 1; hence we obtain equation 4.1. Applying 4.1 repeatedly for s = 1, ..., t − 1, we finally obtain equation 4.2.
From equation 4.2 it follows that {q_t^m}, t ≥ 0, is a submartingale. Since 0 ≤ E(|q_t^m|) ≤ 1, we can use the Submartingale Convergence Theorem and conclude that, with probability 1, the sequence {q_t^m} converges to some random variable, call it q^m, where q^m is F_∞ measurable. We have assumed that q_0^m > 0; from this and equation 3.4 it follows that for all t we have q_t^m > 0. From this it is easy to prove that the limit q^m > 0. Hence convergence of q_t^m does not depend on the initial values q_0^k, k = 1, 2, ..., K, as long as q_0^m is greater than zero. However, we still do not know whether the sequences {q_t^k}, k ≠ m, converge. Similarly, since q_t^m → q^m, we can take expectations and obtain E(q_t^m) → E(q^m) = π^m; but we do not know whether E(q_t^k) converges for k ≠ m. However, since Σ_{k=1}^K q_t^k = 1 for all t, we have E(Σ_{k≠m} q_t^k) = 1 − E(q_t^m) → 1 − π^m. Now, if in equation 3.4 we set k = m and take the limit as t → ∞, we obtain

(4.3)

Since q^m = lim_{t→∞} q_t^m > 0, equation 4.3 implies

(4.4)

The important point is that the quantity in curly brackets has a limit. Since q^m > 0, it can be canceled on both sides of equation 4.4; then we get
(taking expectations and using the Dominated Convergence Theorem³)

³The Dominated Convergence Theorem states that, under appropriate conditions, lim_{t→∞} E(u_t) = E(lim_{t→∞} u_t). See also Billingsley (1986).
Time Series Classification
(define a_l = max_{k≠m} a_k, and note that a_l < a_m)

lim_{t→∞} [a_m(1 − π_t^m)] ≤ a_l lim_{t→∞} (1 − π_t^m), i.e.,

a_m(1 − π^m) ≤ a_l(1 − π^m)    (4.5)
From equation 4.5 it follows immediately that π^m = 1; otherwise we could cancel 1 − π^m from both sides of equation 4.5 and obtain a_m ≤ a_l, which is a contradiction. Hence 1 = π^m = lim_{t→∞} E(q_t^m) = E(lim_{t→∞} q_t^m). Since lim_{t→∞} q_t^m ≤ 1, we must have lim_{t→∞} q_t^m = 1 with probability 1; it follows that lim_{t→∞} q_t^j = 0 for j ≠ m, which completes the proof.

5 Examples

5.1 Logistic Classification. A logistic time series is produced by the following recursion (the source parameter is a):

x_{t+1} = a x_t (1 − x_t),    t = 1, 2, ....
In the first set of experiments, a test time series has been generated by running a logistic with a = 3.8 for 182 time steps and then switching a to 3.6 and running the logistic for another 182 steps. Zero-mean white noise, uniformly distributed in the interval (−Δ/2, Δ/2], has been added to the data. We have used Δ = 0.00, 0.05, ..., 0.50. We plot the time series (at noise level Δ = 0.2) in Figure 2. The task is to detect the active value of a. We use our ICRA scheme and compare it to the Bayesian scheme. In both cases we use the same type of predictor modules. Ten predictor modules (18-5-1 sigmoid feedforward neural networks) have been trained on logistics with a = 3.0, 3.1, ..., 3.9, respectively. Average predictor training time was 2.5 min on a Sun Sparc IPC workstation. The σ parameter is the same for both classifiers; we take it equal to the experimentally computed standard deviation of the predictor error. For all prediction modules this is approximately equal to 0.25, so we have σ_1 = ... = σ_10 = 0.25. A probability threshold parameter h = 0.01 is also used. For the ICRA method we also use γ = 0.99. Different values of γ do not affect classification performance, as long as they are not too low. In general, small values of σ and large values of γ result in faster update of the p_t^k and q_t^k (see equation 3.4), hence in faster response of the algorithm. Finally, it should be mentioned that the choice of p_0^k, q_0^k does not affect convergence, as remarked in the previous section. This conclusion was supported by our experiments: while we tried several values for p_0^k, q_0^k, classification performance was not affected. In the experiments reported here we have always used p_0^k = 1/K, q_0^k = 1/K.
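As an illustration, the test series described above can be generated as follows. This is a sketch: the initial condition x_0 and the random seed are our choices (the text does not specify them), while the segment lengths and the uniform noise on (−Δ/2, Δ/2] follow the text.

```python
import numpy as np

def logistic_series(a_values, steps_per_segment, x0=0.3, noise_level=0.0, seed=0):
    """Generate a logistic time series x_{t+1} = a * x_t * (1 - x_t),
    switching the source parameter `a` between segments, with additive
    zero-mean uniform noise of width `noise_level` (the Delta of the text).
    NumPy's uniform draws from a half-open interval [low, high), which
    differs negligibly from the (-Delta/2, Delta/2] of the text."""
    rng = np.random.default_rng(seed)
    x = x0
    clean = []
    for a in a_values:
        for _ in range(steps_per_segment):
            x = a * x * (1 - x)
            clean.append(x)
    clean = np.array(clean)
    noise = rng.uniform(-noise_level / 2, noise_level / 2, size=clean.shape)
    return clean + noise

# Test series as in the text: a = 3.8 for 182 steps, then a = 3.6 for
# 182 steps, at noise level Delta = 0.2.
series = logistic_series([3.8, 3.6], 182, noise_level=0.2)
```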
Figure 2: Plots of the logistic time series: for t = 1, 2, ..., 182 we have a = 3.8; for t = 183, ..., 364 we have a = 3.6. Noise level is Δ = 0.2.

In Figure 3 the evolution of the q_t^k is plotted for a typical experiment. Classification to the true logistic takes very few time steps: at t = 2, q_t^9 > q_t^k, k ≠ 9, and at t = 8 it reaches its steady-state value; then at t = 183 we have the a transition, and by time t = 189 we have q_t^7 > q_t^k, k ≠ 7; at t = 194, q_t^7 has reached steady state (the whole transition takes 12 time steps). Location and width of the transition points of this experiment are typical; all the classification experiments we have run gave similar results. It should be emphasized that no training is required for the decision module; its online operation only requires computation of equations 2.5 and 3.4 for all 10 predictors (k = 1, ..., 10). Classification of each time step requires 0.08 sec on a Sun Sparc IPC workstation. Classification performance is measured by determining the number of time steps for which a is correctly identified and dividing by 364, the total number of time steps. Thus we obtain two figures of merit: one for the Bayesian and one for the ICRA method. The results, for various noise levels Δ, are summarized in Figure 4. We see that in the noise-free case both schemes perform very well, the Bayesian scheme slightly outperforming the ICRA scheme. However, the ICRA scheme is more robust to higher noise levels. In the second set of experiments we want to evaluate classification performance when the actual a parameter is not in our search set. To this end we train 10 linear predictors on the values a = 3.0, 3.1, ..., 3.9. Training time per predictor was slightly over 1 sec on a Sun Sparc IPC workstation. Then we generate five 364-step test logistics with an a transition at step 182. The a transitions are 3.7 − δa to 3.9 + δa, where δa takes the values
Figure 3: Logistic classification for ten sources (a = 3.0, 3.1, 3.2, ..., 3.9), t = 1, 2, ..., 364. The solid line corresponds to q_t^9 (a = 3.8) and the dotted line corresponds to q_t^7 (a = 3.6). For k ≠ 7, 9 the q_t^k go to zero very rapidly and are not discernible in the figure.
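The kind of credit-function evolution shown in Figure 3 can be sketched with a competitive update of the following flavor. This is our illustration, not necessarily the exact form of equation 3.4: the gaussian g(e) with σ = 0.25 and the floor h = 0.01 follow the text, while the multiplicative reweighting and renormalization are assumptions of this sketch.

```python
import numpy as np

def gaussian_credit(errors, sigma=0.25):
    """g(e) = exp(-e^2 / (2 sigma^2)): a decreasing function of |e|,
    as required by the convergence theorem."""
    e = np.asarray(errors, dtype=float)
    return np.exp(-e**2 / (2.0 * sigma**2))

def update_credits(q_prev, errors, sigma=0.25, floor=0.01):
    """One competitive credit-update step (illustrative, not the paper's
    exact equation 3.4): reweight each predictor's credit by g(e) and
    renormalize so the credits sum to 1.  `floor` plays the role of the
    probability threshold h, keeping every credit away from zero."""
    g = gaussian_credit(errors, sigma)
    q = np.maximum(np.asarray(q_prev, dtype=float) * g, floor)
    return q / q.sum()

# Predictor 0 keeps making the smallest error, so its credit grows
# toward 1 while the others are pinned near the floor.
q = np.full(3, 1.0 / 3.0)
for _ in range(20):
    q = update_credits(q, errors=[0.05, 0.30, 0.40])
```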
Figure 4: Figures of merit for logistic classification at various noise levels. Δ denotes the noise level. Here the ICRA figure of merit (respectively Bayesian figure of merit) denotes the fraction of correctly classified time steps (out of a total of 364) by the ICRA (respectively Bayesian) scheme. We observe that while in the noise-free case the Bayesian scheme performs slightly better than the ICRA scheme, ICRA is more robust to noise. (In all experiments we use h = 0.01, σ = 0.25, γ = 0.99.)
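The figure of merit used in Figures 4 and 5 (fraction of correctly classified time steps) can be computed from the credit trajectories as follows. The array names are ours, and classification at each step is taken to be the index of the largest credit.

```python
import numpy as np

def figure_of_merit(credits, true_labels):
    """credits: array of shape (T, K) with the credit functions q_t^k;
    true_labels: length-T sequence with the index of the active source.
    Returns the fraction of time steps at which the classifier
    (argmax over k of q_t^k) identifies the active source correctly."""
    decisions = np.argmax(credits, axis=1)
    return float(np.mean(decisions == np.asarray(true_labels)))

# Example: 4 steps, 3 sources; the classifier is right on 3 of 4 steps.
q = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.3, 0.3, 0.4],
              [0.1, 0.1, 0.8]])
fom = figure_of_merit(q, [0, 1, 1, 2])  # -> 0.75
```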
Figure 5: Logistic classification for a outside the search set. The ICRA figure of merit (respectively Bayesian figure of merit) denotes the fraction of correctly classified time steps (out of a total of 364) by the ICRA (respectively Bayesian) scheme. This is plotted against the difference δa. We observe that when the actual a values are in the search set (δa = 0.0) the Bayesian scheme is slightly better than the ICRA scheme. However, ICRA is more robust to increased δa. (In all experiments we use h = 0.01, σ = 0.25, γ = 0.99.)
0.00, 0.01, 0.02, 0.03, 0.04. Hence δa measures the difference between the a values on which we trained our search set and the actual a value that generates the test time series. Note that for δa = 0.05 we would get a = 3.65, exactly halfway between the search set a's 3.6 and 3.7. All the other parameters of the experiments are the same as in the previous paragraph. With the exception of the first case, the true values of a are not in our search set. The results of these experiments are summarized in Figure 5. Classification at a time step is considered to be correct when the time series is classified to the value of a in the search set that is closest to the true value of a. In other words, for all five time series correct classification should be a = 3.7 for the first 182 steps and a = 3.9 for the last 182 steps. In Figure 5 we plot the classification figure of merit vs. δa. We see that the ICRA scheme performs better than the Bayesian scheme: it is more robust to parameter variations. Of course, an additional conclusion of this experiment set is that classification can be successfully performed using linear predictors. Finally, let us note that classification of each time step requires 0.04 sec on a Sun Sparc IPC workstation.

5.2 Enzyme Classification. This experiment involves classification of the β-lactamase enzymes. The data and problem are described in Papanicolaou and Medeiros (1990); here we give a short overview. β-Lactamases determine resistance to β-lactam antibiotics. Classification of β-lactamases is a problem that has received considerable attention from biomedical researchers. A classification method, presented in Papanicolaou and Medeiros (1990), uses an "inhibition" experiment. The β-lactamase enzyme causes hydrolysis of a chemical called nitrocefin, and the β-lactam slows hydrolysis down by inhibiting the action of the enzyme. In the following paragraphs we use the terms enzyme (in place of β-lactamase) and inhibitor (in place of β-lactam). For every enzyme/inhibitor pair an "inhibition profile" is obtained, which (for a given inhibitor) characterizes the enzyme. This method has a high classification success, but the following problem occurs: the properties of enzymes and inhibitors depend heavily on the conditions under which they are prepared, and this results in varying inhibition profiles for different preparations of the same enzyme/inhibitor pair. However, some dynamic properties of the profile remain invariant; in Papanicolaou and Medeiros (1990) it is reported that enzyme classification depends on the slope of the inhibition profile at various times during the experiment, as well as on the final concentration of nitrocefin. This information was used by a human operator, who classified the enzyme by combining the various characteristics of an inhibition profile. We use the ICRA and Bayesian schemes to automate the enzyme classification process. The inhibition profiles are used as input time series. Eight enzymes are classified. The data set of inhibition profiles is separated into a test set and a training set.⁴ We use two data sets, consisting of inhibition profiles for two different inhibitors and all eight enzymes. In Figure 6 we plot inhibition profiles for three enzymes from the training set and the same three enzymes from the test set. In all cases the same inhibitor has been used.
It is noted that for the same enzyme, the test profile can differ significantly from the training profile, for the reasons explained in the previous paragraph. For each enzyme a sixth-order linear predictor is trained on the corresponding inhibition profile from the training set. (These profiles are 40-min-long time series; each time step represents 0.5 min of real time.) This is the offline training phase, which takes less than 1 sec per predictor on a Sun Sparc IPC workstation. The mean square prediction error is approximately 0.05 for all profiles. Next, we choose an inhibition profile from the test set and proceed to determine the enzyme it corresponds to. Both Bayesian and ICRA schemes are used; in Figure 7 we present the q_t^k evolution for a particular enzyme inhibition profile. In this task final classification uses the values q_40^1, q_40^2, ..., q_40^8 (respectively p_40^1, p_40^2, ..., p_40^8). Classification performance of the Bayesian scheme is measured by c_p, the number of correctly classified enzymes (at time t = 40 min) divided by eight, the total number of enzymes. A similar number, c_q, is computed for the ICRA scheme. For the Bayesian

⁴We want to thank G. A. Papanicolaou for kindly permitting us to use the inhibition profile data.
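A sixth-order linear predictor of the kind trained here can be fitted offline by least squares on lagged values. The fitting method below is our assumption; the paper does not state which training algorithm was used.

```python
import numpy as np

def fit_ar_predictor(series, order=6):
    """Least-squares fit of a linear predictor
    x_t ~= w_1 x_{t-1} + ... + w_order x_{t-order}.
    Returns the weight vector w."""
    series = np.asarray(series, dtype=float)
    # Column k holds the lag-(k+1) values x_{t-k-1} for t = order ... T-1.
    X = np.column_stack([series[order - k - 1 : len(series) - k - 1]
                         for k in range(order)])
    y = series[order:]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict(series, w):
    """One-step-ahead predictions for t >= order."""
    order = len(w)
    series = np.asarray(series, dtype=float)
    X = np.column_stack([series[order - k - 1 : len(series) - k - 1]
                         for k in range(order)])
    return X @ w
```

On a noiseless second-order autoregression the fit recovers the generating weights exactly, and the mean square prediction error reported in the text can then be computed from `predict`.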
Figure 6: Enzyme inhibition profiles for enzymes 1, 2, and 3. Curves 1a, 2a, 3a are training data; 1b, 2b, 3b are test data.

scheme we find c_p = 0.875, i.e., seven out of eight enzymes were correctly classified. For the ICRA scheme we find c_q = 1.000, i.e., all eight enzymes were correctly classified. Therefore, in this experiment the ICRA scheme classifies better than the Bayesian scheme. Classification of each time step requires 0.03 sec on a Sun Sparc IPC workstation.

6 Conclusions
We have presented ICRA, an incremental credit assignment scheme for time series classification. ICRA is implemented by a recurrent, hierarchical, modular neural network that consists of a decision module and a bank of predictive modules. The decision module implements a gaussian function g(e) (where e is the prediction error), but any function g(·) can be used, as long as it is a decreasing function of |e|. The predictive modules can be sigmoid, linear, gaussian, etc., feedforward networks. In fact, because of the competitive nature of the ICRA scheme, classification depends on relative, not absolute, predictive performance, making ICRA robust to noise and prediction error. We have proven that under mild conditions, ICRA converges to the correct result, i.e., it detects the time series source that best predicts the observed data. The ICRA classifier is recursive, appropriate for online time series classification, which must be updated at every time step, taking into account past classification as well as the dynamic behavior of the time series. ICRA is modular and parallelizable, which means that offline training (of the predictor modules) as well as online operation scale linearly with the number of classes handled. No online training is necessary. Hence, to train and classify 100
Figure 7: Enzyme classification. The dotted line corresponds to the credit function q_t^k of the correct enzyme. The solid line corresponds to the overlapping plots of the q_t^k of the remaining enzymes. Classification is based on the final value of q_t^k.

logistics would take 10 times as long as to train and classify 10 logistics; in principle there is no limit to the number of classes that can be handled. Online operation time is O(K) (where K is the number of classes) for serial operation and O(1) for parallel operation, i.e., all per-step classification times reported in the previous section would be reduced by approximately 1/K if ICRA were implemented in parallel. The above paragraph summarizes the basic features of ICRA classification. These also hold for the Bayesian classifier of Section 2. However, the experiments of Section 5 indicate that ICRA classification is more accurate and robust than Bayesian classification. In addition, unlike the Bayesian classifier, the ICRA classifier can be implemented using only adders and multipliers; hence a simple and fast hardware implementation is possible. This is a further advantage over the Bayesian classification scheme, which requires a more complicated implementation. In short, the advantages listed in this and the previous paragraph make ICRA an attractive recursive method for time series classification problems, where past classification results must be used for future classification and classes are given in advance.

References

Ayestaran, H. E., and Prager, R. W. 1993. The Logical Gates Growing Network. Tech. Rep. CUED/F-INFENG/TR 137, Cambridge University Engineering Dept.
Baxt, W. G. 1992. Improving the accuracy of an artificial neural network using multiple differently trained networks. Neural Comp. 4(5), 772-780.
Billingsley, P. 1986. Probability and Measure. John Wiley, New York.
Farmer, J. D., and Sidorowich, J. S. 1988. Exploiting Chaos to Predict the Future and Reduce Noise. Tech. Rep. LA-UR-88-901, Los Alamos National Laboratory.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Hilborn, C. G., and Lainiotis, D. G. 1969. Optimal estimation in the presence of unknown parameters. IEEE Trans. Syst. Sci. Cybern. SSC-5, 38-43.
Jacobs, R. A., et al. 1991. Adaptive mixtures of local experts. Neural Comp. 3(1), 79-87.
Jordan, M. I., and Jacobs, R. A. 1992. Hierarchies of adaptive experts. In NIPS 4, J. Moody, S. Hansen, and R. Lippman, eds. Morgan Kaufmann, San Mateo, CA.
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Comp. 6(1), 181-214.
Kadirkamanathan, V., and Niranjan, M. 1992. Application of an architecturally dynamic neural network for speech pattern classification. Proc. Inst. Acoustics 14(6), 343-350.
Lainiotis, D. G., and Plataniotis, K. N. 1994. Adaptive dynamic neural network estimation. In Proc. ICNN, Vol. 6, pp. 4736-4745.
Moody, J. 1989. Fast Learning in Multi-Resolution Hierarchies. Tech. Rep. YALEU/DCS/RR-681, Dept. of Computer Science, Yale University.
Neal, R. M. 1991. Bayesian Mixture Modelling by Monte Carlo Simulation. Tech. Rep. CRG-TR-91-2, Dept. of Computer Science, University of Toronto.
Nowlan, S. J. 1990. Maximum likelihood competitive learning. In NIPS 2, D. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Papanicolaou, G. A., and Medeiros, A. A. 1990. Discrimination of extended-spectrum β-lactamases by a novel nitrocefin competition assay. Antimicrob. Agents Chemother. 34(11), 2184-2192.
Perrone, M. P., and Cooper, L. N. 1993. When networks disagree: Ensemble methods for hybrid neural networks. In Neural Networks for Speech and Image Processing, R. J. Mammone, ed. Chapman-Hall, London.
Rabiner, L. R., and Schafer, R. W. 1988. Digital Processing of Speech Signals. Prentice-Hall, Englewood Cliffs, NJ.
Schwarze, H., and Hertz, J. 1992. Generalization in a large committee machine. Preprint, The Niels Bohr Institute.
Shadafan, R. S., and Niranjan, M. 1994. A dynamic neural network architecture by sequential partitioning of the input space. Neural Comp. 6(6), 1202-1222.

Received June 7, 1994; accepted June 14, 1995.
Communicated by Peter Konig
Temporal Segmentation in a Neural Dynamic System David Horn Irit Opher School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel
Oscillatory attractor neural networks can perform temporal segmentation, i.e., separate the joint inputs they receive, through the formation of staggered oscillations. This property, which may be basic to many perceptual functions, is investigated here in the context of a symmetric dynamic system. The fully segmented mode is one type of limit cycle that this system can develop. It can be sustained for only a limited number n of oscillators. This limitation to a small number of segments is a basic phenomenon in such systems. Within our model we can explain it in terms of the limited range of narrow subharmonic solutions of the single nonlinear oscillator. Moreover, this point of view allows us to understand the dominance of three leading amplitudes in solutions of partial segmentation, which are obtained for high n. The latter are also abundant when we replace the common input with a graded one, allowing for different inputs to different oscillators. Switching to an input with fluctuating components, we obtain segmentation dominance for small systems and quite irregular waveforms for large systems. 1 Introduction
Segmentation is a concept that is generally invoked when one discusses the way an image has to be separated into its components. This can be a general two-dimensional scene (see, e.g., Pentland 1989) or some very specific task such as recognizing overlapping handprinted characters (Martin 1993). Segmentation is a precursor of object recognition in many sensory modalities. In addition to vision we may mention auditory signal separation as in the cocktail party effect (von der Malsburg and Schneider 1986) and odor separation in the olfactory bulb (Hopfield and Gelperin 1989). It is tempting to assume that segmentation, as well as binding, is implemented via a temporal mechanism for tagging the individual memories in the mixed input. There are at least two good reasons for this approach: It makes it easy to deal with distributed as well as overlapping Neural Computation 8, 373-389 (1996)
© 1996 Massachusetts Institute of Technology
memories, and it is a natural mechanism when temporal structure exists in the input. Moreover, a temporal coding mechanism fits well within the cell assemblies approach, allowing for large variability. Single neurons need not be restricted to one feature; rather, they can participate in different feature-representing assemblies each time a new stimulus is presented. Binding is performed naturally by such a temporal mechanism. Each cell assembly that represents a feature is composed of synchronized oscillating neurons. This phase locking serves as a binding mechanism, defining (tagging) groups of correlated neurons. The biological evidence for binding through phase locking comes from the well-known results of Eckhorn et al. (1988) and Gray and Singer (1989). For more discussions of binding and segmentation in this context see, e.g., Engel et al. (1991), König and Schillen (1991), Singer (1994), and König et al. (1995). Temporal segmentation can be implemented in oscillatory networks using two very different mechanisms. One is to have each segment oscillate with a different frequency. Eckhorn (1994) has reported biological evidence for phase coupling of different frequency oscillations, corresponding to different elements of a visual scene. It remains then an open problem to understand the origin of the different frequency assignments. The alternative, in the case where the different segments use the same frequency, is to have well-defined phase lags between them. In nonlinear oscillatory systems this can mean staggered oscillations with each segment well separated from the other ones. This is the method that we employ, because we restrict ourselves to the case of parallel retrieval of individual memorized patterns that belong to the same neural network, where it is only natural to assume that one common frequency dominates the behavior. Its possible application to scene analysis was discussed by von der Malsburg and Buhmann (1992).
The implementation of temporal segmentation through staggered oscillations was demonstrated by Wang et al. (1990) and by Horn and Usher (1991). The way it works is that each one of the activities of the different memory patterns that are turned on by the input is dominant for only a short while, a fraction of the cycle of the whole system. This way the temporal overlap between any two memory activities is close to zero, so we can regard them as being well separated. This behavior is obtainable in these models because of the nonlinearity of their oscillations. Both models have a limited segmentation power, i.e., only for a small number of common inputs can they lead to staggered oscillations. Assuming all inputs are constant and of similar magnitude, it turns out in these models that for an input of more than approximately five memories the system will collapse. One may be tempted to speculate that this could be related to the psychophysical limit on attention and short-term memory (Miller 1956). Horn et al. (1992) have observed that when noise is added to constant inputs, the network can continue its staggered oscillations for very large numbers of objects. This comes about because noise fluctuations can
enhance momentarily the input of one of the Hebbian cell-assemblies, enabling it to overtake the other ones. Clearly a waveform displaying segmentation is a quite particular outcome of an oscillating neural system. Can we pinpoint the nonlinear property that enables its existence? Is there some way to quantify the importance of this mode of behavior? Can we devise a neuronal system in which it is the dominant mode? To study such questions we investigate a symmetric neural system, in which all possible waveforms can be classified by symmetry and followed through numerical simulations. This allows us to find the basin of attraction of segmentation, and compare it with those of other limit cycles. Moreover, by tracing the way this dynamic system operates, we find that the phenomenon of subharmonic oscillations, which can be obtained only in nonlinear systems, is responsible for segmentation. This leads to an understanding of the limitation on the number of segments that appear in this mode. Then, introducing symmetry breaking into the neuronal inputs, we find that segmentation modes play dominant roles, because of the lack of degeneracy in their waveforms. 2 Limit Cycles of a Symmetric Dynamic System
The emergence of the segmentation mode was investigated by von der Malsburg and Buhmann (1992) in a dynamic system of two oscillators with an inhibitory interneuron. In spite of the big parameter space that even such a simple problem possesses, segmentation is a very natural outcome. This could be different in systems with a larger number of oscillators, which we set out to investigate. To cut down the number of parameters we concentrate on a system in which each oscillator couples to itself and one common inhibitory unit. The oscillators are composed of excitatory neurons with dynamic thresholds that receive external inputs (Horn and Opher 1995):
du_i/dt = −u_i + m_i − a m̄ − θ_i + I_i    (2.1)

dθ_i/dt = b m_i − c θ_i    (2.2)

dv/dt = −g v − e m̄ + f Σ_i m_i    (2.3)

u_i denote the postsynaptic currents of the excitatory neurons, whose average firing rates are

m_i = (1 + e^(−β u_i))^(−1)    (2.4)

while v and m̄ are analogous quantities for the inhibitory neuron that induces competition between all excitatory ones:

m̄ = (1 + e^(−β v))^(−1)    (2.5)
Figure 1: Limit cycles of the n = 3 system. Parameters were a = 0.5, b = 0.4, c = 0.2, g = 0.1, e = 1.1, f = 0.3, β = 9. The different m_i are plotted vs. time after the system has reached stability. The time scale is arbitrary but is chosen to be the same in all figures. Each m_i is represented by a different symbol. The limit cycles are (a) fully synchronous, I = 0.8; (b) partially synchronous waveform, I = 0.4; (c) full segmentation, I = 0.4.
a, b, c, e, f, g, and β are fixed parameters. θ_i are dynamic thresholds that rise when their corresponding neurons i fire. They quench the active neurons and lead to oscillatory behavior. Note that θ_i can also be interpreted as inhibitory linear neurons that pair up with the excitatory u_i to form nonlinear oscillators. The continuous neurons play a role analogous to the Hebbian cell assemblies in the model of Horn and Usher (1990). That model was expressed in terms of rate variables, and could be derived from an underlying picture of nonoverlapping cell assemblies. Here we resort to single neuronal units for simplicity of the analysis.
All neurons are assumed to be under the influence of a common input I_i = I. In this case the system is fully symmetric: it remains invariant under the interchange of any two excitatory neurons i ↔ j. We will also assume that the common input is constant in time. In general, this dynamic system flows into a set of dynamic attractors. Thus, for 3 excitatory elements, one finds the following types of attractors: (a) common fixed point or common oscillatory mode; (b) two of the elements oscillate in phase and a third out of phase; and (c) staggered oscillations of all elements. The last type fits our understanding of temporal segmentation. Examples of all three types of limit cycles are shown in Figure 1. The parameters are fixed (as specified in Fig. 1) but for the value of I, which is chosen so as to obtain all types of behavior. It should be emphasized that the choice of parameters in these examples is quite arbitrary. The phenomena exist within a wide window of parameters. For example, dominance of modes b and c is obtained for 0.3 < I < 0.65 and 0.5 < a < 0.7. The solutions displayed in this figure, and throughout the present work, were obtained using the fourth-order Runge-Kutta method for integrating a set of differential equations with time steps of dt = 0.005. Smaller time steps led to the same results. By choosing a fixed set of parameters and varying over the initial conditions one can map the basins of attraction of the different limit cycles. An illustration of such a map is presented in Figure 2 where, for simplicity, we test a two-dimensional domain of initial conditions for parameters corresponding to Figure 1b and c. The resulting structure is complicated, as expected when dealing with a nonlinear system with strong dependence on initial conditions. We can see, for example, in the upper right corner of the map that there are bright spots on the boundaries between two other areas that represent full segmentation waveforms.
These are islands of waveforms of type b. The latter dominate other parts of this plane, as can be seen in the lower right corner and the upper left corner of the map. This complicated figure still conveys the message that each of the three waveforms of type b and the two waveforms of type c have roughly the same size of basin of attraction. To estimate the sizes of the basins of attraction of all waveforms we have chosen random initial conditions over the whole seven-dimensional space of initial conditions of the n = 3 problem, and checked to which waveforms the system converges. Our interest is focused on the segmentation limit cycles. We found that the probability of converging onto the two waveforms of type c is 0.45 for the set of parameters specified in Figure 1c. To perform such calculations one needs automatic means to recognize the limit cycles. This is where the symmetry of the problem becomes very useful. Starting from some initial condition the system flows into a limit cycle during a time period that is equivalent to several periods τ of the limit cycle, as shown in Figure 3. It turns out that τ is approximately the same for all limit cycles in this problem. Therefore, we wait for 20τ until
378
David Horn and Irit Opher
Figure 2: Basins of attraction of different limit cycles in the n = 3 problem. Parameters are the same as in cases (b) and (c) of Figure 1. Axes are initial conditions −0.6 < m_2 < 0.6 and −0.6 < m_3 < 0.6, with m_1(0) = 1 − m_2(0) − m_3(0), m(0) = 0.09, and θ_i(0) = 0. I, II, and III denote the three possible partial synchronous waveforms (corresponding to the three possibilities of choosing 2 neurons out of 3). IV and V denote the two possible full segmentation modes (corresponding to the two permutations of full segmentation).
we test the obtained limit cycle (for another 20τ) and match the obtained result with one of the finite set of limit cycles that we know the system can flow into.

2.1 Larger n Values. Turning to larger n values we continue to use the particular parameters specified in the examples of n = 3. n was defined as the number of excitatory neurons in our model. If instead of choosing a common input I_i = I for all neurons we present the input only to a subset, then all other excitatory neurons will remain inactive. Hence, for all practical purposes, n may be regarded as the number of excitatory neurons that are influenced by the common input. The symmetric model allows us to try to pinpoint the properties of the segmentation mode. In particular, we would like to understand how prominent it is. Using the same parameters for which the n = 3 problem has segmentation probability of 0.45, we find for n = 2, 4, 5 the values 1, 0.20, and 0.56, respectively.
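The basin-estimation procedure described above (random initial conditions, a transient of roughly 20τ, then classification of the settled limit cycle) can be sketched as follows. This is our own minimal sketch, not the authors' code: the right-hand side `f` and the classifier `classify` are placeholders standing in for equations 2.1-2.3 and the template-matching step, which are not reproduced in this excerpt.

```python
import numpy as np

def rk4(f, y0, t_end, dt=0.005):
    """Fourth-order Runge-Kutta integration with the time step used in the text."""
    y, t = np.asarray(y0, float), 0.0
    traj = [y]
    while t < t_end:
        k1 = f(y)
        k2 = f(y + 0.5 * dt * k1)
        k3 = f(y + 0.5 * dt * k2)
        k4 = f(y + dt * k3)
        y = y + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
        traj.append(y)
    return np.array(traj)

def basin_probability(f, classify, dim, tau, n_trials=1000, rng=None):
    """Estimate basin sizes: draw random initial conditions, integrate past a
    transient of about 20*tau, classify the waveform reached, tally frequencies."""
    rng = np.random.default_rng(rng)
    counts = {}
    for _ in range(n_trials):
        y0 = rng.uniform(-0.6, 0.6, size=dim)
        traj = rk4(f, y0, t_end=40 * tau)          # 20*tau transient + 20*tau test
        label = classify(traj[len(traj) // 2:])    # classify only the settled half
        counts[label] = counts.get(label, 0) + 1
    return {k: v / n_trials for k, v in counts.items()}
```

With 10,000 trials per parameter set, as in the text, the tallied frequencies approximate the relative basin sizes directly.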
Temporal Segmentation in a Neural System
379
Figure 3: Transient motion into a limit cycle. Results for the three m_i values are shown (a) as a function of time and (b) in a phase space with axes x = (m_1 − m_2)/2, y = (m_1 + m_2 − 2m_3)/(2√3). The shown time scale corresponds to (a) 20,000 iterations and (b) 40,000 iterations.
This calculation necessitates the employment of the procedure explained above. For n > 3 we just check whether the solution is of the full segmentation type, starting from random initial conditions. The accuracy of the calculation is limited only by the total number of trials employed, which was 10,000 for every n. The results depend on the parameters of the system. For example, the n = 4 segmentation probability increases to 0.80 for a = 0.65 and I = 0.5. In any case it seems that segmentation plays a quite dominant role for n = 4 and even 5. The latter is, however, the last n value for which a full segmentation mode is obtained. For n > 5 we find that some forms of effective segmentation often occur. This can happen in one of two ways (or a combination of both): (1) Degenerate segmentation, i.e., formation of clusters of amplitudes that move in unison in a segmentation pattern. An example for n = 6 is shown in Figure 4, where the amplitudes pair up to form three clusters. (2) Partial segmentation, where one finds leading amplitudes that display segmentation, and nonleading ones that have completely different behavior. An example for n = 8 is shown in Figure 5. In this example we find that the low amplitudes exhibit chaotic motion, yet the three large amplitudes follow a periodic segmentation course.
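Automatic recognition of these waveforms ultimately reduces to measuring periods and phase relations. A hypothetical helper (ours, not the authors' code) that estimates the dominant period of a roughly periodic trace from its autocorrelation, and, anticipating the subharmonic analysis of section 3, the subharmonic order of a response relative to a driving frequency:

```python
import numpy as np

def dominant_period(x, dt):
    """Estimate the dominant period of a roughly periodic signal from the
    first non-trivial peak of its autocorrelation."""
    x = x - np.mean(x)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0, 1, 2, ...
    k = 1
    while k < len(ac) - 1 and ac[k] <= ac[k - 1]:      # descend from the zero-lag peak
        k += 1
    while k < len(ac) - 1 and ac[k + 1] > ac[k]:       # climb to the first local maximum
        k += 1
    return k * dt

def subharmonic_order(response, drive_omega, dt):
    """Ratio of response period to driving period, rounded to an integer:
    order v means the system oscillates at 1/v of the driving frequency."""
    T_drive = 2 * np.pi / drive_omega
    return round(dominant_period(response, dt) / T_drive)
```

For a clean 1/2 subharmonic, for example, the response period comes out at twice the driving period and the helper returns 2.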
Figure 4: Degenerate segmentation in the n = 6 problem: formation of clusters of pairs of patterns that vary synchronously. Each m_i is represented by a different symbol.
3 Segmentation and Subharmonic Oscillations
In the system of equations that we study here, the interaction between all elements is provided by the inhibitory unit (equation 2.3). The individual excitatory unit i, described by equations 2.1 and 2.2, is influenced by all other units through the amplitude m of the inhibitory unit in equation 2.3. The behavior of each m_i in any given waveform can therefore also be viewed as the response of equations 2.1 and 2.2 to a driving term a·m(t). All the waveforms that we encounter have an overall period τ that is roughly the same as that of the free oscillator (a = 0). In a segmentation mode m(t) oscillates with a period of τ/n. This can be seen in Figure 6, where we display m_1(t) and m(t) for the n = 5 segmentation mode. If we think of m as the driving term, then the phenomenon observed here is that of subharmonic oscillation, which is known to exist in nonlinear oscillating systems (Mandelstam and Papalexi 1932; Hayashi 1964). The subharmonic oscillation exhibited in Figure 6 is of a very special kind:
Figure 5: A quasiperiodic solution of the n = 8 problem that displays partial segmentation. The three large amplitudes form a segmented pattern, while the low amplitudes (lines without symbols) form a chaotic background.
The width of the subharmonic amplitude m_1 is of the same order as that of the driving term, τ/n. Let us refer to it as a narrow subharmonic solution. This is of course necessary for the segmentation structure to occur, since the latter is constructed from amplitudes m_i that are repeated recurrences of m_1 shifted by multiples of τ/n. This is the only way n nonoverlapping amplitudes can be accommodated in one cycle. Each of the m_i corresponds to a particular choice of phase out of the n-fold degeneracy of the 1/n order narrow subharmonic solutions. A stable linear system follows the frequency of the driving term. Only nonlinear systems exhibit different periodic solutions, including the subharmonic ones that are of interest to us. The nonlinear characteristics of the system determine the possible orders of the subharmonic oscillations. In particular, a subharmonic solution of order 1/ν is known to occur when one of the terms of the nonlinear function has the power ν. In our case the nonlinearity is that of a sigmoid function, which, in principle, contains all powers. To test this idea directly, we ran the system of
Figure 6: The variation of m_1 (solid line) and m (the inhibitory amplitude, dotted line) in the n = 5 fully segmented limit cycle.

equations 2.1 and 2.2 with a constant + sinusoidal driving term replacing m. In other words, we investigated solutions of the set of equations

du_1/dt = −u_1 − a h(t) − b θ_1 + I    (3.1)

dθ_1/dt = m_1 − c θ_1    (3.2)
in which h(t) is chosen to be similar to m of Figure 6 but with a tunable frequency. We were able to generate narrow subharmonic solutions of 1/2 to 1/5. From 1/6 onward the subharmonic solutions were no longer narrow, i.e., the m_1 amplitude generated by such a driving term has a width that is considerably larger than τ/n. This explains why full segmentation is limited in our system to n ≤ 5. Higher n values cannot sustain the narrow subharmonic solution needed to build up segmentation. It is interesting to find the stability of the subharmonic oscillations. We tested it in two ways. First we ran the system with variations of the frequency ω of the pure sinusoidal driving term and measured the window Δω for which subharmonic oscillations were obtained. The results show large ranges for the subharmonics 1/2 and 1/3, for which the relevant values are 0.32 and 0.11, respectively. The other subharmonic
solutions were obtained for considerably smaller frequency windows. The results for 1/4, 1/5, and 1/6 are 0.03, 0.015, and 0.013, respectively. Then we tested stability of the subharmonic solution against the mixing of a lower frequency ω − δω with the frequency ω, which is the driving term of the nth subharmonic. Surprisingly, the 1/2 solution turns out to be unstable, whereas the higher solutions of 1/3, 1/4, etc. have a range of stability of the order δω/ω ≈ 0.2. These results explain why, in partial segmentation solutions observed for high n values, three segments of leading amplitudes are dominant. The small amplitudes add a varying background to the driving term created by the large amplitudes, the ones responsible for the subharmonic solution. The fact that the 1/3 subharmonic solution has a large frequency window and is stable against admixture of several frequencies is the reason for the dominance of structures like the one displayed in Figure 5.

4 Breaking the Input Symmetry
The system that we have considered so far was totally symmetric. The limit cycles can be viewed as spontaneously breaking this symmetry. Each waveform is invariant under some symmetry group that is a subgroup of the direct product of the permutation symmetry (under which the dynamic system of equations 2.1-2.3 is invariant) and time translation symmetry. This was investigated by us recently in some detail (Horn and Opher 1995) following a known methodology of dynamic systems. The segmentation mode has the important characteristic that no residual degeneracy is left in it. This is of particular importance when we run our system with different inputs to different units. In fact, it is the reason that segmentation modes become dominant under such circumstances. We introduce symmetry breaking by modifying the inputs to allow small deviations that, at this stage, are kept constant in time: I_i = I + ε_i. For small perturbations we find that the waveforms of the different limit cycles are only slightly modified, and thus we can continue to use the specifications of the symmetric problem. Yet the basins of attraction change considerably. In particular, the basin of any mode that has previously displayed a residual degeneracy gets strongly reduced when the degeneracy in the input is removed. Let us return to the n = 3 case of Figures 1 and 2 to show what happens when the input symmetry is broken. Starting with I = 0.4 and letting ε_1 grow, we find that the probability of flowing into a segmentation waveform increases rapidly, as displayed in Figure 7. It saturates at 0.8 because there are two attractors of type b whose basins of attraction shrink to zero. The remaining attractor of type b, whose basin stabilizes around a size of 0.2, is the one that has residual permutation symmetry between amplitudes 2 and 3, a symmetry that is still maintained by our symmetry breaking term.
Figure 7: Change of the size of the basin of attraction of the segmentation waveforms in the n = 3 problem for I_i = 0.4 + δ_{i1} ε_1.

For higher values of n and various symmetry breaking patterns we have obtained effective segmentation modes of different types. These include partial segmentation modes with three major amplitudes, like the example shown in Figure 8. Other possibilities are variations of degenerate segmentation modes, in which the degeneracy of the amplitudes is lifted but phase degeneracy is kept intact. Our general conclusion is that effective segmentation modes dominate when the input symmetry is broken.
5 Fluctuating Inputs
Horn et al. (1992) have noticed that in the system that they studied, fluctuating inputs were able to support segmentation of a large number of oscillators. Trying to test the generality of this phenomenon, we have subjected our system to such conditions. Our question is what happens when we replace the constant symmetry breaking terms of the previous section with fluctuating terms whose average is the same for all excitatory neurons.
Figure 8: Waveforms of an n = 8 system with graded inputs. The I_i were chosen as 0.6, 0.575, 0.55, 0.525, 0.5, 0.475, 0.45, 0.425.
A typical example of the fluctuating inputs that we used is I_i = 0.4 + 0.1η_i, where η_i is a random variable between 0 and 1, which varies with a random temporal distribution whose average is of the order of τ/20. In spite of these variations, we obtain for n = 3 a very regular segmentation structure, hardly distinguishable from that shown in Figure 1c. Given the fluctuating input, one cannot define limit cycles for such a system. Yet the stability of the full segmentation mode is striking. In fact, we may say that this random perturbation changed the probability of flowing into segmentation modes from 0.45 to about 1. This is, however, true only for n = 3. Moving to n = 4 we find, for the same type of variable input, approximate segmentation structure with about the same probability as for the constant common input. The resulting waveforms are no longer as regular as in the noiseless case. For the purpose of analysis we found it useful to move from inspection of waveforms to analysis of correlations, which we define through a time integral over products of the amplitudes,
where the integral is carried out over a time period T ≫ τ after some transient time t_0 has elapsed. Regularity may emerge only if we allow for an integration time T that is much larger than the characteristic period. Integrating over long time scales, we find in the case of n = 4 that the on-average symmetry of the system of differential equations reflects itself in an effective symmetry of the correlation matrix elements. Increasing the random component to 0.2 and 0.3, this symmetry gets spoiled and the regular structures disappear. For higher n values this symmetry breakdown occurs for lower values of the random component. The general emerging pattern can be described as irregular segmentation. An example is shown in Figure 9. Although we find the phenomenon of temporary dominance by a single amplitude, or a pair of amplitudes, all regularity of relative phases is lost. Only the n = 2 and n = 3 systems display the dynamic stability of regular segmentation, making it the dominant mode in the presence of fluctuating inputs.

Figure 9: Example of irregular segmentation behavior for the n = 5 problem with inputs of the type I_i = 0.4 + 0.3η_i.
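Since the precise correlation formula is lost in this scan, the sketch below assumes the plain time average c_ij = (1/T) ∫ m_i(t) m_j(t) dt over a window starting after the transient; the piecewise-constant input generator mirrors the I_i = 0.4 + 0.1η_i prescription with redraws on a time scale of τ/20. Both function names are ours.

```python
import numpy as np

def fluctuating_input(n_units, t_grid, tau, base=0.4, amp=0.1, rng=None):
    """Piecewise-constant random inputs I_i = base + amp*eta_i with eta_i ~ U(0,1),
    redrawn on a time scale of order tau/20."""
    rng = np.random.default_rng(rng)
    hold = tau / 20.0
    segment = (t_grid // hold).astype(int)        # which hold interval each t falls in
    draws = rng.uniform(0.0, 1.0, size=(n_units, segment.max() + 1))
    return base + amp * draws[:, segment]         # shape (n_units, len(t_grid))

def correlation_matrix(m, dt, t0, T):
    """Assumed correlation c_ij = (1/T) * integral of m_i(t) m_j(t) over [t0, t0+T];
    m has shape (n_units, n_steps)."""
    i0, i1 = int(t0 / dt), int((t0 + T) / dt)
    window = m[:, i0:i1]
    return window @ window.T * dt / T
```

With T much larger than τ, the matrix elements for the n = 4 case should become effectively symmetric, as described in the text.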
6 Discussion
The waveforms corresponding to a segmentation solution are sometimes referred to in the literature as rotating waves or ponies on a merry-go-round. An example of the latter can be found in Ermentrout (1992), who discusses a neural system composed of several excitatory neurons and one inhibitory one that, for some choice of parameters, exhibits such solutions. In his system, as well as in many other models of nonlinear oscillators that exhibit this phenomenon, the excitatory elements interact with one another in a fashion that corresponds to nearest neighbors on a ring. The rotating wave solution then fits the topology of the interaction matrix. The symmetric model that we have discussed has the topology of a star (with inhibition at the center), leading to many possible limit cycles. The segmentation mode is an extreme case of spontaneous breaking of the original symmetry. Yet it plays a special role. Its importance lies in the fact that the degeneracies between the different oscillators are lifted; hence it is stable against symmetry breaking of the input. Therefore we find that full segmentation in small systems, or partial segmentation in big systems, are favored limit cycles under these conditions. We view segmentation as an important oscillation mode because of the special cognitive role it may play in the analysis of mixed signals. Therefore it is important to understand the limit on the number of segments. In the system that we have discussed the limit is related to the impossibility of invoking narrow subharmonic solutions for n > 5. By studying the stability of subharmonic solutions we find the reason why usually three large amplitudes are involved in partial segmentation of high n systems. In the Introduction we raised the question of when segmentation is the favored waveform. We find that effective segmentation modes are preferred when we use graded inputs in our otherwise symmetric system.
One may wonder what happens when the symmetry is broken in the interaction matrix, i.e., if we introduce in equation 2.1 nondiagonal interactions of the type

du_i/dt = −u_i + Σ_{j≠i} W_ij m_j − a m − b θ_i + I_i    (6.1)

For this type of symmetry breaking, segmentation is not guaranteed. Nondiagonal excitatory interactions can lead to phase-locking. This is, after all, the procedure of forming cell assemblies. For particular cases, e.g., W_ij = 0 if W_ji > 0, ring-type interactions are formed that can lead to segmentation with a specified order, of the kind discussed at the beginning of this section. A combination of these two types of effects leads to degenerate segmentation. Yet it cannot be generally stated that segmentation is the dominant effect of symmetry breaking. Finally, we have looked at the stability of segmentation in the case of fluctuating inputs. We find that only the cases n ≤ 3 exhibit stability and dominance of pure segmentation. Noisy inputs in high n systems lead to
irregular segmentation patterns, which may be sufficient for some types of signal separation processes, but they no longer possess any regularity in the structure of the resulting waveforms. Dealing with nonlinear dynamic systems, it is difficult to derive general results. It is therefore important to elucidate the dynamic behavior in a system that can be studied in detail and where the interesting features can be discerned by numerical analysis. This was our goal when we set out to study temporal segmentation in a symmetric neural system. Clearly many of our results are specific to our model. Nonetheless, we believe that the connection that we have made with subharmonic oscillations, a well known property of nonlinear oscillators, can serve as a qualitative understanding of the behavior of temporal segmentation. In particular, the effective limit of small numbers, allowing only a few segments to appear in temporal segmentation, seems to hold true in different oscillatory systems in which this phenomenon was investigated. The subharmonic explanation that can be provided in our model throws new light on the reason for this general observation.
References

Eckhorn, R. 1994. Oscillatory and non-oscillatory synchronization in the visual cortex and their possible roles in associations of visual features. Prog. Brain Res. 102, 405-426.
Eckhorn, R., et al. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121-130.
Engel, A. K., Konig, P., and Singer, W. 1991. Direct physiological evidence for scene segmentation by temporal coding. Proc. Natl. Acad. Sci. U.S.A. 88, 9136-9140.
Ermentrout, B. 1992. Complex dynamics in winner-take-all neural nets with slow inhibition. Neural Networks 5, 415-431.
Gray, C. M., and Singer, W. 1989. Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. U.S.A. 86, 1698-1702.
Hayashi, C. 1964. Nonlinear Oscillations in Physical Systems. Princeton University Press, Princeton, NJ.
Hopfield, J. F., and Gelperin, A. 1989. Differential conditioning to a compound stimulus and its components in the terrestrial mollusc Limax maximus. Behav. Neurosci. 103, 329-333.
Horn, D., and Opher, I. 1995. Dynamical symmetries and temporal segmentation. J. Nonlinear Sci. 5, 359-372.
Horn, D., and Usher, M. 1990. Excitatory-inhibitory networks with dynamical thresholds. Int. J. Neural Syst. 1, 249-257.
Horn, D., and Usher, M. 1991. Parallel activation of memories in an oscillatory neural network. Neural Comp. 3, 31-43.
Horn, D., Sagi, D., and Usher, M. 1992. Segmentation, binding and illusory conjunctions. Neural Comp. 3, 509-524.
Konig, P., and Schillen, T. B. 1991. Stimulus-dependent assembly formation of oscillatory responses: Synchronization and desynchronization. Neural Comp. 3, 155-178.
Konig, P., Engel, A. K., Roelfsema, P. R., and Singer, W. 1995. How precise is neural synchronization? Neural Comp. 7, 469-485.
Mandelstam, L., and Papalexi, N. 1932. Über Resonanzerscheinungen bei Frequenzteilung. Z. Phys. 73, 223-248.
Martin, G. L. 1993. Centered object integrated segmentation and recognition of overlapping handprinted characters. Neural Comp. 5, 419-429.
Miller, G. A. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychol. Rev. 63, 81-97.
Pentland, A. 1989. Part segmentation for object recognition. Neural Comp. 1, 82-91.
Singer, W. 1994. Coherence of cortical functions. Int. Rev. Neurobiol. 37, 153-183.
von der Malsburg, C., and Buhmann, J. 1992. Sensory segmentation with coupled neural oscillators. Biol. Cybern. 67, 233-242.
von der Malsburg, C., and Schneider, W. 1986. A neural cocktail party processor. Biol. Cybern. 54, 29-40.
Wang, D., Buhmann, J., and von der Malsburg, C. 1990. Pattern segmentation in associative memory. Neural Comp. 2, 94-106.
Received December 5, 1994; accepted July 10, 1995.
Communicated by Erkki Oja
Circular Nodes in Neural Networks

Michael J. Kirby
Rick Miranda
Department of Mathematics, Colorado State University, Fort Collins, CO 80523 USA
In the usual construction of a neural network, the individual nodes store and transmit real numbers that lie in an interval on the real line; the values are often envisioned as amplitudes. In this article we present a design for a circular node, which is capable of storing and transmitting angular information. We develop the forward and backward propagation formulas for a network containing circular nodes. We show how the use of circular nodes may facilitate the characterization and parameterization of periodic phenomena in general. We describe applications to constructing circular self-maps, periodic compression, and one-dimensional manifold decomposition. We show that a circular node may be used to construct a homeomorphism between a trefoil knot in R^3 and a unit circle. We give an application with a network that encodes the dynamic system on the limit cycle of the Kuramoto-Sivashinsky equation. This is achieved by incorporating a circular node in the bottleneck layer of a three-hidden-layer bottleneck network architecture. Exploiting circular nodes systematically offers a neural network alternative to Fourier series decomposition in approximating periodic or almost periodic functions.

1 Introduction
A sigmoidal node of a standard feedforward neural network stores information as a real number in a bounded interval. Thinking geometrically, such a node is capable of encoding a point in the one-dimensional manifold I that is the open interval. Up to homeomorphism, there are only two one-dimensional manifolds: the open interval I and the circle S^1. The open interval is capable of and appropriate for encoding amplitude information; the circle is capable of and appropriate for encoding angular information. In this article we propose a "circular" node that is capable of encoding angular information and give one implementation of it that we have found expedient. We present both the forward propagation formulas and the backpropagation algorithm for a network containing circular nodes. We describe several applications in pattern analysis and the analysis of high-dimensional dynamic systems at the end of this article, culminating

Neural Computation 8, 390-402 (1996)
© 1996 Massachusetts Institute of Technology
in a neural network parameterization and compression of the limit cycle of the Kuramoto-Sivashinsky partial differential equation. This is an extension of the neural network description of the limit cycle for the Van der Pol oscillator, described in Kirby and Miranda (1994a). Another approach for encoding angular information in neural networks has been presented in Zemel et al. (1995). The reader may also be interested in the approach of Gislén et al. (1992) that uses rotor neurons to process data lying on an n-sphere.

2 Implementation of Circular Nodes
In our implementation, a circular node is actually a pair of coupled nodes whose values are constrained to lie on the unit circle. To be specific, we suppose that the neural network has L layers, numbered from 0 to L − 1. In each layer there are nodes, which can be either of two types: circular or sigmoidal. The jth node in layer i will be denoted by N_j^(i). The number of nodes in layer i is N^(i), and for fixed i the nodes N_j^(i) are indexed from j = 0 to j = N^(i) − 1. At each node N_j^(i) the state value of the network is denoted by S_j^(i). The circular nodes occur in coupled pairs, i.e., if node N_j^(i) (node j in layer i) is part of a circular node, then there will be a coupled node N_{τ(j)}^(i) [node τ(j), also in layer i]. Thus there are an even number of such nodes, in coupled pairs; we assume that τ[τ(j)] = j. By construction the state values for coupled circular nodes will always satisfy the circular constraint
(S_j^(i))^2 + (S_{τ(j)}^(i))^2 = 1    (2.1)

We think of each pair of such coupled nodes as a single "abstract circular" node, whose joint state value represents true angular information. Note that the use of two unconstrained sigmoidal nodes is not equivalent, and the resulting parameterization will fail to be circular. This implementation of a circular node fits naturally into other types of networks as well (e.g., ones with linear and sigmoidal units, or ones with feedback); we examine the circular node in this simplified yet fundamental situation to isolate its unique features.
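A minimal numpy sketch (ours, not the authors' code) of how a coupled pair enforces constraint 2.1: each pair of prestates is jointly normalized onto the unit circle, while ordinary nodes pass through a sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_layer(S_prev, W, b, circular_pairs):
    """One layer of forward propagation.  Prestates are the usual affine map
    P = W @ S_prev + b; sigmoidal nodes apply the sigmoid, while each coupled
    pair (j, tau_j) of circular nodes is projected onto the unit circle:
    S_j = P_j / R and S_tau = P_tau / R with R = sqrt(P_j^2 + P_tau^2)."""
    P = W @ S_prev + b
    S = sigmoid(P)                          # default: sigmoidal nodes
    for j, tj in circular_pairs:
        R = np.hypot(P[j], P[tj])           # radial value, shared by the pair
        S[j], S[tj] = P[j] / R, P[tj] / R   # enforces S_j^2 + S_tj^2 = 1
    return S
```

After the normalization, the pair's joint state is exactly a point on S^1, whatever the prestates were (excluding the degenerate case of both prestates zero).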
3 The Forward Propagation of the Network
In the architecture described above, each state value S_j^(i) in the network is determined from the state values S_k^(i−1) occurring in the previous layer. The inputs are the state values S_j^(0) at the nodes of the input layer 0. Each node has associated to it weight constants w_{jk}^(i) and bias constants b_j^(i). For each node N_j^(i) with i ≥ 1 (that is, not at the input
layer), we define a prestate value P_j^(i) by the formula

P_j^(i) = Σ_{k=0}^{N^(i−1)−1} w_{jk}^(i−1) S_k^(i−1) + b_j^(i)    (3.1)

This first step is the same for all nodes, whether they are circular or sigmoidal. Note that the bias constants b_j^(i) are defined for all i = 1, ..., L − 1, while the weight constants w_{jk}^(i) are defined for all i = 0, ..., L − 2. The final state values are then determined by the prestate values using different formulas depending on the type of node:

S_j^(i) = P_j^(i) / sqrt[(P_j^(i))^2 + (P_{τ(j)}^(i))^2]   if N_j^(i) is circular
S_j^(i) = σ(P_j^(i))   if N_j^(i) is sigmoidal    (3.2)

where σ(x) = 1/(1 + e^{−x}). Note that in the circular case, the state values automatically satisfy the circular constraint 2.1. It is useful to define the radial value: let

R_j^(i) = sqrt[(P_j^(i))^2 + (P_{τ(j)}^(i))^2]

if N_j^(i) is circular. Note that R_j^(i) = R_{τ(j)}^(i), and the state formula can then be rewritten as

S_j^(i) = P_j^(i) / R_j^(i)
in the circular case. This implementation of circular nodes requires only a relatively minor alteration of existing network architectures.

4 Error and Gradient Computations via Back-Propagation
Given a set of input state values S_j^(0) (which then determine the full set of state values using the forward propagation described above), one has a desired set of goal values G_j, as j runs from 0 to N^(L−1) − 1, i.e., over the output nodes. The total squared error E in the network for this state S (relative to this goal G) is the total squared error between the actual state values S_j^(L−1) at the output nodes and the goal values G_j there:

E = Σ_j (S_j^(L−1) − G_j)^2    (4.1)
If more than one state is used to form an average error, the gradient of the error simply sums over all these states; hence, to avoid using an index to indicate the state, we will develop the gradient computations for a single state only; the reader may then take the appropriate sum for an average error over many states. We now develop the formulas for the backpropagation algorithm, which computes the gradient of the error with respect to the weights and biases. Hence we compute the values ∂E/∂w_{jk}^(i) and ∂E/∂b_j^(i) for all i, j, and k. Standard application of the chain rule for derivatives leads to the formulas

∂E/∂w_{jk}^(i) = (∂E/∂P_j^(i+1)) S_k^(i)    (4.2)

and

∂E/∂b_j^(i) = ∂E/∂P_j^(i)    (4.3)

It remains then to compute the derivatives ∂E/∂S_j^(i), which is done recursively; at the start of the recursion, we have

∂E/∂S_j^(L−1) = 2 (S_j^(L−1) − G_j)    (4.4)

For other "lower" layer values, we have

∂E/∂S_j^(i) = Σ_k (∂E/∂P_k^(i+1)) w_{kj}^(i)    (4.5)

where the prestate derivatives follow from the state formulas: ∂E/∂P_j^(i) = σ′(P_j^(i)) ∂E/∂S_j^(i) if N_j^(i) is sigmoidal, and

∂E/∂P_j^(i) = [(1 − (S_j^(i))^2) ∂E/∂S_j^(i) − S_j^(i) S_{τ(j)}^(i) ∂E/∂S_{τ(j)}^(i)] / R_j^(i)

if N_j^(i) is circular.
With these modifications the backpropagation algorithm proceeds as usual.
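The circular-node chain-rule factors follow directly from differentiating S = P/R. As a sanity check, this sketch (ours, not the authors' code) compares the analytic Jacobian of a coupled pair against central finite differences:

```python
import numpy as np

def circular_state(P):
    """State of a coupled circular pair from its prestates: S = P / R."""
    return P / np.hypot(P[0], P[1])

def circular_jacobian(P):
    """Analytic dS/dP for a circular pair, as used in the backward recursion:
    dS_j/dP_j = (1 - S_j^2)/R  and  dS_tau/dP_j = -S_j * S_tau / R."""
    R = np.hypot(P[0], P[1])
    S = P / R
    return np.array([[(1 - S[0]**2) / R, -S[0] * S[1] / R],
                     [-S[0] * S[1] / R, (1 - S[1]**2) / R]])

# central finite-difference check of the analytic Jacobian
P = np.array([0.7, -1.3])
eps = 1e-6
num = np.empty((2, 2))
for k in range(2):
    dP = np.zeros(2); dP[k] = eps
    num[:, k] = (circular_state(P + dP) - circular_state(P - dP)) / (2 * eps)
assert np.allclose(num, circular_jacobian(P), atol=1e-6)
```

The Jacobian is symmetric and singular along the radial direction, reflecting the fact that the normalization discards the radius and keeps only the angle.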
5 Applications

In this section we will briefly describe several applications of circular nodes in the architecture of a neural network. We begin with certain prototypical constructions, which the reader may easily modify for more elaborate applications.

5.1 Prototypical Uses for Circular Nodes.
Parameterization of Periodic Phenomena: A locus Γ in R^n is said to be periodic if it is the image of a circle. A parameterization of this periodic locus is a function from the circle S^1 to the locus Γ. A neural network designed to produce such a function would have a single circular node as the input layer, one or more hidden layers, and an output layer consisting of n sigmoidal nodes.

Circular Self-Maps: Mappings from the circle to itself are approximated by neural networks having a single circular node as the input layer, one or more hidden layers, and a single circular node as the output layer. A linear combination of networks that reproduce multiplication-angle formulas would explicitly approximate any term of a finite Fourier polynomial, and would then be a direct generalization of nonharmonic decomposition; this can be compared with Lapedes and Farber (1987).

Periodic Compression: The converse to parameterization of a periodic locus is the idea of periodic compression, that is, taking a periodic locus Γ and mapping it to a circle (see, for example, Kirby and Miranda 1994b). Indeed, the locus Γ need not be a periodic locus; in this case a mapping from Γ to S^1 would simply be a "circular feature extraction"; one might view this as a lossy compression of (almost) periodic data.

Circular Remodeling of Boundary Value Problems: Suppose that a boundary value problem is defined on an irregular open locus in R^2 with an irregular boundary Γ that is homeomorphic to the circle. If the interior of Γ is star-shaped, the existence of a compression mapping G : Γ → S^1 may be extended to a mapping from the interior of Γ to the interior of the unit disc, and therefore used to transport the original differential equation to the interior of the unit disc, where solutions may be more easily computed via standard numerical techniques.

One-Dimensional Manifold Decomposition: The natural generalization of single feature extraction, whether sigmoidal or circular, is multiple feature extraction, where several features of a data set are to be captured simultaneously. If some of the features are amplitudes of various outputs, and others are periodic, our viewpoint is that this is mathematically best expressed as a mapping from the
pattern set Γ to I^n × (S^1)^k; the amplitude features are encoded in the n interval coordinates of the I^n factor, and the angular features are encoded in the k circular coordinates of the (S^1)^k factor. Such a mapping can be considered as a one-dimensional manifold decomposition of Γ.

5.2 Applications to Pattern Analysis. Let Γ ⊂ R^N be a data set that one wants to optimally compress. Linear methods (principal component analysis, or the Karhunen-Loève decomposition; see Devijver and Kittler 1982) discover an optimal ordered coordinate system in R^N such that Γ lies in the subspace defined by the first few coordinates; this works very well and is efficient if Γ fills up an open subset of a linear subspace. However, if Γ is a nonlinear subset of R^N, then one must have nonlinear coordinates to capture Γ completely and efficiently. Mathematically speaking, we are seeking a nonlinear function G : Γ → V, where V is an open subset of R^m, and the mapping function G is continuous and invertible (albeit nonlinear). Thus there will be an inverse mapping H : V → Γ, and a nonlinear representation of Γ may be thought of as giving inverse mappings G and H as above. The compressed manifold V may be more efficiently represented as lying in R^m × (S^1)^k; this affords the opportunity for efficient angular feature extraction.
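For the ideal case of data lying exactly on a planar circle, the compression map G and the parameterization H can be written in closed form. This sketch (ours, not the authors' code) uses atan2 rather than a trained network to illustrate the G/H pair:

```python
import numpy as np

def G(points):
    """Compress points on a planar circle to angles on S^1 (circular feature)."""
    return np.arctan2(points[:, 1], points[:, 0])

def H(theta):
    """Inverse parameterization S^1 -> R^2: reconstruct the circle points."""
    return np.column_stack([np.cos(theta), np.sin(theta)])

theta = np.linspace(0, 2 * np.pi, 50, endpoint=False)
data = H(theta)                       # 50 points on the unit circle
assert np.allclose(H(G(data)), data)  # G and H invert each other on the circle
```

A trained network with a circular bottleneck plays the role of this G/H pair for loci that are only homeomorphic to a circle, where no closed form is available.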
5.2.1 Bottleneck Architectures. Circular nodes fit naturally into the bottleneck architecture (see, e.g., Rumelhart and McClelland 1986; Oja 1991; Kramer 1991; Demers and Cottrell 1992; Krischer et al. 1993; Doya and Selverston 1994), which permits a nonlinear principal component analysis. The use of circular nodes permits this type of architecture to discover both amplitude and angular features. The model for this type of network has three (or more) hidden layers consisting of nodes that may be of any of the types described here, i.e., linear, circular, or sigmoidal. The network is trained on the identity mapping using backpropagation; one of the interior layers acts as the bottleneck layer, and provides the low-dimensional representation of the data. If one trains this network to reproduce the identity function on the data set Γ, the mapping G obtained from the input layer (of high dimension N) to the bottleneck layer (of low dimension m) will be an invertible and nonlinear modeling of the data set. Its inverse H is given by the mapping from the bottleneck layer to the output layer.

5.2.2 The Failure of the Sigmoidal. For truly periodic phenomena, one strictly needs a circle as a target for the compression map G; an interval (which is the natural output of a sigmoidal node) is not sufficient. This is because there is no continuous invertible mapping from the circle to any interval I. Hence if Γ is periodic with a bijective parameterization mapping H : S^1 →
Michael J. Kirby and Rick Miranda
396
Γ, and one tries to construct a bijective compression map G : Γ → I, the composition G ∘ H : S^1 → I would be a bijective mapping and therefore could not be continuous. The corresponding network will not be able to reliably generalize. An example of this can be seen in Kramer (1991). A pattern set consisting of 100 points on the unit circle and the bottleneck architecture were used to construct a mapping from a circle to an interval. Kramer showed that this architecture trained quite accurately on the 100 data points on the circle. However, for the reasons mentioned above, it cannot perform well for the general point on the circle. To demonstrate this we first trained the 2-4-1-4-2 bottleneck network (using all sigmoidal nodes) as above to high precision (average error = 0.001) on 20 points on the circle, as shown in the upper graph of Figure 1. The removal of the discontinuity point on the circle leaves a pattern set that is topologically an interval, and therefore one can train such a network to arbitrarily high accuracy (using more hidden nodes if necessary) at the expense of missing the point of discontinuity. One clearly sees a discontinuity in the mapping functions when applied to the general point on the circle; this is illustrated in the lower graph of Figure 1. We remark that these results are a property of the topology of the problem and are not a consequence of training strategies or architecture parameters; the discontinuity in the problem is the underlying source of the generalization failure. The introduction of a single circular node at the bottleneck layer gives a continuous mapping function and removes the discontinuity, allowing the network to fully generalize to the entire circle. In general, when one uses a bottleneck layer consisting of a single circular node, one expects to find the best closed curve approximation to the data, i.e., the reconstructed data are necessarily a closed curve that is an image of the circle on the bottleneck.
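The topological obstruction can be seen numerically. In the sketch below (our own illustration, not the experiment from the text), atan2 stands in for any attempted interval-valued compression map G: neighboring points on the circle must somewhere receive distant interval coordinates:

```python
import numpy as np

# Points on the unit circle, and an interval-valued "compression":
# the angle computed by arctan2, which takes values in (-pi, pi].
theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
g = np.arctan2(np.sin(theta), np.cos(theta))

# Any injective map from the circle to an interval must tear somewhere:
# most neighboring points map close together, but at the branch cut
# the interval coordinate jumps by nearly 2*pi.
jumps = np.abs(np.diff(g))
print(jumps.max())
```

The single large jump is exactly the discontinuity that shows up in the lower graph of Figure 1 when a sigmoidal bottleneck is forced to generalize around the circle.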
For an example of the best closed curve approximation to the Lorenz attractor, see Kirby and Miranda (1994b).
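One plausible realization of the circular bottleneck node, consistent with the description here, is to project a pair of preactivations onto the unit circle so that the node's two outputs encode a single angle (the exact forward equations are given in the paper's earlier sections; the inputs 3.0 and 4.0 below are arbitrary):

```python
import numpy as np

def circular_node(p, q, eps=1e-12):
    """Project the pair of preactivations (p, q) onto the unit circle S^1.

    The node's two outputs (x, y) satisfy x**2 + y**2 = 1, so the unit
    carries exactly one angular degree of freedom -- the circular
    coordinate read off at the bottleneck layer."""
    r = np.sqrt(p * p + q * q) + eps   # eps guards the origin
    return p / r, q / r

# The angular feature extracted by the compression map G is the angle
# of the bottleneck state; the reconstruction (the image of the decoder
# H) is then necessarily a closed curve.
x, y = circular_node(3.0, 4.0)
angle = np.arctan2(y, x)
print(x, y, angle)
```

Because the bottleneck state is constrained to S^1, the reconstructed data set is automatically the image of a circle, which is why the architecture produces closed-curve approximations.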
5.2.3 The Trefoil Knot. A beautiful example of the technique, which also illustrates the capability of unraveling topological complexity, is the case of pattern data lying on a trefoil knot K in R^3. This knot is also referred to as a (2,3)-torus knot; it is obtained by winding around a torus twice in one direction while going around three times in the other. The parametric equations are given below:

    x(t) = cos(2t)[2 + cos(3t)]
    y(t) = sin(2t)[2 + cos(3t)]
    z(t) = sin(3t)
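These parametric equations are easy to sample for training patterns. The sketch below generates 1000 points, matching the experiment's pattern count; the torus check at the end is our own sanity test, assuming the standard torus radii 2 and 1 implicit in the equations:

```python
import numpy as np

# The parameterization H : S^1 -> K of the (2,3)-torus knot.
t = np.linspace(0.0, 2.0 * np.pi, 1000, endpoint=False)
x = np.cos(2 * t) * (2 + np.cos(3 * t))
y = np.sin(2 * t) * (2 + np.cos(3 * t))
z = np.sin(3 * t)
knot = np.column_stack([x, y, z])   # 1000 training patterns in R^3

# Sanity check: every point lies on the torus with radii (2, 1), i.e.
# (sqrt(x^2 + y^2) - 2)^2 + z^2 = 1; the curve winds twice one way
# around the torus while going three times around the other.
residual = (np.sqrt(x**2 + y**2) - 2.0) ** 2 + z**2 - 1.0
print(np.abs(residual).max())
```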
Figure 1: The output of a single sigmoidal bottleneck node as a function of angle for 20 trained points is represented by the symbol o. The attempted generalization of this network trained on 20 points applied to 100 points on the circle is represented by the symbol +.
The knot itself is intrinsically homeomorphic to a circle; the embedding into R^3 is what makes it topologically interesting. A parameterization mapping H : S^1 → K gives such an embedding. The compression mapping G : K → S^1 may be considered as the "unknotting" of the trefoil knot. The architecture used in our experiment was a 3-15-1-15-3 bottleneck network, all of whose nodes were sigmoidal except for the single circular node in the middle (bottleneck) layer. We trained this network to approximate the identity mapping for 1000 points on the trefoil knot K. In Figure 2 we show the trefoil knot and the result of applying the parameterization mapping H to 400 points on the unit circle, to obtain points in R^3 that approximate the trefoil knot.
5.3 Uncovering a High-Dimensional Limit Cycle. The Kuramoto-Sivashinsky (K-S) equation

    u_t + 4 u_xxxx + α [u_xx + (1/2)(u_x)^2] = 0    (5.1)
Figure 2: The trefoil knot in R^3 is represented by the solid line. The output of the network modeling the trefoil knot is shown as dots.
has become a benchmark for many theories on dissipative dynamic systems and global attractors. It exhibits low-dimensional complex dynamics including chaos, and it has been shown to possess an inertial manifold, i.e., there exists a smooth (C^1) low-dimensional description of its dynamics. Thus, in the spirit of the Lorenz equations, one expects that a small system of ordinary differential equations will model the PDE, and in fact low-dimensional approximations of the inertial manifold have been constructed for Neumann boundary conditions in Jolly et al. (1990) and for periodic boundary conditions in Armbruster et al. (1989). The geometry of the solutions has also been studied, and Karhunen-Loève-based low-dimensional approximations presented in Kirby and Armbruster (1992). In this section we investigate how a bottleneck neural network with a circular node can be used to study the dynamics of the K-S equation. To obtain data for the K-S equation (with periodic boundary conditions) we perform a numerical integration by means of a Galerkin approximation. This approach generates a system of ordinary differential equations by decomposing the velocity field via the expansion

    u(x, t) = Σ_{n=−N}^{N} a_n(t) e^{inx}    (5.2)

Substituting 5.2 into the K-S equation 5.1 and exploiting the orthogonality of the complex exponentials, one is led to a system of ordinary differential equations in the Fourier coefficients a_n(t). Making use of the
Figure 3: The localized oscillatory pattern of the K-S equation for α = 84.25.
reality condition a_{−n} = ā_n and the fact that the a_0 term decouples from the system gives

    ȧ_l = l²(α − 4l²) a_l + (α/2) Σ_{n=1}^{l−1} (l−n) n a_{l−n} a_n − α Σ_{n=1}^{N−l} (l+n) n a_{l+n} ā_n    (5.3)

where 2 ≤ l ≤ N − 1; for l = 1 and l = N the quadratic sums degenerate, giving ȧ_1 = (α − 4) a_1 − α Σ_{n=1}^{N−1} (1+n) n a_{1+n} ā_n and ȧ_N = N²(α − 4N²) a_N + (α/2) Σ_{n=1}^{N−1} (N−n) n a_{N−n} a_n. The system of equations for the Fourier coefficients is then integrated numerically. Thus the output of the simulation is the N complex Fourier coefficients [a_1(t), ..., a_N(t)], and this can be returned to the original coordinate system using 5.2. It is well established that the K-S equation undergoes a Hopf bifurcation at α = 83.75, resulting in a localized oscillatory pattern. We simulate equation 5.1 with α = 84.25 and N = 10, and the results are shown in Figure 3. Observe that the oscillations appear to be strictly localized in space, traveling up the tip of the higher hump. This is considered to be a forerunner of the spatiotemporal complexity encountered in spatially concentrated zones of turbulence. The origin of spatiotemporal complexity in moderate turbulence is one of the big open questions in turbulence research. Using local analysis techniques on the system 5.3 we can show that the solution is temporally periodic near the bifurcation point. Hence,
Figure 4: The output of the neural network corresponding to the first four complex Fourier coefficients.

while it lies in a 20-dimensional real vector space, it is a limit cycle Γ homeomorphic to S^1. For this reason it is appropriate to attempt to map the data to a circle, and in this example we apply the periodic compression architecture to uncover the periodic solution that lies in R^20. In our experiment we took N = 10, which was sufficient for numerical reasons, and the resulting 10 complex Fourier coefficients {a_n(t), n = 1, ..., 10} can be written as the 20-tuple Γ = [x_1(t), y_1(t), ..., x_10(t), y_10(t)] with a_n = x_n + iy_n. This time series can be viewed as a periodic data set in R^20. To construct a mapping from this data to the circle we utilize a 20-15-1-15-20 bottleneck neural network with a circular node in the middle bottleneck layer. After normalizing the data by the variance, we trained this network to approximate the identity mapping for 1500 points on the high-dimensional cycle Γ. The network trained quickly to an average error of less than 10^−1, indicating the excellent degree of fit between the data set and the unit circle. Note that with this construction, the data originate in R^20, are funneled through the bottleneck layer into a circle S^1, and are reproduced by the second half of the network. In Figure 4 we present the output of the second half of the neural network, for the first four Fourier coefficients a_1, a_2, a_3, a_4; this is the parameterization H : S^1 → Γ, for these same four coefficients. We also remark that the technique of periodic compression can be used to study a sequence of Hopf, or cycle-producing, bifurcations. For
example, at α = 87 the solution is the oscillating structure of the previous regime, which is now a traveling beating wave. This solution could be mapped to S^1 × S^1 and would be an example of the manifold decomposition outlined in Section 5.1. In addition, we observe that our analysis is not restricted to a local region about the bifurcation value. The parameterization techniques presented for the analysis of data of dynamic systems such as the K-S equation can also be used as the basis for approximate inertial manifold constructions. In particular, the inverse mappings G and H constructed by the neural network can also be used to transport the dynamical system or flow from the attractor embedded in the high-dimensional space to the model space built as the image of the system in the low-dimensional bottleneck layer. Therefore we not only have a much simpler model for the attractor, but also a potentially simpler model for the differential equation. This work has begun in Kirby and Miranda (1994a) and will be discussed in terms of the circular node in a forthcoming article (Kirby and Miranda 1996).
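The Galerkin system 5.3 driving these experiments is straightforward to transcribe. The sketch below writes the quadratic sums naively and takes one explicit Euler step; the initial condition and step size are illustrative only, not the integration scheme of the text:

```python
import numpy as np

def ks_rhs(a, alpha):
    """Right-hand side of the truncated Fourier-Galerkin system (5.3)
    for the K-S equation; a = (a_1, ..., a_N) is a complex vector.
    Naive O(N^2) transcription of the formula, not optimized."""
    N = len(a)
    adot = np.zeros(N, dtype=complex)
    for l in range(1, N + 1):
        d = l * l * (alpha - 4.0 * l * l) * a[l - 1]        # linear term
        for n in range(1, l):                               # a_{l-n} a_n sum
            d += 0.5 * alpha * (l - n) * n * a[l - n - 1] * a[n - 1]
        for n in range(1, N - l + 1):                       # a_{l+n} conj(a_n) sum
            d -= alpha * (l + n) * n * a[l + n - 1] * np.conj(a[n - 1])
        adot[l - 1] = d
    return adot

# One explicit Euler step from a small random state, with alpha and N
# as in the simulation reported above (alpha = 84.25, N = 10).
rng = np.random.default_rng(0)
a = 0.01 * (rng.standard_normal(10) + 1j * rng.standard_normal(10))
a_next = a + 1e-6 * ks_rhs(a, 84.25)
```

Note that for a single excited mode a_1 the formula reduces to ȧ_1 = (α − 4)a_1 and ȧ_2 = (α/2)a_1², which gives a quick correctness check on the transcription.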
6 Summary
In this paper we present an implementation of a circular node for a neural network, capable of storing and transmitting angular information. It can be used in conjunction with standard sigmoidal nodes to create networks utilizing both types of node. We present the equations for forward and backward propagation, which have the same general form as the standard implementation. Thus the incorporation of circular nodes and the training of networks utilizing them are not radically different from existing algorithms, and can be easily implemented. Finally, we present several examples of the use of circular nodes in pattern analysis. We close with a nontrivial reconstruction of the limit cycle of the Kuramoto-Sivashinsky equation, whose original geometry lies in a space of 20 dimensions. Our work suggests that the construction and implementation of other useful types of nodes are both feasible and potentially useful. For example, mappings to two-dimensional manifolds may be better served by the construction of nodes carrying spherical or multitoroidal information (Hundley et al. 1995).
Acknowledgments

Research supported in part by the NSF under Grant no. ECS-9312092.
References

Armbruster, D., Guckenheimer, J., and Holmes, P. 1989. Kuramoto-Sivashinsky dynamics on the center-unstable manifold. SIAM J. Appl. Math. 49, 676.
Demers, D., and Cottrell, G. 1992. In Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., p. 582. Morgan Kaufmann, San Mateo, CA.
Devijver, P. A., and Kittler, J. 1982. Pattern Recognition: A Statistical Approach. Prentice-Hall, Englewood Cliffs, NJ.
Doya, K., and Selverston, A. I. 1994. Dimension reduction of biological neuron models by artificial neural networks. Neural Comp. 6, 696-717.
Gislén, L., Peterson, C., and Söderberg, B. 1992. Rotor neurons: Basic formalism and dynamics. Neural Comp. 4, 737-745.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA.
Hundley, D., Kirby, M., and Miranda, R. 1995. Spherical nodes in neural networks and applications. Artificial Neural Networks in Engineering (to appear).
Jolly, M. S., Kevrekidis, I. G., and Titi, E. S. 1990. Approximate inertial manifolds for the Kuramoto-Sivashinsky equation: Analysis and computations. Physica D 44, 38.
Kirby, M., and Armbruster, D. 1992. Reconstructing phase-space from PDE simulations. Z. Angew. Math. Phys. 43, 999-1022.
Kirby, M., and Miranda, R. 1994a. The nonlinear reduction of high-dimensional dynamical systems via neural networks. Phys. Rev. Lett. 72(12), 1822-1825.
Kirby, M., and Miranda, R. 1994b. The remodeling of chaotic dynamical systems. In Intelligent Engineering Systems Through Artificial Neural Networks (Proceedings of the ANNIE '94 Conference, St. Louis, MO), C. H. Dagli, B. R. Fernandez, J. Ghosh, and R. T. S. Kumara, eds., Vol. 4, pp. 831-836. ASME Press, New York.
Kirby, M., and Miranda, R. 1996. The reduction of dynamical systems (in preparation).
Kramer, M. A. 1991. Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 37(2), 233-243.
Krischer, K., Rico-Martinez, R., Kevrekidis, I. G., Rotermund, H. H., Ertl, G., and Hudson, J. L. 1993. Model identification of a spatiotemporally varying catalytic reaction. AIChE J. 39(1), 89-98.
Lapedes, A. S., and Farber, R. 1987. Nonlinear Signal Processing Using Neural Networks: Prediction and System Modeling.
Received September 14, 1994; accepted February 3, 1995.
Communicated by Hava Siegelmann
The Computational Power of Discrete Hopfield Nets with Hidden Units Pekka Orponen Department of Computer Science, P.O. Box 26, University of Helsinki, FIN-00014 Helsinki, Finland
We prove that polynomial size discrete Hopfield networks with hidden units compute exactly the class of Boolean functions PSPACE/poly, i.e., the same functions as are computed by polynomial space-bounded nonuniform Turing machines. As a corollary to the construction, we observe also that networks with polynomially bounded interconnection weights compute exactly the class of functions P/poly, i.e., the class computed by polynomial time-bounded nonuniform Turing machines. 1 Introduction
We investigate the power of discrete Hopfield networks (Hopfield 1982) as general computational devices. Our main interest is in the problem of Boolean function computation by symmetric networks of weighted threshold logic units; but for the constructions, we also need to consider asymmetric nets. In our model of network computation, the input to a net is initially loaded onto a set of designated input units; then the states of the units in the network are updated repeatedly, according to their local update rules, until the network (possibly) converges to some stable global state, at which point the output is read from a set of designated output units. We consider only finite networks of units with binary states, and with discrete-time synchronous dynamics (i.e., all the units are updated simultaneously in parallel). However, it is known that any computation on a synchronous network can be simulated on an asynchronous network where the updates are performed in a specific sequential order (Tchuente 1986; Bruck and Goodman 1988), or indeed even on a network where the update order is a priori totally undetermined (Orponen 1995). Following the early work of McCulloch and Pitts (1943) and Kleene (1956) (see also Minsky 1972), it has been customary to think of finite (asymmetric) networks of threshold logic units as equivalent to finite automata (for recent work along these lines see, e.g., Alon et al. 1991; Horne and Hush 1994; Indyk 1995). However, in Kleene's construction for the equivalence, the input to a net is given as a sequence of pulses, Neural Computation 8, 403-415 (1996) © 1996 Massachusetts Institute of Technology
whereas from many of the current applications' point of view it would be more natural to think of all of the input as being loaded onto the network in the beginning of a computation. [This is also the input convention followed in standard Boolean circuit complexity theory (Wegener 1987).] Of course, this view makes the network model nonuniform, in the sense that any single net operates on only fixed-length inputs, and to compute a function on an infinite domain one needs a sequence of networks. Since a recurrent net of s units converging in time t may be "unwound" into a feedforward circuit of size s·t, the class of Boolean functions computed by polynomial size, polynomial time asymmetric nets coincides with the class P/poly of functions computed by polynomial size Boolean circuits or, equivalently, polynomial time Turing machines with a polynomially bounded number of nonuniform "advice bits" (Karp and Lipton 1982; Balcázar et al. 1988).¹ On the other hand, if computation times are not bounded, then a relatively straightforward argument shows that the class of functions computed by polynomial size asymmetric nets equals the class PSPACE/poly of functions computed by polynomial space bounded Turing machines with polynomially bounded advice. Parberry (1990) attributes this result to an early unpublished report by Lepley and Miller (1983), but for completeness we outline a proof in Section 3. Thus, general asymmetric recurrent nets are fairly easy to characterize computationally, and turn out to be quite powerful. On the other hand, as pointed out by Hopfield (1982), networks with symmetric interconnections possess natural Liapunov functions, and are thus at least dynamically much more constrained. (The simple dynamics, and the generic form of the Liapunov functions, are also what make symmetric networks so attractive in many applications.)
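The "unwinding" step can be made concrete with a toy net (the weights below are arbitrary illustrations, not a construction from the text): running the recurrent net for t steps computes exactly what t stacked copies of its weight layer compute as a feedforward circuit:

```python
import numpy as np

def step(s, W, h):
    """One synchronous update of a threshold-logic net:
    s'_i = 1 if sum_j w_ij s_j >= h_i, else 0."""
    return (W @ s >= h).astype(int)

# An arbitrary small asymmetric net (illustrative weights only).
W = np.array([[0, 2, -1],
              [1, 0,  1],
              [-1, 1, 0]])
h = np.array([1, 1, 0])
s0 = np.array([1, 0, 1])   # input loaded onto the units
t = 4

# Recurrent run for t steps ...
s_rec = s0.copy()
for _ in range(t):
    s_rec = step(s_rec, W, h)

# ... equals the "unwound" feedforward circuit: t copies of the same
# layer stacked depthwise (s * t gates in place of s units run t steps).
s_ff = s0.copy()
for W_layer, h_layer in [(W, h)] * t:
    s_ff = step(s_ff, W_layer, h_layer)

print(s_rec, s_ff)
```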
Specifically, Goles (1982) and Hopfield (1982) observed that under asynchronous updates, any symmetric net with no negative self-connections at the units always converges from any initial state to some stable final state. By analyzing the rate of decrease of the Liapunov functions, Fogelman et al. (1983) further showed that in a symmetric net of p units with no negative self-connections, and with integer connection weights w_ij, the convergence requires at most a total of O(Σ_{i,j} |w_ij|)
unit state changes. Under synchronous updates a similar bound holds also for nets with negative self-connections (Poljak and Sura 1983; Goles et al. 1985; Bruck and Goodman 1988), but in this model the network

¹In a recent paper, Siegelmann and Sontag (1994) prove that also finite size asymmetric recurrent nets with real-valued unit states and connection weights, and a saturated-linear transfer function, compute in polynomial time exactly the functions in P/poly.
may also converge to oscillate between two alternating states instead of a unique stable state. Thus, in particular, symmetric networks with polynomially bounded weights converge in polynomial time. On the other hand, networks with exponentially large weights may indeed require an exponential time to converge, as was first shown by Goles and Olivos (1981) for synchronous updates, and by Haken and Luby (1988) for a particular asynchronous update rule. [The former construction was later simplified by Goles and Martinez (1989).] A network requiring exponential time to converge under an arbitrary asynchronous update order was first demonstrated by Haken (1989). The existence of networks with exponentially long asynchronous transients is now known to follow also from the general theory of local search for optimization problems (Schaffer and Yannakakis 1991). In this paper, we prove that despite their constrained dynamics, symmetric networks lose nothing of their computational power: specifically, symmetric polynomial size networks with unbounded weights can compute all functions in PSPACE/poly, and networks with polynomially bounded weights can compute all functions in P/poly. The idea, presented in Section 4, is to start with the simulation of space-bounded Turing machines by asymmetric nets, and then replace each of the asymmetric edges by a sequence of symmetric edges whose behavior is sequenced by clock pulses. The appropriate clock can be obtained from, e.g., the symmetric exponential-transient network of Goles and Martinez (1989). Obviously, such a clock network cannot run forever (and in this case it is not sufficient to have a network that simply oscillates between two states), but nevertheless the sequence of pulses it generates is long enough to simulate a polynomially space-bounded computation or, in the case of polynomially bounded weights, a polynomially time-bounded one.
For more information on the computational aspects of recurrent threshold logic networks, or more generally automata networks, see, e.g., the survey articles and books by Floréen and Orponen (1994), Fogelman et al. (1987), Goles and Martinez (1990), Kamp and Hasler (1990), and Parberry (1990, 1994).

2 Preliminaries
Following Parberry (1990), we define a (discrete) neural network as a 6-tuple N = (V, I, O, A, w, h), where V is a finite set of units, which we assume are indexed as V = {1, ..., p}; I ⊆ V and O ⊆ V are the sets of input and output units, respectively; A ⊆ V is a set of initially active units, of which we require that A ∩ I = ∅; w : V × V → Z is the edge weight matrix, and h : V → Z is the threshold vector. The size of a network is its number of units, |V| = p, and the weight of a network is defined as its sum total of edge weights, Σ_{i,j∈V} |w_ij|. A Hopfield net (with hidden units) is a neural network N whose weight matrix is symmetric.
Given a neural network N, let us denote |I| = n, |O| = m; moreover, let us assume that the units are indexed so that the input units appear at indices 1 to n. The network then computes a partial mapping

    f_N : {0,1}^n → {0,1}^m

as follows. Given an input x, |x| = n, the states s_i of the input units are initialized as s_i = x_i. The states of the units in set A are initialized to 1, and the states of the remaining units are initialized to 0. Then new states s'_i, i = 1, ..., p, are computed simultaneously for all the units according to the rule

    s'_i = sgn(Σ_{j=1}^{p} w_ij s_j − h_i)

where sgn(t) = 1 for t ≥ 0, and sgn(t) = 0 for t < 0. This updating process is repeated until no more changes occur, at which point we say that the network has converged, and the output value f_N(x) can be read off the output units (in order). If the network does not converge on input x, the value f_N(x) is undefined. For simplicity, we consider from now on only networks with a single output unit; the extensions to networks with multiple outputs are straightforward. The language recognized by a single-output network N, with n input units, is defined as

    L(N) = {x ∈ {0,1}^n | f_N(x) = 1}
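These definitions translate directly into code. The sketch below runs the synchronous dynamics to convergence and reads off f_N; the three-unit AND net at the end is our own toy illustration, not a construction from the paper:

```python
import numpy as np

def compute_f_N(W, h, x, input_idx, output_idx, active_idx=(), max_steps=1 << 16):
    """Compute f_N(x) per the definition: load x onto the input units,
    set the initially active units A to 1 and all others to 0, update
    synchronously with s'_i = sgn(sum_j w_ij s_j - h_i) until no state
    changes, then read the output units.  Returns None (f_N(x) is
    undefined) if the net has not converged within max_steps."""
    s = np.zeros(W.shape[0], dtype=int)
    s[np.asarray(input_idx, dtype=int)] = x
    s[np.asarray(active_idx, dtype=int)] = 1
    for _ in range(max_steps):
        s_new = (W @ s >= h).astype(int)   # sgn(t) = 1 iff t >= 0
        if np.array_equal(s_new, s):       # converged
            return s[np.asarray(output_idx, dtype=int)]
        s = s_new
    return None

# Toy single-output net recognizing L(N) = {11}: unit 2 computes the
# AND of the two input units (weights are illustrative only).
W = np.array([[1, 0, 0],    # self-connections hold the input states
              [0, 1, 0],
              [1, 1, 1]])
h = np.array([1, 1, 2])
print(compute_f_N(W, h, np.array([1, 1]), [0, 1], [2]))
```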
Given a language A ⊆ {0,1}*, denote A^(n) = A ∩ {0,1}^n. We consider the following complexity classes of languages:

    PNETS = {A ⊆ {0,1}* | for some polynomial q, there is for each n a network of size at most q(n) that recognizes A^(n)}

    PNETS(symm) = {A ⊆ {0,1}* | for some polynomial q, there is for each n a Hopfield net of size at most q(n) that recognizes A^(n)}

    PNETS(symm, small) = {A ⊆ {0,1}* | for some polynomial q, there is for each n a Hopfield net of weight at most q(n) that recognizes A^(n)}
Let (x, y) be some standard pairing function mapping pairs of binary strings to binary strings (see, e.g., Balcázar et al. 1988, p. 7). A language A ⊆ {0,1}* belongs to the nonuniform complexity class PSPACE/poly (Karp and Lipton 1982; Balcázar et al. 1987; 1988, p. 100) if there exist a polynomial space bounded Turing machine M and an "advice" function f : N → {0,1}*, such that for some polynomial q and all n ∈ N, |f(n)| ≤ q(n), and for all x ∈ {0,1}*,

    x ∈ A ⟺ M accepts (x, f(|x|))
The class P/poly is defined analogously, using polynomial time instead of space bounded Turing machines. The class P/poly can also be characterized as the class of languages recognized by polynomial size-bounded sequences of feedforward Boolean circuits (Balcázar et al. 1988, p. 111). This includes circuits using threshold logic gates, as any threshold function on k variables can be implemented as an AND/OR/NOT-circuit of size O(k² log² k) and depth O(log k) (Parberry 1994, p. 173).

3 Simulating Turing Machines with Asymmetric Nets
Simulating space bounded Turing machines with asymmetric neural nets is fairly straightforward.

Theorem 3.1. PNETS = PSPACE/poly.

Proof (outline). To prove the inclusion PNETS ⊆ PSPACE/poly, observe that given (a description of) a neural net, it is possible to simulate its behavior in situ. Hence, there exists a universal neural net interpreter machine M that given a pair (x, N) simulates the behavior of net N on input x in linear space. Let then language A be recognized by a polynomial size bounded sequence of nets (N_n). Then A ∈ PSPACE/poly via the machine M and advice function f(n) = N_n. To prove the converse inclusion, let A ∈ PSPACE/poly via a machine M and advice function f. Let the space complexity of M on input (x, f(|x|)) be bounded by a polynomial q(|x|). Without loss of generality (see, e.g., Balcázar et al. 1988) we may assume that M has only one tape, halts on any input (x, f(|x|)) in time c^{q(|x|)} for some constant c, and indicates its acceptance or rejection of the input by printing a 1 or a 0 on the first square of its tape. Following the standard simulation of Turing machines by combinational circuits (Balcázar et al. 1988, pp. 106-112), it is straightforward to construct for each n a feedforward threshold logic circuit that simulates the behavior of M on inputs of length n. [More precisely, the circuit simulates computations M((x, f(n))), where |x| = n.] This circuit consists of c^{q(n)} "layers" of O[q(n)] parallel wires, where the tth layer represents the configuration of the machine M at time t (Fig. 1, left). Every two consecutive layers of wires are interconnected by an intermediate layer of q(n) constant-size subcircuits, each implementing the local transition rule of machine M at a single position of the simulated configuration.
The input x is entered to the circuit along input wires; the advice string f(n) appears as a constant input on another set of wires; and the output is read from the particular wire at the end of the circuit that corresponds to the first square of the machine tape.
Figure 1: Simulation of a space bounded Turing machine by an asymmetric recurrent net.

One may now observe that the interconnection patterns between layers are very uniform: all the local transition subcircuits are similar, with a structure that depends only on the structure of M, and their number depends only on the length of x. Hence we may replace the exponentially many consecutive layers in the feedforward circuit by a single transformation layer that feeds back on itself (Fig. 1, right). (As can be seen in the figure, we now use an explicit layer of units to represent the configuration of machine M, with small positive self-connections to maintain the representation between successive transformations.) The size of the recurrent network thus obtained is only O[q(n)]. When initialized with input x loaded onto the appropriate input units, and advice string f(n) mapped to the appropriate initially active units, the network will converge in O(c^{q(n)}) update steps, at which point the output can be read off the unit corresponding to the first square of the machine tape. □

4 Simulating Asymmetric Nets with Symmetric Nets
Having now shown how to simulate polynomial space bounded Turing machines by polynomial size asymmetric nets, the remaining problem is how to simulate the asymmetric edges in a network by symmetric ones. This is not possible in general, as witnessed by the different convergence behaviors of asymmetric and symmetric nets. However, in the special case of convergent computations on asymmetric nets the simulation can be effected.

Theorem 4.1. PNETS(symm) = PSPACE/poly.
Figure 2: A sequence of symmetric edges simulating an asymmetric edge of weight w.
Figure 3: The clock pulse sequence used in the edge simulation of Figure 2.

Proof. Because PNETS(symm) ⊆ PNETS, and by the previous theorem PNETS ⊆ PSPACE/poly, it suffices to prove the inclusion PSPACE/poly ⊆ PNETS(symm). Given any A ∈ PSPACE/poly, there is by Theorem 3.1 a sequence of polynomial size asymmetric networks recognizing A. Rather than show how this specific sequence of networks can be simulated by symmetric networks, we shall show how to simulate the convergent computations of an arbitrary asymmetric network of n units and e edges of nonzero weight on a symmetric network of O(n + e) units and O(n²) edges.
The construction combines two network "gadgets": a simplified version of a mechanism due to Hartley and Szu (1987) for simulating an asymmetric edge by a sequence of symmetric edges and their interconnecting units, whose behavior is coordinated by a system clock (Figs. 2 and 3); and a binary counter network due to Goles and Martinez (1989; see also Goles and Martinez 1990, pp. 88-95) that can count up to 2^n using about 3n units and O(n²) symmetric edges (Fig. 4). An important observation here is that any convergent computation by a network of n units has to terminate in 2^n synchronous update steps, because otherwise the network repeats a configuration and goes into a cycle; hence, the exponential time counter network can be used to provide a sufficient number of clock pulses for the simulation to be performed. Let us first consider the gadget for a symmetric simulation of an asymmetric edge of weight w from a unit i to a unit j (Fig. 2). Here the idea is that the two units inserted between the units i and j in the symmetric network function as locks in a canal, permitting information to move only from left to right. The locks are sequenced by clock pulses emanating from the units labeled A and B, in cycles of period three as presented in Figure 3. Let us consider the behavior of the gadget starting at some time t = 0 (for simplicity), assuming that at this time clock A is on, the first intermediate unit is clear, clock B is off, and the current state of the simulated unit i is represented in the second intermediate unit. At time 1 clock B turns on, clearing the second intermediate unit at time 2 (note that the connection from unit j is not strong enough to turn this unit back on). However, simultaneously at time 2, a new state is computed at unit j, affected by the state that was in the intermediate unit at time 1.
Also, at time 2, clock A turns off, permitting a new state to be copied from unit i to the first intermediate unit at time 3 (i.e., just before the state of unit i becomes indeterminate). At time 3, clock A turns on again, clearing the first intermediate unit at time 4; but simultaneously at time 4 the new state is copied from the first to the second intermediate unit, from where it can then influence the computation of the new state of unit j at time 5. Note that in the construction, the first intermediate unit has no back effect on unit i, because each time a new state is computed at i, the intermediate unit is clear. The next question is how to generate the clock pulses A and B. It is not possible to construct a symmetric clock network that runs forever: at best such a network can end up oscillating between two states, but this is not sufficient to generate the period 3 pulse sequences required for the previous construction. However, Figure 4 presents the first two stages in the construction of a (3n - 4)-unit symmetric network with a convergence time of more than 2^n (actually, 2^n + 2^(n+1) - 3) synchronous update steps. [For the full details of the construction, see Goles and Martinez (1989).] The idea here is that the n units in the upper row implement a binary counter, counting from all 0s to all 1s (in the figure,
Discrete Hopfield Nets with Hidden Units
Figure 4: The first two stages in the construction of a binary counter network (Goles and Martinez 1989).
the unit corresponding to the least significant bit is to the right). For each "counter" unit added to the upper row, after the two initial ones, two "control" units are added to the lower row. The purpose of the latter is to first turn off all the "old" units when the new counter unit is activated, and from then on balance the input coming to the old units from the new units, so that the old units may resume counting from zero.² It is possible to derive from such a counter network a sufficient number of the A and B pulse sequences by means of the delay line network presented in Figure 5. Here the unit at the upper left corner is some sufficiently slow oscillator; since we require pulse sequences of period three, this could be the second counter unit in the preceding construction, which is "on" for four update steps at a time. (Thus, a 2^(n+1)-counter suffices to sequence computations of length up to 2^n - 1.) The delay line operates as follows: when the oscillator unit turns on, it "primes" the first unit in the line; but nothing else happens until the oscillator turns off. At that point the "on" state begins to travel down the line, one unit per update step, and the pulses A and B are derived from the appropriate points in the line. The value W used in the construction has to be chosen so large that the states of the units in the underlying network have no effect back on the delay line. It is sufficient to choose W larger than the total weight of the underlying network. Similarly, the weights and thresholds in the counter network have to be modified so that the connections to the delay line do not interfere with the counting. Assuming that W ≥ 3, it is here sufficient to multiply all the weights and thresholds by 6W, and then subtract one from each threshold. Note that as the counter network eventually converges in a state with all the units "on," the clock pulses correspondingly freeze in positions A "on" and B "off." This makes further updates in the underlying network impossible, but retains it in a consistent configuration. □

²Following the construction by Goles and Martinez (1989), we have made use of one negative self-connection in the counter network. If desired, this can be removed by making two copies of the least significant unit, both with threshold 0, interconnected by an edge of weight -1, and with the same connections to the rest of the network as the current single unit. All the other weights and thresholds in the network must then be doubled.

Figure 5: A delay line for generating clock pulses from the binary counter network in Figure 4.

Concerning the edge weights in the above constructions, one can see that in the network implementing the machine simulation (Figs. 1 and 2), the weights actually are bounded by some constant that depends only on the simulated machine M; in the delay line, the weights are proportional to the total weight of the underlying network; and the weights in the counter network (Fig. 4) are proportional to the length of the required simulation and, less significantly, to the weight of the delay line. (Note that each new counter unit doubles the running time of the counter network and, on the other hand, introduces weights of magnitude at most equal to the sum of all the earlier weights.) Thus, we obtain as a corollary to the construction that if the simulated Turing machine (or, more generally, asymmetric network) is known to converge in polynomial time, then it is sufficient to have polynomially bounded weights in the simulating symmetric network. Formulating this in terms of nonuniform complexity classes, we obtain:
Corollary 4.2. PNETS(symm, small) = P/poly. Consequently, anything that can be computed by asymmetric networks, or nonuniform Turing machines, in polynomially bounded time can also be computed by polynomial weight symmetric networks, with their guaranteed good convergence properties.
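The bound invoked in the proof, that any convergent computation of an n-unit network must halt within 2^n synchronous steps (otherwise a configuration repeats and the network cycles), is easy to check on small examples. The following Python sketch is illustrative code of my own, not part of the paper's construction:

```python
import itertools

def run_threshold_net(W, theta, state):
    """Synchronously update a threshold-logic network until its state
    either stabilizes ('converged') or repeats ('cycle').
    W[i][j] is the weight from unit j to unit i; theta[i] is the
    threshold of unit i; state is a 0/1 tuple."""
    seen = {state}
    for t in itertools.count(1):
        nxt = tuple(int(sum(W[i][j] * state[j] for j in range(len(state)))
                        >= theta[i]) for i in range(len(state)))
        if nxt == state:
            return 'converged', t
        if nxt in seen:
            return 'cycle', t
        seen.add(nxt)
        state = nxt

# An asymmetric 2-unit "rotator": every start state falls into a cycle,
# detected well within the 2^n = 4 step bound.
print(run_threshold_net([[0, 1], [-1, 0]], [1, 0], (0, 0)))  # → ('cycle', 4)
```

A symmetric choice such as W = [[0, 1], [1, 0]] with zero thresholds converges instead, in line with the convergence properties of symmetric networks discussed in the text.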
The result also implies that large (i.e., superpolynomial) weights are essential for the computational power of polynomial size symmetric networks if and only if P/poly ≠ PSPACE/poly. [The condition P/poly = PSPACE/poly is known to have the unlikely consequence of collapsing the "polynomial time hierarchy" to its second level (Karp and Lipton 1982).] In asymmetric networks large weights are not needed; in fact even bounded weights suffice, as can be seen by conceptually replacing each threshold logic unit by a corresponding AND/OR/NOT subcircuit.

5 Conclusion and Open Problems
We have characterized the classes of Boolean functions computed by asymmetric and, more interestingly, symmetric polynomial size recurrent networks of threshold logic units under a synchronous update rule. When no restrictions are placed on either computation time or the sizes of interconnection weights, both of these classes of networks compute exactly the class of functions PSPACE/poly. If interconnection weights are limited to be polynomial in the size of the network, the class of functions computed by symmetric networks reduces to P/poly. This limitation has no effect on the computational power of asymmetric nets. Although we have considered here only networks with discrete synchronous dynamics, it can be shown that any computation on such a network can also be performed on a slightly larger network with a totally asynchronous dynamics (Orponen 1995). Some of the open problems suggested by this work are the following. In the original associative memory model proposed by Hopfield (1982), all the units are used for both input and output, and no hidden units are allowed. Although this is a somewhat artificial restriction from the function computation point of view, characterizing the class of mappings computed by such networks would nevertheless be of some interest in the associative memory context. Of more general interest would be the study of the continuous-time version of Hopfield's network model (Hopfield 1984; Hopfield and Tank 1985). It will be an exciting broad research task to define the appropriate notions of computability and complexity in this model, and attempt to characterize its computational power.

Acknowledgments

I wish to thank Mr. Juha Karkkainen for improving on my initial attempts at simplifying the Hartley/Szu network, and suggesting the elegant construction presented in Figure 2.
A preliminary version of this work appears in Proceedings of the 20th International Colloquium on Automata, Languages, and Programming, Lecture Notes in Computer Science, Vol. 700, pp. 215-226. Springer-Verlag, Berlin, 1993. Part of this work was done while the author was visiting the Institute for Theoretical Computer Science, Technical University of Graz, Austria.
References

Alon, N., Dewdney, A. K., and Ott, T. J. 1991. Efficient simulation of finite automata by neural nets. J. ACM 38, 495-514.
Anderson, J. A., and Rosenfeld, E. (eds.) 1988. Neurocomputing: Foundations of Research. MIT Press, Cambridge, MA.
Balcazar, J. L., Diaz, J., and Gabarro, J. 1987. On characterizations of the class PSPACE/poly. Theoret. Comput. Sci. 52, 251-267.
Balcazar, J. L., Diaz, J., and Gabarro, J. 1988. Structural Complexity I. Springer-Verlag, Berlin.
Bruck, J., and Goodman, J. W. 1988. A generalized convergence theorem for neural networks. IEEE Trans. Inform. Theory 34, 1089-1092.
Floréen, P., and Orponen, P. 1993. Complexity Issues in Discrete Hopfield Networks. Tech. Rep. A-1994-4, University of Helsinki, Dept. of Computer Science, Helsinki, Finland. To appear in Parberry (in press).
Fogelman, F., Goles, E., and Weisbuch, G. 1983. Transient length in sequential iterations of threshold functions. Discr. Appl. Math. 6, 95-98.
Fogelman Soulié, F., Robert, Y., and Tchuente, M. 1987. Automata Networks in Computer Science: Theory and Applications. Manchester University Press, Manchester.
Goles, E. 1982. Fixed point behavior of threshold functions on a finite set. SIAM J. Alg. Discr. Meth. 3, 529-531.
Goles, E., and Martinez, S. 1989. Exponential transient classes of symmetric neural networks for synchronous and sequential updating. Complex Syst. 3, 589-597.
Goles, E., and Martinez, S. 1990. Neural and Automata Networks. Kluwer Academic, Dordrecht.
Goles, E., and Olivos, J. 1981. The convergence of symmetric threshold automata. Inform. Control 51, 98-104.
Goles, E., Fogelman, F., and Pellegrin, D. 1985. Decreasing energy functions as a tool for studying threshold networks. Discr. Appl. Math. 12, 261-277.
Haken, A. 1989. Connectionist networks that need exponential time to converge. Unpublished manuscript, 10 pp., University of Toronto, Dept. of Computer Science, January.
Haken, A., and Luby, M. 1988. Steepest descent can take exponential time for symmetric connection networks. Complex Syst. 2, 191-196.
Hartley, R., and Szu, H. 1987. A comparison of the computational power of neural networks. In Proceedings of the 1987 International Conference on Neural Networks, Vol. 3, pp. 15-22. IEEE, New York.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558. Reprinted in Anderson and Rosenfeld (1988, pp. 460-463).
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092. Reprinted in Anderson and Rosenfeld (1988, pp. 579-583).
Hopfield, J. J., and Tank, D. W. 1985. "Neural" computation of decisions in optimization problems. Biol. Cybern. 52, 141-152.
Horne, B. G., and Hush, D. R. 1994. Bounds on the complexity of recurrent neural network implementations of finite state machines. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 359-366. Morgan Kaufmann, San Francisco, CA.
Indyk, P. 1995. Optimal simulation of automata by neural nets. In Proceedings of the 12th Annual Symposium on Theoretical Aspects of Computer Science (Munich, Germany, March 1995), Lecture Notes in Computer Science 900, pp. 337-348. Springer-Verlag, Berlin.
Kamp, Y., and Hasler, M. 1990. Recursive Neural Networks for Associative Memory. John Wiley, Chichester.
Karp, R. M., and Lipton, R. J. 1982. Turing machines that take advice. L'Enseignement Math. 28, 191-209.
Kleene, S. C. 1956. Representation of events in nerve nets and finite automata. In Automata Studies, C. E. Shannon and J. McCarthy, eds., pp. 3-41. Annals of Mathematics Studies no. 34. Princeton Univ. Press, Princeton, NJ.
Lepley, M., and Miller, G. 1983. Computational power for networks of threshold devices in an asynchronous environment. Unpublished manuscript, 6 pp., Massachusetts Inst. of Technology, Dept. of Mathematics.
McCulloch, W. S., and Pitts, W. 1943. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115-133. Reprinted in Anderson and Rosenfeld (1988, pp. 18-27).
Minsky, M. L. 1972. Computation: Finite and Infinite Machines. Prentice-Hall, Englewood Cliffs, NJ.
Orponen, P. 1995. Computing with truly asynchronous threshold logic networks. Theor. Comput. Sci. (in press).
Parberry, I. 1990. A primer on the complexity theory of neural networks. In Formal Techniques in Artificial Intelligence: A Sourcebook, R. B. Banerji, ed., pp. 217-268. Elsevier-North-Holland, Amsterdam.
Parberry, I. 1994. Circuit Complexity and Neural Networks. MIT Press, Cambridge, MA.
Parberry, I. (ed.) 1995. The Computational and Learning Complexity of Neural Networks: Advanced Topics. (In preparation.)
Poljak, S., and Sura, M. 1983. On periodical behaviour in societies with symmetric influences. Combinatorica 3, 119-121.
Schäffer, A. A., and Yannakakis, M. 1991. Simple local search problems that are hard to solve. SIAM J. Comput. 20, 56-87.
Siegelmann, H. T., and Sontag, E. D. 1994. Analog computation via neural networks. Theor. Comput. Sci. 131, 331-360.
Tchuente, M. 1986. Sequential simulation of parallel iterations and applications. Theor. Comput. Sci. 48, 135-144.
Wegener, I. 1987. The Complexity of Boolean Functions. John Wiley, Chichester, and B. G. Teubner, Stuttgart.
Received April 4, 1995; accepted June 21, 1995
Communicated by David Willshaw
A Self-organizing Neural Network for the Traveling Salesman Problem That Is Competitive with Simulated Annealing

Marco Budinich
Centre for Neural Networks, King's College London, London, UK
Unsupervised learning applied to an unstructured neural network can give approximate solutions to the traveling salesman problem. For 50 cities in the plane this algorithm performs like the elastic net of Durbin and Willshaw (1987), and it improves as the number of cities increases, becoming better than simulated annealing for problems with more than 500 cities. In all the tests this algorithm requires a fraction of the time taken by simulated annealing.

1 Introduction
The Traveling Salesman Problem (TSP) is the archetype of NP-complete problems and has attracted considerable research effort (see, e.g., Lawler et al. 1985). In one of its simplest forms there are n cities in the plane and the problem is to find the shortest closed tour that visits each city once. While all exact solutions take a time that grows exponentially with n, there are many heuristic algorithms for practical instances (Johnson 1990). In this paper I propose a neural network algorithm that produces approximate solutions for the TSP. Unlike more traditional neural network approaches (Hopfield and Tank 1985), here the solution is given by the self-organizing properties of unsupervised learning. The basic idea comes from the observation that in one dimension the exact solution to the TSP is trivial: always step to the nearest unvisited city. Consequently, given a TSP with cities in the plane, if one can map them "smartly" to a set of cities distributed on a circle, one will easily find the shortest tour for these "image cities" and a visit order for them: this visit order will also give a tour for the original cities. Obviously if the map preserves perfectly all the distance relations of the cities this will be the shortest TSP tour, i.e., the exact solution. Unfortunately a perfectly neighborhood preserving map does not exist in general, but it is reasonable to conjecture that the better the distance relations are

*Permanent address: Dipartimento di Fisica & INFN, Via Valerio 2, 34127 Trieste, Italy. e-mail: [email protected]
Neural Computation 8, 416-424 (1996) © 1996 Massachusetts Institute of Technology
Neural Network for the TSP
preserved, the better will be the approximate solution found. In this way the original TSP is transposed to the search of a good neighborhood preserving map. There are several ways to search for such a map: Durbin and Willshaw (1987) did it by relaxing a fictitious physical system, an "elastic net." Another method is to use a self-organizing neural net to find neighborhood preserving maps. These nets, proposed to model the self-organizing feature maps in the brain (Von der Malsburg 1973; Kohonen 1984), can find good neighborhood preserving maps through unsupervised learning. In this way the TSP is ultimately solved by unsupervised learning. In conclusion there are two steps to get a solution for the TSP: the first is to teach the problem to a self-organizing neural net that, while learning, builds up the neighborhood preserving map. In the second step, from the solution for the image cities, one obtains a tour for the original TSP. In what follows I will present this algorithm in detail and compare it with the well-established simulated annealing technique (Kirkpatrick et al. 1983) and, given the similarity of the approaches, with the elastic net (Durbin and Willshaw 1987).
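The one-dimensional observation above can be made concrete: once cities have images on a circle, the shortest tour is simply their angular order. A tiny sketch (hypothetical helper of my own, assuming 2-D image points lying roughly on a ring):

```python
import math

def tour_from_circle(points):
    """Given city images on a circle (2-D points around their centroid),
    return the visit order that walks around the ring."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    # sort city indices by angle about the centroid
    return sorted(range(len(points)),
                  key=lambda i: math.atan2(points[i][1] - cy,
                                           points[i][0] - cx))

# Four cities already on a ring: angular order recovers the perimeter walk.
square = [(1, 0), (0, 1), (-1, 0), (0, -1)]
print(tour_from_circle(square))  # → [3, 0, 1, 2]
```

The whole difficulty of the approach therefore lies in constructing a good neighborhood preserving map onto the circle, which is what the self-organizing net is used for below.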
2 Solving the TSP with Self-organizing Maps
The TSP considered here is given by n cities randomly distributed in the (0,1) square. The network has two input neurons that pass the (x, y) coordinates of the cities to n output neurons. The output neurons form a ring; the distance D(i, j) between neurons i and j is one plus the minimum number of neurons between i and j. Each output neuron has just two weights (w_x, w_y) and, in response to the input (x, y) from a city ξ, gives an output o. The learning algorithm is the following:

1. set the weights to initial random values in [0,1];

2. select a city at random, say ξ, and feed its coordinates to the input neurons;

3. find the output neuron with maximal output, say m;¹

¹This definition is ambiguous unless city and weight vectors are somehow normalized. Since both cities and weights define points in the plane, the problem can be circumvented by defining the most active neuron for city ξ as the neuron whose weights define the nearest point to ξ. Simple algebra shows that the two definitions are equivalent.
Marco Budinich
Figure 1: Schematic net; not all connections from input neurons are drawn. In this example D(i, j) = 2.

4. train m and its neighbors up to a distance d with the Hebb rule; the training affects 2d + 1 neurons:

   w_i' = w_i + α(ξ - w_i)   for all i such that D(i, m) ≤ d;
5. update the parameters d and α according to a predefined schedule and, if the learning loops are not yet finished, go to 2.

A preliminary study of the learning parameters gave this recipe to obtain the best performances:
- the number of output neurons is fixed and equal to the number of cities n;

- 50 learning loops for each city of the problem;

- the learning constant α starts from 0.8 and its value is decreased linearly at each learning loop, reaching zero at the last iteration;

- the most active neuron m is trained with learning constant α; its neighbors are trained with a value of α that decreases linearly with D(i, m) and vanishes for neurons at a distance greater than d from m;

- the distance of update d is 6.2 + 0.037n at the start and is decreased linearly to 1 in the first 65% of the iterations. For the successive 35% of the learning loops d remains constant at 1.
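The recipe above can be sketched in Python. This is my own illustrative code, with the parameter schedules simplified; the function and its arguments are not from the paper:

```python
import random

def ring_distance(i, j, n):
    """D(i, j) on the ring: one plus the number of neurons strictly
    between i and j (0 when i == j)."""
    d = abs(i - j) % n
    return min(d, n - d)

def train_som_ring(cities, loops=50, alpha0=0.8):
    """Self-organizing ring for the TSP (illustrative sketch only)."""
    n = len(cities)
    w = [[random.random(), random.random()] for _ in range(n)]
    total = loops * n
    d0 = 6.2 + 0.037 * n
    step = 0
    for _ in range(loops):
        for cx, cy in random.sample(cities, n):     # shuffled presentation
            frac = step / total
            alpha = alpha0 * (1 - frac)             # linear decay to 0
            # neighborhood radius: linear decay to 1 in the first 65%
            d = max(1.0, d0 * (1 - frac / 0.65)) if frac < 0.65 else 1.0
            # winner = neuron whose weights are nearest to the city
            m = min(range(n),
                    key=lambda i: (w[i][0] - cx) ** 2 + (w[i][1] - cy) ** 2)
            for i in range(n):
                dist = ring_distance(i, m, n)
                if dist <= d:
                    a = alpha * (1 - dist / (d + 1))  # decays with ring distance
                    w[i][0] += a * (cx - w[i][0])
                    w[i][1] += a * (cy - w[i][1])
            step += 1
    return w
```

After training, each city can be assigned to the neuron whose weights lie nearest to it, and the order of the winning neurons along the ring induces a tour.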
Figure 2: Three activity profiles for a net with 30 neurons obtained presenting three different cities (of the 30 of the original TSP). Cities A and B both give maximal activity in neuron number 10.
After learning, the net maps neighboring cities to neighboring neurons: for each city its image is given by the neuron with maximal activity. One easily finds the shortest tour for the images of the cities that, in turn, gives a tour for the TSP. This straightforward version of the learning algorithm has a major flaw. The map it produces is not injective: many cities can be mapped to the same neuron (this happens for a fraction between 45 and 50% of the total number of cities n for 10 ≤ n ≤ 1000). When two or more cities are mapped to the same neuron one cannot say which of them has to come first in the tour, and this problem substantially reduces the performance of this algorithm (Angeniol and Walker 1988; Favata and Walker 1991).
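The non-injectivity can be quantified directly: map each city to its most active (nearest-weight) neuron and count how many cities share an image. A toy illustration (hypothetical code and data of my own, not the author's measurement):

```python
from collections import Counter

def winners(cities, w):
    """Map each city to its most active neuron, i.e., the neuron whose
    weight point is nearest to the city."""
    return [min(range(len(w)),
                key=lambda i: (w[i][0] - x) ** 2 + (w[i][1] - y) ** 2)
            for x, y in cities]

def collision_fraction(cities, w):
    """Fraction of cities that share their image neuron with another city."""
    counts = Counter(winners(cities, w))
    shared = sum(c for c in counts.values() if c > 1)
    return shared / len(cities)

# Toy example: two clustered cities collapse onto the same neuron.
w = [(0.0, 0.0), (1.0, 1.0)]
cities = [(0.1, 0.0), (0.0, 0.1), (0.9, 1.0)]
print(collision_fraction(cities, w))  # → 0.666... (two of three collide)
```

For colliding cities the visit order within the shared neuron is undefined, which is exactly the ambiguity the next section removes.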
3 A New TSP Algorithm
A decisive insight comes from the "activity profile" of the net: a plot of the neuron outputs in response to a given city. Figure 2 contains the activity profiles for 3 cities on a net with 30 neurons. Cities A and B both produce the maximal activity on the tenth neuron. When mapping cities to neurons we associate to each city a coordinate along the ring. The standard choice of the maximally active neuron
produces an integer coordinate that is the modal value of the activity profile. A better choice turns out to be an averaging process on the neurons to obtain a real coordinate: for example, a weighted coordinate calculated with the maximally active neuron and its nearest neighbors, using their activities as weights. In this way each city is mapped to a point with a real coordinate along the ring, the map becomes injective, and the ambiguities in the tour disappear. This new mapping produces substantially shorter tours, showing that there is valuable information also in the neurons near the most active one. The average not only uses this additional information but is also a better and more plausible estimator of the city image than the modal value, since it is a linear function of the neuron activities and makes it easier to imagine a further layer of neurons doing this job. In what follows I try to assess the effectiveness of this algorithm by comparing it with two other stochastic ones: simulated annealing (Kirkpatrick et al. 1983) and the elastic net algorithm (Durbin and Willshaw 1987).

4 Performances
It is known that comparing the performance of stochastic TSP algorithms is not indisputable, especially when the length of the shortest tour is not available (Johnson 1990). I took a rather conservative approach, comparing the average length obtained in 10 runs of each algorithm on the same problem while keeping track separately of the CPU time used. This solution is less subject to fluctuations than, for example, a comparison based on the minimal length obtained by an algorithm in a fixed amount of CPU time. The implementation of simulated annealing used in these tests is that published in Müller and Reinhardt (1990), which is tailored to the TSP since it uses exchange terms inspired by the Lin and Kernighan (1973) heuristics. With the chosen annealing factor of 0.95 the algorithm gave better performances than those quoted by Durbin and Willshaw (1987) on the same problems. Figure 3 is a test of its stability, and the lower, dashed, curve shows that its performances remain reasonably constant, with respect to the theoretical lower bound of 0.749√n, when the number of cities n varies between 10 and 1000. Johnson (1990) performed extended comparisons of the performances of various TSP heuristics including simulated annealing (even if not in the implementation used in this work). He showed that when the performances are measured with the average of different runs, simulated annealing is favored and, even if in longer times, beats the most reputed heuristics like that of Lin and Kernighan (1973) and, a fortiori, 2-opt and 3-opt.
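For orientation, a generic simulated-annealing TSP loop with 2-opt segment reversals looks like the following. This is an illustrative sketch of my own, not the Müller and Reinhardt implementation used for the tests (which adds Lin and Kernighan style exchange moves):

```python
import math, random

def tour_length(tour, cities):
    """Total length of the closed tour through the given city indices."""
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def anneal(cities, temp=1.0, factor=0.95, sweeps=100):
    """Simulated annealing with 2-opt reversals; cool by `factor` per sweep."""
    n = len(cities)
    tour = list(range(n))
    length = tour_length(tour, cities)
    while temp > 1e-3:
        for _ in range(sweeps):
            i, j = sorted(random.sample(range(n), 2))
            cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
            delta = tour_length(cand, cities) - length
            # accept improving moves always, worsening ones with Boltzmann prob.
            if delta < 0 or random.random() < math.exp(-delta / temp):
                tour, length = cand, length + delta
        temp *= factor          # annealing factor of 0.95, as in the text
    return tour, length
```

A production implementation would evaluate the 2-opt move incrementally (only the two changed edges) rather than recomputing the whole tour length.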
Figure 3: Ratio of the average tour length obtained in 10 runs over the theoretical lower bound of 0.749√n (thick curve). The upper, dotted, curve shows the results obtained from (my implementation of) the Favata and Walker algorithm (1991), while the lower, dashed, curve shows results from simulated annealing.
Figure 4: Ratio of the average tour length obtained in 10 runs from this algorithm over those given by simulated annealing; the number of cities varies between 10 and 1000.

Figure 4 contains the ratio of the average tour lengths of this algorithm over those obtained from simulated annealing. The neural net produces shorter tours than those of simulated annealing for problems with more than n = 500 cities. It is interesting to speculate on the properties of this algorithm as
Table 1: Comparison of Average Tour Length in 5 Runs for the Same Problems

City set    Durbin (1987)    This algorithm    Difference (%)
1           5.98             5.975             -0.08
2           6.03             6.110             +1.33
3           5.74 (5.70)      5.737             -0.05 (+0.65)
4           5.90 (5.86)      5.830             -1.19 (-0.51)
5           6.49             6.583             +1.43
depending on n, the number of cities of the problem. The amount of memory grows like O(n) while the time employed grows at most like O(n²), since both the learning loops and the search of the most active neuron depend linearly on n. However, this is probably just an upper bound since Figure 3 shows that this prescription for the number of iterations gives performances that increase with n. Consequently it is possible that, to maintain constant performances, one can use a number of learning loops that increases less than linearly with n. Comparing the CPU time of the runs used for Figure 3, it appears that for 100 cities this algorithm is 10 times faster than simulated annealing, and it is 3 times faster for 1000 cities, producing, in this case, better average performances. This decrease of the ratio is probably due to the nonuniform performances when varying the number of cities. It is remarkable that this net can perform better than simulated annealing in its best achievement (average length), compensating at the same time its main weakness: slowness. A different test of the performances was made on the 5 city sets used by Durbin and Willshaw (1987). To test the algorithm in conditions as similar as possible to those of the elastic net, the parameters of the net were tuned for 50 cities, and the average performances were taken over 5 runs. Table 1 contains a comparison of the performances and shows that they are substantially equal, but the CPU times are rather different. Since exact time data were unavailable for comparisons, I considered the time used by each algorithm relative to the time taken by simulated annealing. Whereas Durbin and Willshaw report a time 30% longer than that employed by simulated annealing, this net is 10 times faster than simulated annealing.
5 Conclusions
Even if this is the first time a neural network algorithm proves to be competitive with simulated annealing, deeper analysis is needed to establish if it has any relevance as a serious TSP heuristic. Just to mention another
limitation, it is well known that TSPs with random cities in the plane are relatively easy. Nevertheless the results show that this net works satisfactorily on a provably difficult problem and especially at large sizes: a not so frequent quality in neural networks. Even more important is that the results derive from a well-established learning procedure applied to a previously unstructured net. This puts this approach in a favorable position when compared to those that solve the TSP by relaxing a finely pretuned neural network (Hopfield and Tank 1985). It is also intriguing that the theory behind this algorithm, intimately connected to self-organizing processes (Erwin et al. 1992), is today not really understood (even if the ordering theorem of Kohonen (1984) can be extended to the two- to one-dimensional case (Budinich and Taylor 1995)).
Acknowledgments
I want to acknowledge the many, extremely fruitful, discussions with John G. Taylor at King’s College. I also thank warmly King’s College, the British Council, and the “Consiglio Nazionale delle Ricerche” for supporting my visit at King’s, and D. Willshaw and M. Simmen for kind hospitality and for providing the TSPs used in Durbin and Willshaw (1987).
References

Angéniol, B., de La Croix Vaubois, G., and Le Texier, J.-Y. 1988. Self-organising feature maps and the travelling salesman problem. Neural Networks 1, 289-293.
Budinich, M., and Taylor, J. G. 1995. On the ordering conditions for self-organizing maps. Neural Comp. 7(2), 284-289.
Durbin, R., and Willshaw, D. 1987. An analogue approach to the travelling salesman problem using an elastic net method. Nature (London) 336, 689-691.
Erwin, E., Obermayer, K., and Schulten, K. 1992. Self organising maps: Ordering, convergence properties and energy functions. Biol. Cybern. 67, 47-55.
Favata, F., and Walker, R. 1991. A study of the application of Kohonen-type neural networks to the travelling salesman problem. Biol. Cybern. 64, 463-468.
Johnson, D. 1990. In Proceedings of the 17th Colloquium on Automata, Languages and Programming, pp. 446-461. Springer-Verlag, New York.
Hopfield, J. J., and Tank, D. W. 1985. Neural computation of decisions in optimization problems. Biol. Cybern. 52, 141-152.
Kirkpatrick, S., Gelatt, C. D., Jr., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220, 671-680.
Kohonen, T. 1989. Self-Organization and Associative Memory, 3rd ed. Springer-Verlag, Berlin.
Lawler, E. L., Lenstra, J. K., Rinnooy Kan, A. G., and Shmoys, D. B., eds. 1985. The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization. John Wiley, New York.
Lin, S., and Kernighan, B. W. 1973. An effective heuristic algorithm for the traveling salesman problem. Operations Res. 21, 498-516.
Müller, B., and Reinhardt, J. 1990. Neural Networks. Springer-Verlag, Berlin.
Von der Malsburg, Ch. 1973. Self-organising of orientation sensitive cells in striate cortex. Kybernetik 14, 85-100.

Received January 18, 1994; accepted June 13, 1995
Communicated by Joachim Buhmann
Hierarchical, Unsupervised Learning with Growing via Phase Transitions

David Miller
Kenneth Rose
Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA 93106 USA
We address unsupervised learning subject to structural constraints, with particular emphasis placed on clustering with an imposed decision tree structure. Most known methods are greedy, optimizing one node of the tree at a time to minimize a local cost. By contrast, we develop a joint optimization method, derived based on information-theoretic principles and closely related to known methods in statistical physics. The approach is inspired by the deterministic annealing algorithm for unstructured data clustering, which was based on maximum entropy inference. The new approach is founded on the principle of minimum cross-entropy, using informative priors to approximate the unstructured clustering solution while imposing the structural constraint. The resulting method incorporates supervised learning principles applied in an unsupervised problem setting. In our approach, the tree "grows" by a sequence of bifurcations that occur while optimizing an effective free energy cost at decreasing temperature scales. Thus, estimates of the tree size and structure are naturally obtained at each temperature in the process. Examples demonstrate considerable improvement over known methods.

1 Introduction

The problem of clustering has long been of interest to both scientists and engineers. Clustering methods are basic tools for unsupervised data analysis and classification, and these procedures have been developed and applied within a variety of fields. While its basic domain is pattern recognition, clustering is important in data compression, statistics, neural networks, and the natural sciences. A review of the problem can be found in Duda and Hart (1974) and Jain and Dubes (1988). Clustering is often made precise by the specification of a cost criterion to be minimized. An important objective is the sum of squared distances cost or "energy" function
E = Σ_j Σ_{x ∈ C_j} |x − y_j|²   (1.1)
Neural Computation 8, 425-450 (1996) © 1996 Massachusetts Institute of Technology
where C_j is the jth cluster with representative (mean) y_j and where x is an element of the data set. While numerous methods have been proposed for optimizing costs such as E, most procedures are modifications or extensions of the basic Isodata algorithm (Ball and Hall 1967) or its sequential relative, the K-means algorithm (MacQueen 1967). Related procedures have also been proposed for scalar quantization (Lloyd 1982), vector quantization (VQ) (Linde et al. 1980), and fuzzy clustering (Dunn 1974; Bezdek 1980). Most cost functions, including the sum of squared distances, are nonconvex with many local minima to trap standard descent methods like basic Isodata. These methods depend strongly on the initialization, and some capability for avoiding or escaping local minima is necessary for improving them. Annealing methods are powerful techniques for global optimization, motivated by the physical analogy. The stochastic simulated annealing method (Kirkpatrick et al. 1983) has been proven to converge in distribution to the set of globally optimal solutions (Geman and Geman 1984), but the computational burden of the method limits its applicability. Alternatively, a deterministic approximation to simulated annealing was proposed within several different problem domains (Simic 1990; Yuille 1990; Geiger and Girosi 1991; Durbin and Willshaw 1987). For clustering, the deterministic annealing method (DA) was developed based on information-theoretic principles (Rose et al. 1990; Rose et al. 1992). DA has been shown to achieve significant improvement over conventional approaches for the basic clustering problem. Moreover, the method has been generalized to address several important extensions of the basic problem (Rose et al. 1993; Buhmann and Kuhnel 1994; Miller and Rose 1994a; Miller et al. 1994).
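The basic descent approach discussed above can be illustrated with a minimal sketch of one Isodata/K-means pass over the cost E of equation 1.1. This is an illustrative example only, not the method proposed in this paper; the function name and variable layout are our own.

```python
import numpy as np

def kmeans_step(X, Y):
    """One Isodata/K-means pass: assign each point in X to its nearest
    representative in Y, then move each representative to the centroid of
    its cluster.  Returns the updated representatives and the cost E of
    eq. 1.1 evaluated at the incoming assignment."""
    # squared distances between all points and all representatives
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    E = d2[np.arange(len(X)), assign].sum()
    # centroid update; empty clusters keep their old representative
    Y_new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                      else Y[j] for j in range(len(Y))])
    return Y_new, E
```

Iterating this step decreases E monotonically until a fixed point (a local minimum) is reached, which is exactly the sensitivity to initialization that motivates the annealing methods discussed next.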
Recently, close ties were established between the DA approach and rate-distortion theory, yielding contributions to the fundamental information-theoretic problem of rate-distortion function computation and analysis (Rose 1994). In the present work, we seek to generalize the DA approach to address the problem in which structural constraints are imposed on the clustering solution. A typical structure is a tree, for which solutions can be represented by a decision tree diagram, as shown in Figure 1 for a binary tree of depth three. Here, the tree parameters, when used with an associated measure of distance, collectively specify a partitioning of the data space. For a tree of depth L and branching factor K, we denote the node parameters {s_j^l}, where l = 1,...,L denotes the layer of the tree and j = 0,...,K^l − 1 denotes the node's position within the layer. There is an important distinction between the role of internal nodes, which specify a hierarchical partitioning, and the role of leaf nodes (layer L), which are the cluster representative vectors for the partition regions. While this distinction will be further explained later, we emphasize the different roles of the leaf and nonleaf layers by introducing a special notation for the leaf parameter set, Y = {y_j} = {s_j^L}, and referring to these parameters as code vectors or cluster representatives. The internal-node parameters, which
Figure 1: A tree diagram for a balanced, binary tree of depth three.
are also typically vector-valued, will be referred to as test vectors. This terminology is borrowed from the vector quantization literature. Specializing the description to binary trees (K = 2), the test vectors {s_j^l} determine a sequence of nested half-space tests that, when traversed from root to leaf, lead to the partition regions of the data space. At each pair of sibling nodes (s_j^l, s_{j+1}^l), j even, an equation of the form d(x, s_j^l) = d(x, s_{j+1}^l) specifies a decision boundary. Here, d(·,·) is a dissimilarity ("distance") measure. If d(·,·) is the commonly used squared distance, then the boundaries are hyperplanes and the partition regions are convex cells. This description is trivially extended to trees of higher branching factors, where a Voronoi (nearest neighbor) partition is used by the set of siblings to partition the cell belonging to their parent. Tree-structured clustering has important applications in unsupervised learning, including vector quantization (Buzo et al. 1980) and numerical taxonomy (Sokal and Sneath 1963). Moreover, the clustering solution is relevant to supervised learning problems such as prototype-based statistical classifier design as well, since in practice class labels may not exist or obtaining them may be an expensive process. In these cases, clustering is an important design step and several methods have been applied (Lippmann 1989; Farrell et al. 1994). There are several advantages to the tree-structured architecture. For classification and regression, the structure can be efficiently pruned to search for a parsimonious model or for the minimum cost structure
given a constraint on model complexity (Breiman et al. 1980; Chou et al. 1989). Another advantage relates to the complexity of statistical classification (known as the encoder search complexity in VQ). Unstructured prototype-based classifiers and vector quantizers may require an exhaustive search for the nearest prototype, which is impractical when the feature space and number of prototypes are large. Alternatively, there are methods for implementing efficient nearest prototype searches (e.g., Gersho and Gray 1992, Chapter 10), but these approaches may require substantial increases in memory storage, and the reduction in search is not guaranteed to be significant in general. The alternative that has been gaining increasing popularity with VQ researchers is to impose structural constraints on the design. The tree-structured classifiers we consider do not guarantee finding the nearest prototype, but they do achieve a substantial reduction in classification search. In the VQ context, this property makes tree-structured vector quantizers (TSVQs) a practical alternative to full search (unstructured) VQ, and, indeed, structurally constrained VQ has been intensively investigated in recent years (Gersho and Gray 1992, Chapter 12). While standard optimization methods for unstructured clustering are iterative descent procedures that guarantee convergence to a local optimum, approaches to tree-structured clustering such as the splitting algorithm (Hartigan 1975; Buzo et al. 1980; Riskin and Gray 1991) are greedy, optimizing a local cost to grow a tree one node at a time. The primary reason is that whereas an optimal partition design step is readily specified by the nearest neighbor rule in the unstructured case, in the tree-structured case an optimal partition is determined only by solving a formidable multiclass risk discrimination problem (Duda and Hart 1974).
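The nested half-space tests described above can be sketched for a balanced binary tree with squared distance. The `layers` layout used here is a hypothetical convention of our own: `layers[l]` holds the 2^(l+1) node vectors of layer l+1, ordered so that nodes 2j and 2j+1 are the children of node j, with the last entry holding the leaf (code) vectors.

```python
import numpy as np

def tree_classify(x, layers):
    """Descend a balanced binary tree: at each layer, move to whichever
    of the two sibling node vectors is nearer to x (squared distance).
    Returns the index of the leaf cell reached."""
    j = 0
    for nodes in layers:
        left, right = nodes[2 * j], nodes[2 * j + 1]
        j = 2 * j + (0 if ((x - left) ** 2).sum() <= ((x - right) ** 2).sum() else 1)
    return j
```

This traversal visits only L sibling pairs rather than all K^L leaves, which is the search-complexity advantage of the tree structure, at the price of not guaranteeing the globally nearest prototype.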
Accordingly, standard methods grow trees in a greedy fashion, using heuristics both for determining the order of node splits (and hence the tree structure) and for performing the node splitting. While pruning can be used to find a good cost/complexity tradeoff (Chou et al. 1989), the solution quality of this method depends on the initial tree. Pruning does not move partition region boundaries to improve the solution, but rather removes boundaries from the solution. Alternatives to greedy methods were recently proposed for the general regression problem (Jordan and Jacobs 1994) and for the specific case of tree-structured clustering (Miller and Rose 1994b). Both the approach of Jordan and Jacobs and our own preliminary work are based on a probabilistic formulation. However, their method performs direct descent on a cost surface and thus may be trapped in poor local minima, while our approach for the restricted problem is an annealing approach and attempts to avoid local minima. In this paper we fundamentally extend our preliminary work, both in its theoretical development and its practical application. First, the method is given a more sound theoretical justification based on the information-theoretic principle of minimum cross-entropy. This inference is
used to introduce a new paradigm for structured clustering within a probabilistic framework: the goal of approximating the optimal (unstructured) clustering solution while imposing the structural constraint. Second, it is recognized that the clustering solution can be improved by generalizing the class of distance measures used by the structured hierarchy. An example will be shown using Mahalanobis distance (Duda and Hart 1974). Finally, and most importantly, we note that whereas our earlier work used annealing to optimize a tree of fixed structure, a more fundamental advantage of the approach relates to phase transitions in the solution process and their connection with growth in the tree model. In particular, it is recognized here that our effective tree model "grows" by bifurcations, which occur while minimizing a free energy cost at decreasing "temperature" scales. Thus, our method naturally provides an estimate of the model size and structure at each temperature. In the sequel, we will first generalize the probabilistic inference used by DA for the unstructured clustering problem to include prior probabilities.

2 A Tree-Structured Clustering Method

2.1 Minimum Cross Entropy Inference. Even if one is interested in a "hard" (i.e., nonfuzzy) clustering solution, it may still be useful within the context of an optimization method to consider points associated in probability with clusters. In the basic DA method (Rose et al. 1990), no underlying assumptions were made about the data distribution. Accordingly, the principle of maximum entropy (Jaynes 1989) was invoked to obtain probabilities of association {P[x ∈ C_j]} between data points and cluster representatives. More concretely, the probabilistic inference was obtained by posing for each data point
max_P { − Σ_j P[x ∈ C_j] log P[x ∈ C_j] }   (2.1)

subject to

⟨E_x⟩ = Σ_j P[x ∈ C_j] d(x, y_j).

The solution is the Gibbs distribution

P[x ∈ C_j] = e^{−β d(x, y_j)} / Σ_k e^{−β d(x, y_k)}   (2.2)
where the denominator is a partition function from statistical physics and β is a Lagrange multiplier. Now suppose that there is prior knowledge relating points and clusters, stated in the form of probabilities {Q[x ∈ C_j]}. The natural generalization of maximum entropy inference to include a prior is the principle
of minimum cross entropy (Kullback 1969). In Shore and Johnson (1980) it was shown that this principle is the consistent principle of inference given new information. Accordingly, we pose the problem

min_P Σ_j P[x ∈ C_j] log ( P[x ∈ C_j] / Q[x ∈ C_j] )   (2.3)

subject to

⟨E_x⟩ = Σ_j P[x ∈ C_j] d(x, y_j).

The solution is the so-called "tilted" distribution:

P[x ∈ C_j] = Q[x ∈ C_j] e^{−β d(x, y_j)} / Σ_k Q[x ∈ C_k] e^{−β d(x, y_k)}   (2.4)

The Lagrange multiplier β determines the value of ⟨E_x⟩. It can also be interpreted as an inverse "temperature" influencing the degree of fuzziness of the distribution. For uniform Q, these associations revert to the maximum entropy associations of equation 2.2, as we expect. We note further that for β = 0 we obtain P[x ∈ C_j] = Q[x ∈ C_j], i.e., we give full weight to the prior and ignore the clustering distortion. At the other limit of β → ∞ we minimize distortion and totally ignore the prior, except concerning classifications that are precluded by the prior, i.e., those for which Q[x ∈ C_j] = 0 and hence P[x ∈ C_j] = 0. This restriction of the distribution through the prior will later provide a tool for imposing constraints on the clustering solution within an annealing optimization framework. The partition function associated with a single datum is the denominator of equation 2.4. The total partition function over the entire training set is the product
Z = Π_x Z_x,  Z_x = Σ_j Q[x ∈ C_j] e^{−β d(x, y_j)}   (2.5)

Correspondingly, the free energy in the physical analogy is

F = −(1/β) log Z = −(1/β) Σ_x log Z_x   (2.6)
This function is a generalization of the effective cost minimized by the basic DA method. The free energy is the quantity minimized at isothermal equilibrium by simulated annealing, and it can also be viewed as the log likelihood associated with a mixture model, as discussed in Jordan and Jacobs (1994). For {Q[·]} → {0,1} or for β → ∞, the free energy is equivalent to a hard clustering distortion. Minimization of this cost with respect to the cluster representatives can be realized for a given prior by an annealing approach, as in the original DA method (Rose et al. 1992),
wherein F is minimized starting from high temperature (small β) and the solution is tracked while the temperature is lowered. The annealing process is useful for avoiding local optima of the cost. The condition for optimizing the free energy at any temperature is

∂F/∂y_j = 0,  ∀j   (2.7)

or the centroid rule

Σ_x P[x ∈ C_j] (∂/∂y_j) d(x, y_j) = 0,  ∀j.   (2.8)

For the squared distance measure, we may write

y_j = Σ_x P[x ∈ C_j] x / Σ_x P[x ∈ C_j]   (2.9)
which can be iterated until a fixed point is reached at each temperature. Equation 2.9 connects the method with statistical approaches, since these iterations are a special instance of the expectation-maximization (EM) algorithm (Dempster et al. 1977; see also Yuille et al. 1994). Of course, there may be methods that are more efficient than fixed point iterations for minimizing F at each temperature. While this introduction of a prior within the DA framework may have application to supervised learning,¹ in this paper we will focus on an unsupervised learning problem (tree-structured clustering) and demonstrate that the inclusion of a prior is especially useful in this context, as it allows explicit quantification of the dependencies between the leaf and nonleaf layers.

2.2 Tree-Structured Formulation. We now relate the previous section's results to structurally constrained clustering. For clarity's sake, we first consider a simple two-layer binary tree and then show that our framework is easily extended to trees of any breadth and depth. Thus, we have a single internal layer with two test vectors s_0 and s_1 (note that we drop the redundant superscript l = 1), and a leaf layer consisting of four code vectors y_0, y_1, y_2, and y_3. Assuming, for simplicity, that squared distance is used both for classification in the internal layer and for measuring the clustering cost at the leaves, the nonleaf decision boundary d(x, s_0) = d(x, s_1) is a hyperplane, dividing the feature space into two regions, which are then further subdivided by the leaf layer.
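The association probabilities that underlie this construction can be sketched in a few lines. The following illustrative helper (assuming squared distance; the function name is our own) computes the tilted associations of equation 2.4 for one datum; with a uniform prior Q it reduces to the Gibbs associations of equation 2.2.

```python
import numpy as np

def tilted_associations(x, Y, Q, beta):
    """Tilted association probabilities (eq. 2.4) of one datum x with
    representatives Y under prior Q, squared distance as d.  The
    distances are shifted by their minimum for numerical stability,
    which leaves the normalized probabilities unchanged."""
    d = ((np.asarray(Y) - x) ** 2).sum(axis=1)
    w = np.asarray(Q) * np.exp(-beta * (d - d.min()))
    return w / w.sum()
```

At beta = 0 the prior is returned unchanged; as beta grows, the probability mass concentrates on the nearest representative among those not precluded (Q > 0) by the prior.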
A probabilistic generalization of this hard boundary (justifiable with the

¹A special case where {Q[x ∈ C_j]} are interpreted as probabilistic class labels for the training data, arising from uncertainties in a supervised learning process, is related to the work in Buhmann and Kuhnel (1993) and Oehler and Gray (1993), where a supervising term was introduced within the clustering cost based on knowledge of class labels for the data. Although our approach could potentially make a contribution to this application, supervised learning is not addressed in this paper.
maximum entropy principle) is the Gibbs distribution

P_H[x ∈ S_0] = e^{−γ d(x, s_0)} / (e^{−γ d(x, s_0)} + e^{−γ d(x, s_1)})   (2.10)
where γ is a positive scale parameter controlling the fuzziness of the distribution, and where the cell S_0 is the set of all points classified to test vector s_0, that is, to node 0 in the internal layer. For s_0 ≠ s_1 and γ → ∞, equation 2.10 reduces to a hard hyperplane decision function. In a similar fashion, trees of larger depth can be probabilistically generalized via products of Gibbs distributions. For some internal node s with corresponding cell S we write the recursion formula:

P_H[x ∈ S] = P_H[x ∈ parent(S)] · e^{−γ d(x, s)} / Σ_{s̃ ∈ siblings(s)} e^{−γ d(x, s̃)}   (2.11)
which applies the Gibbs distribution to divide the probability of classification to the parent node among the siblings. It is easy to see that the corresponding closed-form expression is a product of Gibbs distributions, which, at the limit γ → ∞, specifies a sequence of nested decisions for a hard decision tree. One strategy for imposing the structural constraint within a probabilistic clustering framework is to view {P_H[·]} as a prior that influences the formation of leaf representative/data association probabilities. Denote the distribution at the leaves {P_L[x ∈ C_j]}, where partition region C_j refers to leaf j represented by code vector y_j. The leaf associations should be chosen to "agree" with the prior to the parent layer while satisfying an average clustering distortion constraint. Accordingly, as in the previous section, we pose
min Σ_j P_L[x ∈ C_j] log ( P_L[x ∈ C_j] / ((1/K) P_H[x ∈ parent(C_j)]) )   (2.12)

subject to

⟨E_x⟩ = Σ_j P_L[x ∈ C_j] d(x, y_j).   (2.13)

Here, the prior is equally split between its K children nodes at the leaves, a choice justified by the principle of maximum entropy. The solution is the tilted distribution

P_L[x ∈ C_j] = P_H[x ∈ parent(C_j)] e^{−β d(x, y_j)} / Σ_k P_H[x ∈ parent(C_k)] e^{−β d(x, y_k)}   (2.14)

(the factors of 1/K cancel in the normalization)
and the corresponding free energy is

F_T = −(1/β) Σ_x log Σ_j P_H[x ∈ parent(C_j)] e^{−β d(x, y_j)}   (2.15)
The parameterization of {P_H[·]} guarantees that as γ → ∞ and β → ∞, {P_L[·]} determines a tree-structured partition of the feature space. Moreover, at these limits F_T reduces to the tree-structured clustering distortion. The free energy is considered both in our own previous work (Miller and Rose 1994b) and in Jordan and Jacobs (1994). There, it was proposed to maximize the log likelihood (negative of the free energy) with respect to all model parameters. For tree-structured clustering, this would mean optimizing F_T with respect to {s_j^l}, γ, and {y_j}. While this approach is appealing because of its general applicability to classification and regression and because of its connection with the EM algorithm, it is not consistent with an annealing approach, wherein β must control the average distortion and data "scale" of the solution. In particular, note that this strategy allows the necessary optimality condition for a test vector to be satisfied at any temperature β by choosing the parameters so as to make the probabilities {P_H[·]} "hard."² Hardening of {P_H[·]} imposes severe restrictions on {P_L[·]} and hence on the extent to which β controls the data scale of the solution. Alternatively, the method we propose retains β as a computational temperature, controlling the degree of "fuzziness" in the solution and, as will be discussed, the effective model size. Whereas the conventional splitting algorithm treats all nodes as if they were leaves by placing nonleaf (test) vectors as well as the leaf (code) vectors at region centroids, we view the code vectors {y_j} and test vectors {s_j^l} as complementary sets of variables. The role of the leaves is to minimize the clustering distortion by placing code vectors {y_j} at region centroids, while the role of the internal nodes (hierarchy) is to classify to the leaf layer so as to minimize clustering distortion at the leaves; in general this will not be achieved by placing test vectors {s_j^l} at region centroids.
Clearly, the leaf and nonleaf objectives are intertwined, but we "decouple" them in the optimization by alternating between the two complementary subproblems, namely, optimize the leaf nodes given a fixed hierarchy, and optimize the internal nodes given the fixed leaf layer.
²Consider the two-layer binary tree. A necessary condition for optimizing F_T with respect to s_0 is

∂F_T/∂s_0 = 0,

which is satisfied when {P_H[·]} becomes "hard," i.e., {P_H[·]} → {0,1}, due to the dependence of P_L on P_H as given in equation 2.14. The hardening can be achieved by sending the test vectors or γ to infinity. This undesired phenomenon has been confirmed experimentally.
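The recursion of equation 2.11 can be sketched for a balanced binary tree with squared distance. The `layers` layout is a hypothetical convention of our own: `layers[l]` holds the 2^(l+1) test vectors of layer l+1, ordered so that nodes 2j and 2j+1 are the children of node j.

```python
import numpy as np

def hierarchy_probs(x, layers, gamma):
    """P_H[x in S] for every node of a balanced binary tree via the
    recursion of eq. 2.11: a Gibbs factor over each sibling pair splits
    the parent's probability among its children."""
    probs = []
    parent = np.ones(1)                 # probability at the (virtual) root is 1
    for nodes in layers:
        d = ((np.asarray(nodes) - x) ** 2).sum(axis=1).reshape(-1, 2)
        w = np.exp(-gamma * (d - d.min(axis=1, keepdims=True)))  # per-pair shift
        cond = w / w.sum(axis=1, keepdims=True)  # Gibbs factor within each pair
        parent = (cond * parent[:, None]).reshape(-1)
        probs.append(parent)
    return probs
```

The probabilities within each layer sum to one; at γ = 0 every layer is uniform, and as γ → ∞ the mass concentrates on the single root-to-leaf path of the hard decision tree.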
2.2.1 Optimization of the Leaves. The dependence of the leaves on the hierarchy is naturally built into F_T through the prior {P_H[·]}. Thus, given {P_H[·]}, our method directly minimizes the free energy with respect to {y_j} at each β. A necessary optimality condition is

Σ_x P_L[x ∈ C_j] (∂/∂y_j) d(x, y_j) = 0,  ∀j.   (2.17)

The minimization can be implemented via gradient descent or, for the squared distance measure, through fixed point iterations (the centroid rule).
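For squared distance, the centroid-rule fixed point places each code vector at the P_L-weighted mean of the data. A minimal sketch (assuming the leaf associations {P_L[·]} of equation 2.14 are supplied as a matrix; the function name is our own):

```python
import numpy as np

def update_leaves(X, P_L, Y):
    """One fixed-point pass of the centroid rule under squared distance:
    y_j moves to the P_L-weighted mean of the data.  P_L is an (M, J)
    matrix of leaf association probabilities; leaves owning zero
    probability mass keep their old position."""
    mass = P_L.sum(axis=0)              # total probability "owned" by each leaf
    Y_new = Y.copy()
    nz = mass > 0
    Y_new[nz] = (P_L.T @ X)[nz] / mass[nz, None]
    return Y_new
```

With hard (one-hot) associations this is exactly the K-means centroid update; with uniform associations all leaves coincide at the global centroid, consistent with the high-temperature behavior discussed in section 2.3.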
2.2.2 Optimization of the Hierarchy. In a corresponding fashion, the test vectors {s_j^l} and scale parameter γ should be chosen to produce a distribution for the hierarchy that "agrees with" a (given) distribution over the leaves. Moreover, this optimization should retain β as a computational temperature for the method. Accordingly, we introduce a new paradigm into the problem that essentially states: approximate the optimal (unstructured) clustering solution while imposing the structural constraint. Given fixed (leaf) code vectors, the optimal hard partition is the (unstructured) nearest-neighbor partition induced by the leaves. Within the probabilistic framework, given the "temperature" β, the corresponding ideal distribution is the maximum entropy distribution

P_I[x ∈ C_j] = e^{−β d(x, y_j)} / Σ_k e^{−β d(x, y_k)}   (2.18)

The ideal probability of classification to any internal node S is thus given by

P_I[x ∈ S] = Σ_{j: C_j ∈ descendants(S)} P_I[x ∈ C_j]   (2.19)
The hierarchical parameters {s_j^l} and γ should be chosen so that the distribution to the parent layer of the leaves, {P_H[x ∈ S_j^{L−1}]}, "agrees with" the ideal distribution {P_I[x ∈ S_j^{L−1}]} as nearly as possible. To achieve this objective, we again appeal to cross entropy as a measure of distance between distributions³ and pose

min_{γ, {s_j^l}} D( {P_I[x ∈ S_j^{L−1}]} || {P_H[x ∈ S_j^{L−1}]} )   (2.20)
³D(·||·) is an asymmetric measure of distance between distributions; thus some justification of the choice D(P_I||P_H) rather than D(P_H||P_I) is due. We view P_I as the ideal (i.e., the "true") distribution to be approximated by the model-constrained distribution P_H. Thus we average the "log likelihood ratio" with respect to P_I: D = Σ P_I [log P_I − log P_H]; see Cover and Thomas (1991) for more details on this interpretation of D(·||·). This choice also leads to a simpler result in our case.
For P_H[·] recursively defined in equation 2.11, one can easily show that equation 2.20 is equivalent to a minimization problem involving a sum of cross entropies over each nonleaf layer, i.e.,

min_{γ, {s_j^l}} Σ_{l=1}^{L−1} D( {P_I[x ∈ S_j^l]} || {P_H[x ∈ S_j^l]} )   (2.21)

This problem can be solved by a gradient descent technique, using the partial derivatives with respect to any test vector s (with partition cell S) and the scale parameter γ, given in equations 2.22 and 2.23, respectively.
Note that equation 2.22 has the same form as the gradient of the free energy (see footnote 2), but replaces {P_L[·]} with {P_I[·]}. This simple rule can be interpreted as a probabilistic, batch version of the Perceptron learning rule, optimizing the test vectors of the hierarchy so that {P_H[·]} approximates {P_I[·]} (Miller and Rose 1994b). Effectively, we have introduced a supervised learning paradigm within an unsupervised problem setting. In a similar fashion, γ is modified to match the average distance [variance for d(·,·) the squared distance] based on {P_H[·]} with that of {P_I[·]}. Similar gradient rules can also be specified for matrix parameters if Mahalanobis distance is used for classification in nonleaf layers. While the Perceptron does not converge for nonseparable classes, the cross-entropy minimization should converge so long as β is finite.
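For the special case of a single sibling pair with squared distance, the gradient of the cross entropy with respect to a test vector can be derived in closed form. This is an illustrative derivation of our own, not a reproduction of the paper's general equation 2.22: with P_H[x ∈ S_0] the Gibbs probability of equation 2.10, one can show ∂D/∂s_0 = −2γ Σ_x (P_I[x ∈ S_0] − P_H[x ∈ S_0])(x − s_0), which exhibits the Perceptron-like "mismatch times input" form noted in the text.

```python
import numpy as np

def cross_entropy_grad_s0(X, s0, s1, P_I0, gamma):
    """Gradient of D(P_I || P_H) w.r.t. test vector s0 for one sibling
    pair under squared distance.  P_I0 holds the target probabilities
    P_I[x in S0]; P_H0 is the Gibbs probability of eq. 2.10, written
    as a logistic in the distance difference."""
    d0 = ((X - s0) ** 2).sum(axis=1)
    d1 = ((X - s1) ** 2).sum(axis=1)
    P_H0 = 1.0 / (1.0 + np.exp(-gamma * (d1 - d0)))
    # mismatch (P_I0 - P_H0) weights each datum's pull on s0
    return -2.0 * gamma * ((P_I0 - P_H0)[:, None] * (X - s0)).sum(axis=0)
```

The gradient vanishes exactly when {P_H[·]} matches {P_I[·]} on every datum, which is the agreement condition the hierarchy optimization seeks.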
2.2.3 Algorithm Summary. Our annealing approach optimizes the parameters of a tree of fixed structure: balanced, with specified depth and branching factor. The method involves alternating minimizations with respect to the leaves and the hierarchy at a sequence of increasing β, starting from small β. At each β, the optimization consists of four iterated steps: (1) given a fixed {P_H[·]}, choose {y_j} to minimize F_T from equation 2.15; (2) compute the ideal distributions {P_I[·]} based on the new leaves using equation 2.19; (3) optimize the hierarchical parameters γ and {s_j^l} to agree with {P_I[·]} in the cross-entropy sense of equation 2.21; and (4) compute {P_H[·]} using equation 2.11. These steps are iterated until a convergence condition is satisfied. Then β is increased. As the "temperature" decreases, the distributions begin to "harden," and upon termination, our method specifies a hard, tree-structured solution through {s_j^l} and {y_j}. The algorithm is listed in pseudo-code in Table 1. Several algorithm steps, including initialization and the termination condition, will be explained in the next section. While this study does not
address theoretical issues of convergence, in practice, based on numerous simulations, we have found that convergence is always achieved at a given temperature.

2.3 Growing by Phase Transitions. Even though the approach we have described optimizes parameters of a tree of specified size and structure, this choice does not restrict the effective tree size and structure produced at a given β. At high temperature (β = 0), there is a unique minimum of F_T, with all cluster representatives at the global centroid of the data set. The corresponding distributions {P_I[·]} are uniform, and thus the optimal {P_H[·]} that minimizes the cross-entropy is also uniform, achieved by choosing all the test vectors to be nondistinct. Thus, at small β the tree effectively collapses to a single node, justifying the initialization of all node vectors to the global centroid (see Table 1). The global centroid is a solution for any β. However, as β is increased, for some critical value this solution changes from a minimum to a local maximum or a saddle point of F_T. Essentially, the increased emphasis given to minimizing clustering distortion for increasing β prompts splits of the representatives and growth in the effective model size. To break the symmetry of the global centroid solution, small perturbation vectors w are added to the representatives to promote splits (see Table 1). At special, critical values of β, perturbations will grow into actual splits, while at all other β, nondistinct representatives that have been pushed apart by perturbations will return to their nondistinct state. These splits at critical β can be interpreted via the physical analogy as phase transitions (bifurcations) in the annealing process. Conditions for bifurcation have been derived for the unstructured DA method (Rose et al. 1992; Rose 1994), as well as for the elastic net method for the traveling salesman problem (Durbin et al. 1989).
In the Appendix, a condition is derived for splits in the tree-structured annealing process. The first bifurcation is initiated along the principal data axis and at a value of β inversely proportional to the "spread" (the maximum eigenvalue) along this direction (see the Appendix). The initial β for our algorithm in Table 1 is thus chosen to be smaller than this critical value. Subsequent bifurcations occur in a similar fashion, dependent on the data "owned" probabilistically by the representatives undergoing the split. Thus, the annealing process generates a sequence of solutions of increasing effective size and finer scale. As β → ∞, the free energy cost becomes the clustering distortion, which can always be decreased by increasing the model size, and so in the limit of low temperature the amount of splitting is limited only by the number of representatives assumed by the system. This choice may depend upon practical requirements (e.g., model complexity issues, bandwidth for data compression applications) or on cluster validity measures. While our method does not yield insights into the correct model size for the hard clustering problem, for any finite β (and hence a probabilistic solution), there is a "correct" model size:
Table 1: Pseudo-Code for the Tree-Structured DA Method^a

Given: a data set X = {x} of size M, the balanced tree depth L and branching factor K, a target leaf size N_target, an annealing parameter α, and a threshold ε.

Calculate the data global centroid μ = (1/M) Σ_x x and the principal eigenvalue λ_max of the matrix R = Σ_x (x − μ)(x − μ)^T.
Set s_j^l ← μ, for j = 0,...,K^l − 1, l = 1,...,L.
Set y_j ← μ, for j = 0,...,K^L − 1.
Set β ← β_min and γ ← β_min, where β_min < 1/(2λ_max).
Initialize {P_H[·]} to a uniform distribution.
do {
    do {
        Minimize the free energy F_T of equation 2.15 to obtain new leaves:
            {y_j} ← argmin_{y_j} F_T.
        Compute P_I[x ∈ S_j^l], ∀l, j, using equations 2.18 and 2.19.
        Minimize the cross-entropy of equation 2.21 to obtain a new hierarchy:
            (γ, {s_j^l}) ← argmin_{γ, {s_j^l}} Σ_l D({P_I[x ∈ S_j^l]} || {P_H[x ∈ S_j^l]}).
        Compute P_H[x ∈ S_j^l], ∀l, j, using the recursion of equation 2.11.
    } while ΔF_T/F_T > ε
    β ← (1 + α)β.
    Calculate the effective number of leaves (number of distinct elements in {y_j}): N = |{y_j}|.
    Calculate the entropy H = −(1/M) Σ_x Σ_j P_L[x ∈ C_j] log P_L[x ∈ C_j].
    if (N < N_target): perturb the leaf layer: y_j ← y_j + w_j, ∀j, where w_j is a small perturbation.
} while (N < N_target) or (H > ε)

^a This algorithm optimizes a tree-structured model of fixed size, but the effective tree size grows by bifurcations during the annealing process. In our simulations we used α = 0.05 and ε = 0.0001. The algorithm is not sensitive to variations in these values so long as they are sufficiently small. Issues of how to select these values so as to optimize the tradeoffs between performance and computation remain open.
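The initialization quantities of Table 1 can be sketched as follows. This is an illustrative helper of our own (the name `initial_beta` and the `margin` factor are assumptions), and it assumes the squared-distance case, in which the first critical value is 1/(2λ_max) per the bifurcation condition discussed in section 2.3.

```python
import numpy as np

def initial_beta(X, margin=0.5):
    """Global centroid mu, principal eigenvalue lambda_max of
    R = sum_x (x - mu)(x - mu)^T, and a starting inverse temperature
    strictly below the first critical value 1/(2*lambda_max)
    (squared-distance case assumed)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    R = Xc.T @ Xc                       # scatter matrix (sum, not average)
    lam_max = np.linalg.eigvalsh(R).max()
    return mu, lam_max, margin / (2.0 * lam_max)
```

Starting below the first critical β guarantees the annealing begins in the single-node phase, so every later split is a genuine bifurcation rather than an artifact of the starting temperature.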
that associated with the solution that achieves the global minimum of the free energy. As long as the tree "thrown into" the optimization is sufficiently large, our method can in principle estimate the optimal tree at each β, with the effective tree size simply the number of distinct representatives in the solution.⁴ In practice we cannot claim that the optimal model will be found, as our optimization method is not guaranteed to find the global minimum. Still, the effective tree size does grow by bifurcations at critical temperatures in the solution process, and thus our method does provide an estimate of the tree size and structure at each β. By disallowing splits of parameters (by not introducing perturbations) once a specified effective tree size (N_target leaves) has been reached, our algorithm can be used to search for the optimal tree-structured solution (of a priori unknown structure) with a given number of representatives. The algorithm of Table 1 determines the number of distinct representatives at each temperature and terminates when the target tree size has been reached and when the solution is sufficiently "hard" (i.e., when the entropy at the leaves is very low). In Figure 2, we present an example showing an "evolution" of growth in the tree model for increasing β. The data set is a gaussian mixture with eight components. We optimized a balanced binary tree of depth four (sixteen leaves) and annealed starting from β below the initial critical value. The figures show the converged solutions after critical β of the annealing process have been reached. Here, the effective model size grows from two leaves in Figure 2a to eight leaves in Figure 2g. Note that the partition regions "separating" the leaf representatives are probabilistic; the lines are equiprobable contours with membership probability of p = 0.33, except for Figures 2a and 2g, for which p = 0.5. Figure 2a shows the solution after the initial split from the global centroid.
The solution, with 16 leaves, has two distinct leaf representatives, with the eight left subtree leaf vectors all at one location and the eight right subtree leaf vectors all at another location. The associated tree structure is shown to the right of the figure. In Figure 2b, the left subtree has undergone bifurcation, leading to a tree structure with three leaves. The subsequent figures and associated tree diagrams show how the effective tree grows for increasing β. The hard clustering solution of Figure 2g was obtained by fixing the model size (disallowing bifurcations) when the target size of eight leaves was achieved and then annealing to low temperature. The tree diagrams emphasize the fundamental differ-

⁴A dependence of the clustering solution on the multiplicity of overlapping representatives within a distinct cluster was noted in Rose et al. (1992) and also referred to as cluster degeneracy in Buhmann and Kuhnel (1994). In Rose et al. (1993) this weakness was eliminated by introduction of mass variables. However, for the binary-tree constrained design this addition is unnecessary since the tree structure limits the types of bifurcations that can occur. Nevertheless, this problem theoretically exists for nonbinary trees, although in practical experiments we have not encountered it. Adopting a "mass-constrained" modification for the tree-structured problem is beyond the scope of this paper.
Unsupervised Learning
Figure 2a-c: A hierarchy of tree-structured solutions generated by the annealing method for increasing β. The source is a gaussian mixture with eight components. The figures show the converged solutions at distinct β at which the effective tree size has grown by bifurcation. The computed code vectors, denoted by "0"s, are barely visible. To the right of each figure is the associated effective tree structure. The lines in the figure are equiprobable contours with membership probability of p = 0.33 in a given partition region, except for a and g, for which p = 0.5. "H" denotes the highest level decision boundary in the "hard" solution of g.
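The bifurcation behavior visible in Figure 2 is easiest to see in the unstructured deterministic annealing setting (Rose et al. 1990) that our method builds on. The sketch below is a simplified, unstructured version (all names are ours; the tree constraint and cross-entropy machinery are omitted): representatives start merged at the global centroid, and small perturbations at each temperature stage let them separate once β exceeds the relevant critical value, so the effective model size (the number of distinct representatives) grows by bifurcations.

```python
import numpy as np

def da_cluster(X, n_reps=8, beta=0.05, growth=1.3, n_stages=50, seed=0):
    """Simplified unstructured deterministic annealing (in the spirit of
    Rose et al. 1990), without the tree constraint or cross-entropy terms.
    All representatives start merged at the global centroid; tiny random
    perturbations at each stage let them bifurcate as beta is raised."""
    rng = np.random.default_rng(seed)
    Y = np.tile(X.mean(axis=0), (n_reps, 1))
    for _ in range(n_stages):
        Y = Y + 1e-3 * rng.standard_normal(Y.shape)  # probe for bifurcations
        for _ in range(200):                         # fixed point at this beta
            d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
            P = np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))
            P /= P.sum(axis=1, keepdims=True)        # Gibbs association probs
            mass = P.sum(axis=0)
            Y_new = (P.T @ X + 1e-12 * Y) / (mass[:, None] + 1e-12)
            done = np.abs(Y_new - Y).max() < 1e-6
            Y = Y_new
            if done:
                break
        beta *= growth                               # lower the temperature
    # Effective model size: the number of distinct representatives.
    eff = len({tuple(row) for row in np.round(Y, 1)})
    return Y, eff
```

On a well-separated mixture the returned effective size reflects how many clusters have "opened up" at the final temperature; the centroid update and Gibbs association probabilities play the roles of equations A.1 and the association model of the paper, without the tree structure.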
Figure 2d-g: Continued.
Figure 3: An example involving a mixture of four isotropic gaussian components and the two layer binary tree solution found via (a) splitting, with E = 0.9, and (b) tree-structured DA, with E = 0.5. In each figure "X" denotes a true mixture component center, "0" denotes a cluster representative found by the method, and "H" denotes the highest level decision boundary of the solution.

ence between the "hierarchy" of tree-structured solutions generated by our method and the "hierarchy" of unstructured solutions generated by the original DA method (Rose et al. 1992): as the tree-structured clustering model grows, so does a corresponding decision tree structure. For the "hard" clustering solution, the decision tree specifies an efficient classification search, which is of practical importance in vector quantization.
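For the hard solution this classification search is a plain decision-tree descent: at each level only the children of the current node are compared against the input vector. A minimal sketch (class and function names are ours, not from the paper):

```python
import numpy as np

class Node:
    """A decision-tree node: 'vector' is the node's test vector (or, at a
    leaf, its representative); 'children' is empty for leaves."""
    def __init__(self, vector, children=()):
        self.vector = np.asarray(vector, dtype=float)
        self.children = list(children)

def tree_classify(x, root):
    """Walk down the tree, at each level selecting the child whose test
    vector is nearest to x; return the leaf representative reached."""
    node = root
    while node.children:
        node = min(node.children,
                   key=lambda c: float(((x - c.vector) ** 2).sum()))
    return node.vector
```

The search cost is one nearest-test-vector comparison per level (logarithmic in the number of leaves for a balanced tree) instead of a full search over all leaf representatives.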
3 Results
Here we compare the performance of our method with both tree-structured and unstructured clustering methods in both the pattern recognition and data compression contexts. For pattern recognition some simple illustrative examples are used to demonstrate in a fundamental way the im-
Figure 4: An example involving a mixture of eight gaussian components. (a) The splitting solution with E = 0.73. (b) The tree-structured DA solution with E = 0.50.
Figure 5: An example involving a mixture of four isotropic gaussian components and the two layer binary tree solutions. (a) The solution found by tree-structured DA using squared distance for the hierarchical layer, with E = 0.43. (b) The tree-structured DA solution using Mahalanobis distance for the hierarchical layer, with E = 0.33. Note that in this case the highest level decision boundary is a hyperbola.
Figure 6: A gaussian mixture example with sixteen components. (a) The best basic Isodata solution, based on 30 random initializations within the data set, with E = 0.51. (b) A typical basic Isodata solution, with E = 0.72. (c) The unbalanced tree-structured DA solution with maximal depth of five and E = 0.49.

provements achievable by our method. In Figures 3-6 the data were generated by randomly sampling from 2D gaussian mixture distributions with isotropic components. In all the figures, "X"s are used to denote mixture component centers and "0"s to denote computed cluster representatives. Moreover, "H" denotes the highest level decision boundary of the solution. The cost function is the sum of squared distances. For the DA method, the annealing parameter was chosen to be 0.05 and the perturbation ε was set to 10^-4.
The example of Figure 3 shows that even for one of the simplest possible trees (two layer, binary) and data sets (clusters along a line) the splitting algorithm fails to adequately discriminate mixture components. Here in Figure 3a, placing test vectors at region centroids leads to the suboptimal boundary ("H"), which separates three clusters from one in the first layer. By contrast, our approach places the test vectors so as to achieve a (visually apparent) optimal solution separating all mixture components, with the quality also reflected in a much smaller sum of squared distances cost. Similar performance gains are achieved by our approach in Figure 4. For the example of Figure 5, Mahalanobis distance is used in the nonleaf layer to improve the clustering result; it demonstrates that optimal structured solutions need not have convex decision boundaries. In Figure 6, our method is compared with the unstructured basic Isodata algorithm. For this example, despite its structural handicap, our approach achieves a better solution than the best result of basic Isodata based on numerous initializations (30) within the data. While these examples are all fairly simple to aid visual assessment, we have performed extensive testing of our method on a variety of data sources (with data dimensions ranging from 2 to 8) and have found it to be successful in separating complex data distributions with numerous, overlapping mixture components. We have found that our method always outperforms greedy tree-structured methods, including those that use optimal pruning, and very often outperforms unstructured methods such as basic Isodata (which tends to get trapped in nonglobal optima) as well, while reducing the search complexity of the resulting classifier. We have also tested our method in comparison with splitting (Buzo et al. 1980) and with the generalized Lloyd algorithm (GLA) (Linde et al. 1980) for vector quantization of Gauss-Markov sources.
For this problem our approach bridges a significant portion of the performance gap between splitting for tree-structured design and GLA for unstructured design. Some performance results are shown in Table 2 for the case of four-dimensional data vectors.
4 Conclusions
In this paper, we have extended the deterministic annealing approach to address the problem of structurally constrained clustering. Whereas the original approach was developed using the principle of maximum entropy, the new method is based on minimum cross-entropy inference, which is a convenient formalism for expressing the joint objectives of enforcing a tree-structured solution and approximating the optimal (unstructured) solution. The annealing approach is useful in two important respects. First, it is helpful for avoiding local optima of the cost, allowing the method to achieve substantial performance gains over other tree-structured methods. Second, and most importantly, annealing leads
Table 2: Vector Quantization Performance for the Unconstrained Generalized Lloyd Algorithm and Tree-Structured DA on 4D Gaussian and First-Order Gauss-Markov Sources with Correlation Coefficient ρ.^a

ρ      R      GLA    Tree-DA
0.0    1.0    0.26   0.09
0.0    1.25   0.83   0.29
0.0    1.5    0.94   0.40
0.5    1.0    0.41   0.18
0.5    1.25   0.68   0.40
0.5    1.5    0.99   0.46

^a The performance measure is gain (in dB) over the standard splitting algorithm for TSVQ design. Note that a significant portion of the performance gap between splitting for TSVQ design and GLA for unstructured VQ design is recouped by the tree-structured DA method.
to phase transitions in the process and consequent growth in the model size and tree structure. Here, we emphasized that tree growth in our approach is a natural consequence of the optimization of the free energy cost and allows automatic model order and tree structure estimation at each temperature scale. One outgrowth of our method is a probabilistic, batch generalization of the Perceptron algorithm and its connection with minimizing cross-entropy. Another is a basic modification of the clustering problem to incorporate a prior. One clear direction that is beyond the scope of this work is to directly address the more general class of supervised learning problems that includes (nonlinear) piecewise regression and statistical classifier design to minimize probability of error. The tree-structured DA method does not directly address these problems, since clustering is just a special case of regression. However, our approach does successfully combine within an optimization framework important elements of these problems, including test vector variables whose primary role is classification with "representative" variables chosen to minimize the cost. Thus, our approach does provide some impetus for tackling these more general problems. We wish to draw the reader's attention to more recent work extending the DA method to address supervised learning problems including statistical classification (Miller et al. 1995).
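For a single cluster the critical temperature at which such a phase transition occurs follows from the condition derived in the appendix: β_c = 1/(2λ_max), with the split initiated along the principal eigenvector of the cluster covariance. A minimal numerical sketch (function and variable names are ours; uniform data weights assumed):

```python
import numpy as np

def critical_beta(X, p=None):
    """Critical inverse temperature beta_c = 1 / (2 * lambda_max) at which
    a cluster first splits; lambda_max is the largest eigenvalue of the
    covariance matrix of the partition cell, and the split starts along
    the corresponding principal eigenvector."""
    X = np.asarray(X, dtype=float)
    if p is None:
        p = np.full(len(X), 1.0 / len(X))   # uniform weights p(x | y)
    y = p @ X                               # cluster centroid
    D = X - y
    C = (p[:, None] * D).T @ D              # covariance of the partition cell
    evals, evecs = np.linalg.eigh(C)        # eigenvalues in ascending order
    return 1.0 / (2.0 * evals[-1]), evecs[:, -1]
```

For data with unit variance along one axis and zero along the other, this gives β_c = 0.5 and a split direction along the high-variance axis, matching the appendix condition.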
Appendix A: Necessary Condition for Bifurcation

Here, we derive the general condition for bifurcation in our tree-structured process, assuming squared distance measures the cost at the leaves. To simplify the sequel, we define the following sets: the representatives Y = {y_j}, the perturbations T = {ω_j}, the perturbed representatives Y_p = {y_j + εω_j}, and the data X = {x}. Here x, y_j, and ω_j are elements of the same real vector space and ε is a scalar. From the theory of the calculus of variations, necessary conditions satisfied at a minimum of the free energy F(Y_p) are

(d/dε) F(Y_p) |_{ε=0} = 0    (A.1)

and

(d²/dε²) F(Y_p) |_{ε=0} ≥ 0    (A.2)

Both of these conditions must be satisfied for all permissible perturbations T. The critical β_c at which a phase transition is initiated is a temperature that marks the transition between a solution with positive second derivative for all perturbations T and a solution with zero second derivative for some perturbation. To examine the conditions more closely, we first write out explicitly

(d/dε) F(Y_p) |_{ε=0} = −2 Σ_j Σ_x p(x, y_j) ω_j^T (x − y_j)    (A.3)

and

(d²/dε²) F(Y_p) |_{ε=0} = 2 Σ_j Σ_x p(x, y_j) ω_j^T [I − 2βC_j] ω_j + 4β Σ_x p(x) [ Σ_j p(y_j | x) ω_j^T (x − y_j) ]²    (A.4)

where we have identified the covariance matrix associated with the partition cell of y_j:

C_j = Σ_x p(x | y_j) (x − y_j)(x − y_j)^T    (A.5)
Now we wish to show that equation A.4 is positive for all perturbations if and only if the matrices {[I − 2βC_j]} are positive definite. The "if" part is trivial since the second term in equation A.4 is always nonnegative. To show the "only if" part, suppose that for a pair of nondistinct leaf code vectors satisfying y_l = y_m (and also C_l = C_m), the corresponding matrix [I − 2βC_l] = [I − 2βC_m] is not positive definite. It thus has a nonpositive
eigenvalue with corresponding eigenvector u. Choose the perturbation with ω_l = u, ω_m = −u, and ω_j = 0 for all j ≠ l, m. Clearly the first term in equation A.4 is nonpositive and the second term is zero. We have thus shown that bifurcation occurs when some of the matrices {[I − 2βC_j]} stop being positive definite. The critical β is, therefore, β_c = 1/(2λ_max), where λ_max is the largest eigenvalue of C_l. Moreover, the split is initiated along the axis of the principal eigenvector of C_l.

Acknowledgments
This work was supported in part by the National Science Foundation under Grant NCR-9314335, the University of California MICRO program, DSP Group, Inc., Echo Speech Corporation, Moseley Associates, Rockwell International Corporation, Speech Technology Labs, Texas Instruments, Inc., and Qualcomm, Inc.

References

Ball, G., and Hall, D. 1967. A clustering technique for summarizing multivariate data. Behav. Sci. 12, 153-155.
Bezdek, J. C. 1980. A convergence theorem for the fuzzy ISODATA clustering algorithms. IEEE Trans. Patt. Anal. Mach. Intell. PAMI-2, 1-8.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. 1984. Classification and Regression Trees. The Wadsworth Statistics/Probability Series, Belmont, CA.
Buhmann, J., and Kuhnel, H. 1993. Complexity optimized data clustering by competitive neural networks. Neural Comp. 5, 75-88.
Buhmann, J., and Kuhnel, H. 1994. Vector quantization with complexity costs. IEEE Trans. Inform. Theory 39, 1133-1145.
Buzo, A., Gray, Jr., A. H., Gray, R. M., and Markel, J. D. 1980. Speech coding based on vector quantization. IEEE Trans. Acoust., Speech, Sig. Proc. 28, 562-574.
Chou, P., Lookabaugh, T., and Gray, R. M. 1989. Optimal pruning with applications to tree-structured source coding and modeling. IEEE Trans. Inform. Theory 35, 299-315.
Cover, T. M., and Thomas, J. A. 1991. Elements of Information Theory. John Wiley, New York.
Dempster, A., Laird, N., and Rubin, D. 1977. Maximum-likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc., Ser. B 39, 1-38.
Duda, R. O., and Hart, P. E. 1974. Pattern Classification and Scene Analysis. Wiley-Interscience, New York.
Dunn, J. C. 1974. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J. Cybern. 3, 32-57.
Durbin, R., and Willshaw, D. 1987. An analogue approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689-691.
Durbin, R., Szeliski, R., and Yuille, A. 1989. An analysis of the elastic net approach to the travelling salesman problem. Neural Comp. 1, 348-358.
Farrell, K. R., Mammone, R. J., and Assaleh, K. T. 1994. Speaker recognition using neural networks and conventional classifiers. IEEE Trans. Speech Audio Proc. 2, 194-205.
Geiger, D., and Girosi, F. 1991. Parallel and deterministic algorithms from MRFs: Surface reconstruction. IEEE Trans. Patt. Anal. Mach. Intell. 13, 401-412.
Geman, S., and Geman, D. 1984. Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Trans. Patt. Anal. Mach. Intell. PAMI-6, 721-741.
Gersho, A., and Gray, R. M. 1992. Vector Quantization and Signal Compression. Kluwer Academic Publishers, Boston, MA.
Hartigan, J. A. 1975. Clustering Algorithms. John Wiley, New York.
Jain, A. K., and Dubes, R. C. 1988. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ.
Jaynes, E. T. 1989. Information theory and statistical mechanics. In Papers on Probability, Statistics and Statistical Physics, R. D. Rosenkrantz, ed. Kluwer Academic Publishers, Dordrecht, The Netherlands. (Reprint of the original 1957 papers in Phys. Rev.)
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Comp. 6, 181-214.
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220, 671-680.
Kullback, S. 1969. Information Theory and Statistics. Dover, New York.
Linde, Y., Buzo, A., and Gray, R. M. 1980. An algorithm for vector quantizer design. IEEE Trans. Commun. COM-28, 84-95.
Lippmann, R. P. 1989. Pattern classification using neural networks. IEEE Commun. Mag. 47-64.
Lloyd, S. P. 1982. Least squares quantization in PCM. IEEE Trans. Inform. Theory IT-28, 129-137. (Reprint of the 1957 paper.)
MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. Proc. Fifth Berkeley Symp. Math. Stat. Prob. 1, 281-297.
Miller, D., and Rose, K. 1994a. Combined source-channel vector quantization using deterministic annealing. IEEE Trans. Commun. 42, 347-356.
Miller, D., and Rose, K. 1994b. A non-greedy approach to tree-structured clustering. Patt. Rec. Lett. 15, 683-690.
Miller, D., Rose, K., and Chou, P. A. 1994. Deterministic annealing for trellis quantizer and HMM design using Baum-Welch re-estimation. IEEE Int. Conf. Acoust. Speech Sig. Proc., Adelaide, Australia.
Miller, D., Rao, A., Rose, K., and Gersho, A. 1995. An information-theoretic learning algorithm for neural-network classification. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, eds., Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA.
Oehler, K., and Gray, R. M. 1993. Combining image classification and image compression using vector quantization. Proc. IEEE Data Comp. Conf. 2-11.
Riskin, E. A., and Gray, R. M. 1991. A greedy growing algorithm for the design of variable rate vector quantizers. IEEE Trans. Sig. Proc. 39, 2500-2507.
Rose, K. 1994. A mapping approach to rate-distortion computation and analysis. IEEE Trans. Inform. Theory 40, 1939-1952.
Rose, K., Gurewitz, E., and Fox, G. C. 1990. Statistical mechanics and phase transitions in clustering. Phys. Rev. Lett. 65, 945-948.
Rose, K., Gurewitz, E., and Fox, G. C. 1992. Vector quantization by deterministic annealing. IEEE Trans. Inform. Theory 38, 1249-1258.
Rose, K., Gurewitz, E., and Fox, G. C. 1993. Constrained clustering as an optimization method. IEEE Trans. Patt. Anal. Mach. Intell. 15, 785-794.
Shore, J. E., and Johnson, R. W. 1980. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Inform. Theory IT-26, 26-37.
Simic, P. D. 1990. Statistical mechanics as the underlying theory of elastic and neural optimization. Network 1, 89-103.
Sokal, R., and Sneath, P. 1963. Principles of Numerical Taxonomy. W. H. Freeman, San Francisco.
Yuille, A. L. 1990. Generalized deformable models, statistical physics, and matching problems. Neural Comp. 2, 1-24.
Yuille, A. L., Stolorz, P., and Utans, J. 1994. Statistical physics, mixtures of distributions, and the EM algorithm. Neural Comp. 6, 334-340.
Received September 30, 1994; accepted June 8, 1995.
Communicated by Nicol Schraudolph
The Interchangeability of Learning Rate and Gain in Backpropagation Neural Networks

Georg Thimm, Perry Moerland, and Emile Fiesler
IDIAP, CH-1920 Martigny, Switzerland

The backpropagation algorithm is widely used for training multilayer neural networks. In this publication the gain of its activation function(s) is investigated. Specifically, it is proven that changing the gain of the activation function is equivalent to changing the learning rate and the weights. This simplifies the backpropagation learning rule by eliminating one of its parameters. The theorem can be extended to hold for some well-known variations on the backpropagation algorithm, such as using a momentum term, flat spot elimination, or adaptive gain. Furthermore, it is successfully applied to compensate for the nonstandard gain of optical sigmoids for optical neural networks.

1 Introduction
When using the backpropagation algorithm¹ to train a multilayer neural network, one is free to choose parameters like the initial weight distribution, learning rate, activation function, network topology, and gain of the activation function. A common choice for the activation function φ of a neuron in a multilayer neural network is the logistic or sigmoid function:

φ(x) = γ / (1 + e^(−βx))    (1.1)
which has a range (0, γ). Alternative choices for φ are a hyperbolic tangent, γ tanh(βx), yielding output values in the range (−γ, γ), and a gaussian function γe^(−(βx)²) with range (0, γ]. The parameter β is called the gain, and γβ the steepness (slope) of the activation function.² The effect of changing the gain of an activation function is illustrated in Figure 1: the gain scales the activation function in the direction of the horizontal axis.

¹See, for example, Rumelhart et al. (1986).
²Note that gain and steepness are identical for activation functions with γ = 1 (Saxena and Fiesler 1995). The term temperature is sometimes used as a synonym for the reciprocal of the gain.
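The horizontal rescaling is just the substitution x → βx: a gain-β unit applied to x equals a gain-one unit applied to βx. A quick check with the logistic function of equation 1.1 (helper names are ours; γ = 1):

```python
import math

def logistic(x, gain=1.0, gamma=1.0):
    """Logistic activation of equation 1.1 with gain beta and amplitude gamma."""
    return gamma / (1.0 + math.exp(-gain * x))

# A gain-beta unit applied to x equals a gain-one unit applied to beta * x,
# so changing the gain only rescales the horizontal axis of Figure 1.
beta = 4.0
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert abs(logistic(x, gain=beta) - logistic(beta * x)) < 1e-12
```

The same substitution applies to the tanh and gaussian alternatives, which is what makes the gain a candidate for elimination in the first place.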
Neural Computation 8, 451-460 (1996)
© 1996 Massachusetts Institute of Technology
G. Thimm, P. Moerland, and E. Fiesler
Figure 1: A logistic and a gaussian function of gain one (solid lines) and their four times steeper counterparts (dotted lines).

This publication proves that a relationship between gain, learning rate, and weights in backpropagation neural networks exists. This is followed by the implications of this relationship for variations of the backpropagation algorithm. Finally, a direct application of the relationship to the implementation of neural networks with optical activation functions with a nonstandard gain is presented. Several other authors hypothesized about the existence of a relationship between the gain of the activation function and the weights (Codrington and Tenorio 1994; Wessels and Barnard 1992),³ or between the gain and the learning rate (Kruschke and Movellan 1991; Mundie and Massengill 1992; Zurada 1992; Brown et al. 1993; Brown and Harris 1994). Specifically, Zurada writes: "leads to the conclusion that using activation functions with large [gain] λ may yield results similar as in the case of large learning constant η. It thus seems advisable to keep λ at a standard value of 1, and to
control learning speed using solely the learning constant η."

2 The Relationship between the Gain of the Activation Function, the Learning Rate, and the Weights

The theorem below gives a precise relationship between gain, initial weights, and learning rate for two backpropagation neural networks with the same topology and where corresponding neurons have activation

³Wessels and Barnard (1992) use a weight initialization method that scales the initial weight range according to the gain of the activation function.
Backpropagation Neural Networks
Table 1: The Relationship between Activation Function, Gain, Weights, and Learning Rate.

                        Network M           Network N
Activation function     φ(x) = φ̄(βx)        φ̄
Gain                    β                   β̄ = 1
Learning rate           η                   η̄ = β²η
Weights                 w                   w̄ = βw
functions φ and φ̄ (indices are omitted where corresponding functions or variables have the same index):

φ(x) = φ̄(βx)    (2.1)

This means that corresponding neurons in the two networks have the same activation function, except that φ̄ (used in network N) has gain 1 and φ (used in network M) has gain β. A proof of Theorem 1 and a detailed description of the backpropagation algorithm can be found in the appendix.

Theorem 1. Two neural networks M and N of identical topology whose activation function φ, gain β, learning rate η, and weights w are related to each other as given in Table 1 are equivalent under the on-line backpropagation algorithm; that is, when presented the same pattern set in the same order, their outputs are identical.

An increase of the gain by a factor β can therefore be compensated for by dividing the initial weights by β and the learning rate by β².

3 Extensions and Applications of Theorem 1
Many variations of the standard backpropagation algorithm are in use. A list of common variations and their interdependence with Theorem 1 is presented here. The corresponding proofs are omitted, as they are analogous to the proof of Theorem 1.

Momentum (Rumelhart et al. 1986): Theorem 1 holds when both networks have identical momentum terms.

Batch or off-line learning: Theorem 1 holds without modification of the network parameters when off-line learning is used.

Flat spot elimination (Fahlman 1988): Theorem 1 holds if the constant added to the derivative in network N equals c/β, where c is the constant (0.1 in Fahlman's paper) used in network M.

Weight discretization with multiple thresholding of the real-valued weights (Fiesler et al. 1995) requires an adaptation of the threshold values for the weight discretization. If d and d̄ are the discretization functions applied on the weights, Theorem 1 holds if ∀x: βd(x) = d̄(βx).
Adaptive gain (Plaut et al. 1986; Bachmann 1990; Kruschke and Movellan 1991): A change of the gain from β to β + Δβ can be expressed as a change of the learning rate from β²η to (β + Δβ)²η and of the weights from βw to (β + Δβ)w, without changing the gain of the activation functions.

The use of steeper activation functions for decreasing the convergence time (Izui and Pentland 1990; Cho et al. 1991) is equivalent to using a higher learning rate and a bigger weight range according to Theorem 1.

Approaching hard limiting thresholds by increasing the gain of the activation functions (Corwin et al. 1994; Yu et al. 1994) is similar to multiplying the weights with a constant greater than one. In the final stage of the training process the activation functions can be replaced by a threshold if this does not cause a degradation in performance.

A major problem in using optical activation functions is their nonstandard gain (Saxena and Fiesler 1995). In Figure 2, an optical activation function and its approximation by a shifted sigmoid, with a gain of approximately 1/24, are depicted. Note that the domain of the optical activation function is restricted to positive values due to constraints imposed by the optical implementation. Using this optical activation function in a backpropagation algorithm with a normal learning rate, say 0.3, and a normal initial weight interval, say [−0.5, 0.5], leads to very slow convergence. Theorem 1 explains why: this choice of parameters corresponds to having an activation function of gain one and a small learning rate of (1/24)² × 0.3 ≈ 0.00052. Theorem 1 gives a way to overcome this problem: choose a learning rate of 24² × 0.3 = 172.8 and an initial weight interval of [−(24 × 0.5), 24 × 0.5]. The neural network using these adapted parameters performed well.

4 Conclusions
The gain of the activation function and the learning rate in backpropagation neural networks are exchangeable. More precisely, there exists a well-defined relationship between the gain, the learning rate, and the initial weights. This relationship is presented as a theorem that is accompanied by a detailed, yet easy to understand, proof. The theorem also holds for several variations of the backpropagation algorithm, like the use of momentum terms, adaptive gain algorithms, and Fiesler's weight discretization method. A direct application area of the theorem is analog neural network hardware implementations, where the possible activation functions are limited by the available components. One can compensate for their nonstandard gain by modifying the learning rate and initial weights according to the theorem to optimize the performance of the neural network.
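The equivalence stated in Theorem 1, and the optical compensation of Section 3, can be checked numerically. The sketch below is our own minimal implementation (one hidden layer, logistic activations with γ = 1, biases omitted): network M uses the gain-1/24 sigmoid with the compensated weight interval [−12, 12] and learning rate 172.8, network N uses gain one with the corresponding standard weights (β times those of M) and learning rate β² × 172.8 = 0.3, and the two outputs remain identical under on-line backpropagation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, t, W1, W2, gain, eta):
    """One on-line backpropagation step for a single-hidden-layer network
    with logistic activations of the given gain (biases omitted).
    Updates W1 and W2 in place and returns the network output."""
    h = sigmoid(gain * (W1.T @ x))        # hidden activations
    y = sigmoid(gain * (W2.T @ h))        # output activations
    d2 = gain * y * (1 - y) * (t - y)     # output-layer error signal
    d1 = gain * h * (1 - h) * (W2 @ d2)   # hidden-layer error signal
    W2 += eta * np.outer(h, d2)
    W1 += eta * np.outer(x, d1)
    return y

rng = np.random.default_rng(0)
beta = 1.0 / 24.0                         # gain of the "optical" sigmoid
# Network M: optical gain with compensated weights and learning rate.
W1 = rng.uniform(-12.0, 12.0, (3, 4))
W2 = rng.uniform(-12.0, 12.0, (4, 2))
# Network N: gain one, weights beta * w, learning rate beta**2 * 172.8 = 0.3.
V1, V2 = beta * W1, beta * W2
for _ in range(50):
    x, t = rng.uniform(-1, 1, 3), rng.uniform(0, 1, 2)
    yM = train_step(x, t, W1, W2, beta, 172.8)
    yN = train_step(x, t, V1, V2, 1.0, 0.3)
    assert np.allclose(yM, yN)            # identical outputs at every step
```

The invariant of Lemma 2, w̄ = βw, is maintained throughout training, which is exactly why the outputs never diverge.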
Figure 2: An optical activation function (solid line) and its approximation by a shifted sigmoid (dotted line).

5 Appendix
Before proving Theorem 1, a generalization of the on-line backpropagation learning rule (Rumelhart et al. 1986) is described, in which every neuron has its own (local) learning rate and activation function. The standard case of a unique learning rate and activation function corresponds to all local learning rates and activation functions being equal for the whole network. The following notation and nomenclature is used (Fiesler 1994): a (multilayer) neural network can have an arbitrary number of layers, denoted by L. The number of neurons in layer l (1 ≤ l ≤ L) is denoted by N_l, and the neurons in layer l are numbered from 1 up to N_l. Layer 1 is the input layer and layer L is the output layer of the network. Adjacent layers are fully interlayer connected. The weight from neuron i to neuron j in the next layer l is denoted by w_{l,i,j}. The activation value of this neuron is indicated as a_{l,j} (j > 0), and t_j denotes the target pattern value for output neuron j. To simplify the notation, the convention is used that a_{l−1,0} = 1, and the bias (or offset) is w_{l,0,j}. The backpropagation algorithm is described by the following five steps:

1. Initialization: Weights and biases are initialized with random values.⁴

⁴See Thimm and Fiesler (1995) for an in-depth study of weight initialization.
2. Pattern presentation: An input pattern, which is used to initialize the activation values of the neurons in the input layer, and its corresponding target pattern are presented.

3. Forward propagation: During this phase, the activation values of the input neurons are propagated layer-wise through the network. The new activation value of neuron j in layer l (2 ≤ l ≤ L) is

a_{l,j} = 1   if j = 0
a_{l,j} = φ_{l,j}(net_{l,j})   otherwise    (5.1)

where the input of a neuron j, not in the input layer, is defined as

net_{l,j} = Σ_i w_{l,i,j} a_{l−1,i}    (5.2)

where φ_{l,j} is a differentiable activation function, for example, the logistic function (equation 1.1).

4. Backward propagation: For each neuron an error signal δ is calculated, starting at the output layer and then propagating it back through the network:

δ_{L,j} = φ'_{L,j}(net_{L,j}) (t_j − a_{L,j})   for the output layer L
δ_{l,j} = φ'_{l,j}(net_{l,j}) Σ_k δ_{l+1,k} w_{l+1,j,k}   for layers 2 ≤ l ≤ L − 1    (5.3)

After the calculation of all error signals, the weights and biases are updated:

w_{l,i,j} := w_{l,i,j} + η_{l,j} δ_{l,j} a_{l−1,i}    (5.4)
where η_{l,j} denotes the learning rate of neuron j in layer l.

5. Convergence test: If no convergence, go to Pattern presentation.

The notation introduced in the formulation of the backpropagation algorithm permits the proper formulation of the proof of Theorem 1. To simplify the notation, the vector of incoming weights of neuron j in layer l is denoted by w_{l,j} and the vector of activation values of layer l by a_l. Now, using this notation, equation 5.2 can be rewritten as

net_{l,j} = w_{l,j} · a_{l−1}    (5.5)
where "·" is the inner product operator. The variables of network N (the network with activation functions with gain one) are overlined; for example: n̄et_{l,j} = w̄_{l,j} · ā_{l−1}. The proof of Theorem 1 is separated into two parts. Lemma 1 deals with the forward propagation: the networks have the same output for the same input pattern. Lemma 2 deals with the backward propagation: the conditions for Lemma 1 still hold after a training step.
Lemma 1. Two networks M and N, satisfying the preconditions given in Theorem 1, have the same activation values for corresponding neurons, that is, a_l = ā_l (for all l), if a_1 = ā_1 is forward propagated.
Proof. By induction on the number of layers, starting at the input layer.

Induction base: The activation values of the input layer neurons of networks M and N are identical, since the input patterns are identical (a_1 = ā_1).

Induction step: For neuron j, not in the input layer:

a_{l,j} = ā_{l,j}
⇔ φ_{l,j}(net_{l,j}) = φ̄_{l,j}(n̄et_{l,j})    using equation 5.1; trivially fulfilled for j = 0
⇔ φ̄_{l,j}(β_{l,j} net_{l,j}) = φ̄_{l,j}(n̄et_{l,j})    using equation 2.1: φ_{l,j}(x) = φ̄_{l,j}(β_{l,j} x)
⇔ β_{l,j} net_{l,j} = n̄et_{l,j}
⇔ β_{l,j}(w_{l,j} · a_{l−1}) = w̄_{l,j} · ā_{l−1}    using equation 5.5
⇔ β_{l,j}(w_{l,j} · a_{l−1}) = (β_{l,j} w_{l,j}) · ā_{l−1}    on account of w̄_{l,j} = β_{l,j} w_{l,j}

which is true on account of the induction hypothesis that the activation values in the lower layers are identical. Note that in the course of the proof it has also been shown that β_{l,j} net_{l,j} = n̄et_{l,j}. □

In the proof of Lemma 1 the property w̄_{l,j} = β_{l,j} w_{l,j} is used. Since the backward propagation changes the weights, it has to be shown that this property is an invariant of the backward propagation step.
Lemma 2. Consider networks M and N, with ā_{ℓ,j} = a_{ℓ,j} and net̄_{ℓ,j} = β_{ℓ,j} net_{ℓ,j} (for all ℓ and j); then, for all j and ℓ,

W̄_{ℓ,j} = β_{ℓ,j} W_{ℓ,j}   (5.6)

is invariant under the backward propagation step (if the same input and target patterns are propagated through the networks).

Proof. Let ΔW_{ℓ,j} denote the weight change η_{ℓ,j} δ_{ℓ,j} a_{ℓ-1}. One observes that equation 5.6 holds if and only if ΔW̄_{ℓ,j} = β_{ℓ,j} ΔW_{ℓ,j} (for all j and ℓ). Manipulating this expression:
ΔW̄_{ℓ,j} = β_{ℓ,j} ΔW_{ℓ,j}
⇔ η̄_{ℓ,j} δ̄_{ℓ,j} ā_{ℓ-1} = β_{ℓ,j} η_{ℓ,j} δ_{ℓ,j} a_{ℓ-1}   definition of ΔW_{ℓ,j}
⇔ β²_{ℓ,j} η_{ℓ,j} δ̄_{ℓ,j} = β_{ℓ,j} η_{ℓ,j} δ_{ℓ,j}   since η̄_{ℓ,j} = β²_{ℓ,j} η_{ℓ,j} and ā_ℓ = a_ℓ

Hence, it needs to be shown that β_{ℓ,j} δ̄_{ℓ,j} = δ_{ℓ,j}, which is done by an induction on the number of layers, starting at the output layer.
G. Thimm, P. Moerland, and E. Fiesler
Induction step: For a neuron j not in the output layer (ℓ < L), the relation β_{ℓ,j} δ̄_{ℓ,j} = δ_{ℓ,j} follows from equation 5.3 and the induction hypothesis β_{ℓ+1,k} δ̄_{ℓ+1,k} = δ_{ℓ+1,k}.

An induction over the number of pattern presentations, using these two lemmas, concludes the proof of Theorem 1. □
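The gain/learning-rate interchangeability of Theorem 1 can be checked numerically: a network N with unit gains, weights W̄ = βW, and learning rates η̄ = β²η should track a network M with gains β, weights W, and learning rates η step for step. The architecture, data, and the uniform gain β below are illustrative assumptions.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def step(weights, betas, etas, x, target):
    """One backprop step for a two-layer net with per-layer gains beta:
    a_l = logistic(beta_l * net_l).  Returns the output before the update."""
    a, nets = [x], []
    for W, b in zip(weights, betas):
        nets.append(W @ a[-1])
        a.append(logistic(b * nets[-1]))
    # phi'(x) = beta * s * (1 - s) for the logistic with gain beta
    s = a[-1]
    delta = betas[-1] * s * (1 - s) * (target - s)
    for l in range(len(weights) - 1, -1, -1):
        grad = np.outer(delta, a[l])
        if l > 0:
            s = a[l]
            delta = betas[l - 1] * s * (1 - s) * (weights[l].T @ delta)
        weights[l] += etas[l] * grad
    return a[-1]

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
beta = 2.5                                  # gain of network M
x, t = np.array([0.4, -0.2]), np.array([0.7])
eta = 0.1

# Network M: gains beta, weights W, learning rate eta.
M = [W1.copy(), W2.copy()]
# Network N: unit gains, weights beta*W, learning rates beta^2 * eta.
N = [beta * W1, beta * W2]
for _ in range(50):
    zM = step(M, [beta, beta], [eta, eta], x, t)
    zN = step(N, [1.0, 1.0], [beta**2 * eta, beta**2 * eta], x, t)

print(np.allclose(zM, zN), np.allclose(N[0], beta * M[0]))  # True True
```

Both printed checks remain true after every training step, which is exactly the invariant established by Lemma 2.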
Addendum. For completeness the authors would like to include the reference to a letter in Neural Networks (Jia et al. 1994) that was brought to their attention after the submission of this paper to Neural Computation, in which a similar theorem is presented, albeit without proof or applications. That theorem includes momentum and is related by its authors to Izui and Pentland (1990) and Sperduti and Starita (1993).
References
Bachmann, C. M. 1990. Learning and generalization in neural networks. Ph.D. thesis, Department of Physics, Brown University, Providence, RI.
Brown, M., An, P. C., Harris, C. J., and Wang, H. 1993. How biased is your multi-layer perceptron? World Congress on Neural Networks 3, 507-511.
Brown, M., and Harris, C. 1994. Neurofuzzy adaptive modelling and control. In Prentice Hall International Series in Systems and Control Engineering, M. J. Grimble, ed. Prentice-Hall, Englewood Cliffs, NJ.
Cho, T.-H., Conners, R. W., and Araman, P. A. 1991. Fast back-propagation learning using steep activation functions and automatic weight reinitialization. Proc. 1991 IEEE Int. Conf. Systems, Man, Cybernetics: Decision Aiding for Complex Syst. 3, 1587-1592.
Codrington, C., and Tenorio, M. 1994. Adaptive gain networks. Proc. IEEE Int. Conf. Neural Networks (ICNN94) 1, 339-344.
Corwin, E., Logar, A., and Oldham, W. 1994. An iterative method for training multilayer networks with threshold functions. IEEE Trans. Neural Networks 5(3), 507-508.
Fahlman, S. E. 1988. An Empirical Study of Learning Speed in Backpropagation Networks. Tech. Rep. CMU-CS-88-162, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Fiesler, E. 1994. Neural network classification and formalization. In Computer Standards & Interfaces, special issue on Neural Network Standardization, J. Fulcher, ed., Vol. 16, No. 3, pp. 231-239. North-Holland/Elsevier Science Publishers, Amsterdam, The Netherlands.
Fiesler, E., Choudry, A., and Caulfield, H. J. 1996. A universal weight discretization method for multi-layer neural networks. IEEE Transactions on Systems, Man, and Cybernetics (IEEE-SMC) (conditionally accepted for publication). See also Fiesler, E., Choudry, A., and Caulfield, H. J. 1990. A weight discretization paradigm for optical neural networks. Proc. Int. Cong. Optical Sci. Eng. SPIE 1281, 164-173.
Izui, Y., and Pentland, A. 1990. Analysis of neural networks with redundancy. Neural Comp. 2, 226-238.
Jia, Q., Hagiwara, K., Toda, N., and Usui, S. 1994. Equivalence relation between the backpropagation learning process of an FNN and that of an FNNG. Neural Networks 7(2), 411.
Kruschke, J. K., and Movellan, J. R. 1991. Benefits of gain: Speeded learning and minimal hidden layers in backpropagation networks. IEEE Trans. Syst. Man Cybern. 21(1), 273-280.
Mundie, D. B., and Massengill, L. W. 1992.
Threshold non-linearity effects on weight-decay tolerance in analog neural networks. Proc. Int. Joint Conf. Neural Networks (IJCNN92) 2, 583-587.
Plaut, D., Nowlan, S., and Hinton, G. 1986. Experiments on Learning by Back Propagation. Tech. Rep. CMU-CS-86-126, Carnegie Mellon University, Pittsburgh, PA.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. MIT Press, Cambridge, MA.
Saxena, I., and Fiesler, E. 1995. Adaptive multilayer optical neural network with optical thresholding. In Optical Engineering, special issue on Optics in Switzerland, P. Rastogi, ed., Vol. 34(8), pp. 2435-2440.
Sperduti, A., and Starita, A. 1993. Speed up learning and network optimization with extended backpropagation. Neural Networks 6, 365-383.
Thimm, G., and Fiesler, E. 1996. Weight initialization for high order and multilayer perceptrons. IEEE Trans. Neural Networks (conditionally accepted for
publication). See also Thimm, G., and Fiesler, E. 1994. Weight initialization for high order and multilayer perceptrons. In Proceedings of the '94 SIPAR Workshop on Parallel and Distributed Computing, M. Aguilar, ed., pp. 91-94. Institute of Informatics, University of Fribourg, Pérolles, Chemin du Musée 3, Fribourg, Switzerland. SI Group for Parallel Systems.
Wessels, L. F. A., and Barnard, E. 1992. Avoiding false local minima by proper initialization of connections. IEEE Trans. Neural Networks 3, 899-905.
Yu, X., Loh, N., and Miller, W. 1994. Training hard-limiting neurons using back-propagation algorithm by updating steepness factors. Proc. IEEE Int. Conf. Neural Networks (ICNN94) 1, 526-530.
Zurada, J. M. 1992. Introduction to Artificial Neural Systems. West Publishing Company, St. Paul, MN.
Received November 29, 1994; accepted March 13, 1995.
ARTICLE
Communicated by Chris Bishop and Fernando Pineda
A Smoothing Regularizer for Feedforward and Recurrent Neural Networks

Lizhong Wu
John Moody
Computer Science Department, Oregon Graduate Institute, Portland, OR 97291-1000 USA

We derive a smoothing regularizer for dynamic network models by requiring robustness in prediction performance to perturbations of the training data. The regularizer can be viewed as a generalization of the first-order Tikhonov stabilizer to dynamic models. For two-layer networks with recurrent connections described by
Y(t) = f[W Y(t − τ) + V X(t)]
Z(t) = U Y(t)
the training criterion with the regularizer is

D = (1/N) Σ_{t=1}^{N} {Z(t) − F[Φ, I(t)]}² + λ ρ_τ²(Φ)

where Φ = {U, V, W} is the network parameter set, Z(t) are the targets, I(t) = {X(s), s = 1, 2, . . . , t} represents the current and all historical input information, N is the size of the training data set, ρ_τ²(Φ) is the regularizer, and λ is a regularization parameter. The closed-form expression for the regularizer for time-lagged recurrent networks is
where ‖·‖ is the Euclidean matrix norm and γ is a factor that depends upon the maximal value of the first derivatives of the internal unit activations f(·). Simplifications of the regularizer are obtained for simultaneous recurrent nets (τ → 0), two-layer feedforward nets, and one-layer linear nets. We have successfully tested this regularizer in a number of case studies and found that it performs better than standard quadratic weight decay.

1 Introduction
One technique for preventing a neural network from overfitting noisy data is to add a regularizer to the error function being minimized.

Neural Computation 8, 461-489 (1996)
© 1996 Massachusetts Institute of Technology

Regularizers typically smooth the fit to noisy data.¹ Well-established techniques include ridge regression (see Hoerl and Kennard 1970a,b) and, more generally, spline smoothing functions or Tikhonov stabilizers that penalize the mth-order squared derivatives of the function being fit, as in Tikhonov and Arsenin (1977), Eubank (1988), Hastie and Tibshirani (1990), and Wahba (1990). These methods have recently been extended to networks of radial basis functions (Powell 1987; Poggio and Girosi 1990; Girosi et al. 1995), and several heuristic approaches have been developed for sigmoidal neural networks, for example, quadratic weight decay (Plaut et al. 1986), weight elimination (Rumelhart 1986; Scalettar and Zee 1988; Chauvin 1990; Weigend et al. 1990), soft weight sharing (Nowlan and Hinton 1992), and curvature-driven smoothing (Bishop 1993).² Quadratic weight decay (which is equivalent to ridge regression) and weight elimination are frequently used "on-line" during stochastic gradient learning. The other regularizers listed above are not generally used with on-line algorithms, but rather with "batch" or deterministic optimization methods.

All previous studies on regularization have concentrated on feedforward neural networks. To our knowledge, recurrent learning with regularization has not been reported before. In Section 2 of this paper, we develop a smoothing regularizer for general dynamic models, which is derived by considering perturbations of the training data. We demonstrate that this regularizer corresponds to a dynamic generalization of the well-known first-order Tikhonov stabilizer. We then present a closed-form expression for our regularizer for two-layer feedforward and recurrent neural networks, with standard weight decay being a special case. In Section 3, we evaluate our regularizer's performance on a number of applications, including regression with feedforward and recurrent neural networks and predicting the U.S. Index of Industrial Production.
The advantage of our regularizer is demonstrated by comparing it to standard weight decay in both feedforward and recurrent modeling. Finally, we discuss several related questions and conclude the paper in Section 4.
2 Smoothing Regularization
2.1 Prediction Error for Perturbed Data Sets. Consider a training data set {P : Z(t), X(t)}, where the targets Z(t) are assumed to be generated by an unknown dynamic system F*[I(t)] and an unobserved noise process:

Z(t) = F*[I(t)] + ε*(t)   with   I(t) = {X(s), s = 1, 2, . . . , t}   (2.1)

¹Other techniques to prevent overfitting include early stopping of training [which can be viewed as having an effect similar to weight decay (Sjöberg and Ljung 1992, 1995)] and using prior knowledge in the form of hints (see Abu-Mostafa 1995; Tresp et al. 1993, and references therein). Smoothing regularizers can be viewed as a special class of hints.
²Two additional papers related to ours, but dealing only with feedforward networks, came to our attention or were written after our work was completed. These are Bishop (1995) and Leen (1995). Also, Moody and Rognvaldsson (1995) have recently proposed two new classes of smoothing regularizers for feedforward nets.
Here, I(t) is the information set containing both current and past inputs X(s), and the ε*(t) are independent random noise variables with zero mean and variance σ*². Consider next a dynamic network model Z(t) = F[Φ, I(t)] to be trained on data set P, where Φ represents a set of network parameters, and F(·) is a network transfer function, which is assumed to be nonlinear and dynamic. We assume that F(·) has good approximation capabilities, such that F[Φ_P, I(t)] ≈ F*[I(t)] for learnable parameters Φ_P. Our goal is to derive a smoothing regularizer for a network trained on the actual data set P that in effect optimizes the expected network performance (prediction risk) on perturbed test data sets of the form {Q : Z̃(t), X̃(t)}. The elements of Q are related to the elements of P via small random perturbations ε_z(t) and ε_x(t), so that
Z̃(t) = Z(t) + ε_z(t)   (2.2)
X̃(t) = X(t) + ε_x(t)   (2.3)
The ε_z(t) and ε_x(t) have zero mean and variances σ_z² and σ_x², respectively. The training and test errors for the data sets P and Q are
D_P = (1/N) Σ_{t=1}^{N} {Z(t) − F[Φ_P, I(t)]}²   (2.4)

D_Q = (1/N) Σ_{t=1}^{N} {Z̃(t) − F[Φ_P, Ĩ(t)]}²   (2.5)
where Φ_P denotes the network parameters obtained by training on data set P, and Ĩ(t) = {X̃(s), s = 1, 2, . . . , t} is the perturbed information set of Q. With this notation, our goal is to minimize the expected value of D_Q while training on D_P. Consider the prediction error for the perturbed data point at time t:
d(t) = {Z̃(t) − F[Φ_P, Ĩ(t)]}²   (2.6)
Assuming that ε_z(t) is uncorrelated with {Z(t) − F[Φ_P, Ĩ(t)]} and averaging over the exemplars of data sets P and Q, equation 2.7 becomes equation 2.8. The third term of equation 2.8, Σ_{t=1}^{N} [ε_z(t)]², is independent of the weights, so it can be neglected during the learning process. The fourth term in equation 2.8 is the cross-covariance between {Z(t) − F[Φ_P, I(t)]} and {F[Φ_P, I(t)] − F[Φ_P, Ĩ(t)]}. We argue in Appendix A that this term can also be neglected.
2.2 The Dynamic Smoothing Regularizer and Tikhonov Correspondence. The above analysis shows that the expected test error D_Q can be minimized by minimizing the objective function D of equation 2.9. In equation 2.9, the second term is the time average of the squared disturbance ‖Z̃(t) − Z(t)‖² of the trained network output due to the input perturbation ‖Ĩ(t) − I(t)‖². Minimizing this term demands that small changes in the input variables yield correspondingly small changes in the output. This is the standard smoothness prior, namely that if nothing else is known about the function to be approximated, a good option is to assume a high degree of smoothness. Without knowing the correct functional form of the dynamic system F*, or using such prior assumptions, the data-fitting problem is ill-posed.

It is straightforward to see that the second term in equation 2.9 corresponds to the standard first-order Tikhonov stabilizer.³ Expanding to first order in the input perturbations ε_x, the expectation value of this term becomes equation 2.10.
³Bishop (1995) has independently made this observation for the case of feedforward networks.
If the dynamics are trivial, so that the mapping F* has no recurrence, then equation 2.11 holds, and equation 2.10 reduces to equation 2.12.
This is the usual first-order Tikhonov stabilizer weighted by the empirical distribution. In equations 2.10 and 2.12, σ_x² plays the role of a regularization parameter λ that determines the compromise between the degree of smoothness of the solution and its fit to the noisy training data. This is the usual bias/variance tradeoff (see Geman et al. 1992). A reasonable choice for the value of σ_x² is to set it proportional to the average squared nearest-neighbor distance in the input data. For normalized input data (e.g., where each variable has zero mean and unit variance), one can estimate the average squared nearest-neighbor distance as
σ_x² ≈ K N^{−2/𝒟}   (2.13)
where 𝒟 is the intrinsic dimension of the input data (less than or equal to the number of input variables) and K is a geometric factor (of order unity in low dimensions).

To summarize this section, the training objective function D of equation 2.9 can be written in the approximate form of equation 2.14, where the second term is a dynamic generalization of the first-order Tikhonov stabilizer.

2.3 Form of the Proposed Smoothing Regularizer for Two-Layer Networks. Consider a general, two-layer, nonlinear, dynamic network with recurrent connections on the internal layer⁴ as described by
Y(t) = f[W Y(t − τ) + V X(t)]
Z(t) = U Y(t)   (2.15)
where X(t), Y(t), and Z(t) are, respectively, the network input vector, the hidden output vector, and the network output; Φ = {U, V, W} comprises the output, input, and recurrent connections of the network; f(·) is the vector-valued nonlinear transfer function of the hidden units; and τ is

⁴Our derivation can easily be extended to other network structures.
a time delay in the feedback connections of the hidden layer, which is predefined by the user and is not changed during learning. τ can be zero, a fraction, or an integer, but we are interested in the cases with a small τ.⁵ When τ = 1, our model is a recurrent network as described by Elman (1990) and Rumelhart et al. (1986) (see Fig. 17 on p. 355). When τ is equal to some fraction smaller than one, the network evolves 1/τ times within each input time interval.⁶ When τ decreases and approaches zero, our model is the same as the network studied by Pineda (1989), and as earlier, widely studied recurrent networks⁷ (see, for example, Grossberg 1969; Amari 1972; Sejnowski 1977; Hopfield 1984). In Pineda (1989), τ was referred to as the network relaxation time scale. Werbos (1992) distinguished the recurrent networks with zero τ and nonzero τ by calling them simultaneous recurrent networks and time-lagged recurrent networks, respectively. We show in Appendix B that minimizing the second term of equation 2.9 can be achieved by smoothing the output response to an input perturbation at every time step, and we have
‖Z̃(t) − Z(t)‖² ≤ ρ_τ²(Φ_P) ‖X̃(t) − X(t)‖²   for t = 1, 2, . . . , N   (2.16)
We call ρ_τ²(Φ_P) the output sensitivity of the trained network Φ_P to an input perturbation; ρ_τ²(Φ_P) is determined by the network parameters only and is independent of the time variable t. We obtain our new regularizer by training directly on the expected prediction error for perturbed data sets Q. Based on the analysis leading to equations 2.9 and 2.16, the training criterion thus becomes

D = (1/N) Σ_{t=1}^{N} {Z(t) − F[Φ, I(t)]}² + λ ρ_τ²(Φ)   (2.17)

As in equation 2.14, the coefficient λ in equation 2.17 is a regularization parameter that measures the degree of input perturbation ‖Ĩ(t) − I(t)‖². Note that the subscript P has been dropped from Φ, since D is now the training objective function for any set of weights. Also note, in comparing equation 2.17 to equation 2.14, that the sum over the past history indexed by s no longer appears, and that a trivial factor (1/N) Σ_{t=1}^{N} 1 = 1 has been dropped. These simplifications are due to our minimizing the zero-memory response at each time step during training, as described after equation B.23 in Appendix B.

⁵When the time delay τ exceeds some critical value, a recurrent network becomes unstable and lies in oscillatory modes. See, for example, Marcus and Westervelt (1989).
⁶When τ is a fraction smaller than one, the hidden node's function can be described by Y(t + kτ − 1) = f{W Y[t + (k − 1)τ − 1] + V X(t)} for k = 1, 2, . . . , 1/τ. The input X(t) is kept fixed during the above evolution.
⁷These were called additive networks.
The algebraic form of ρ_τ(Φ), as derived in Appendix B, is given by equation 2.18 for time-lagged recurrent networks (τ > 0). Here, ‖·‖ denotes the Euclidean matrix norm.⁸ The factor γ depends upon the maximal value of the first derivatives of the activation functions of the hidden units and is given by

γ = max_{t,j} |f′[o_j(t)]|   (2.19)

where j is the index of hidden units and o_j(t) is the input to the jth unit. In general, γ ≤ 1.⁹ To ensure stability, so that the effects of small input perturbations are damped out, it is required that

γ ‖W‖ < 1
(2.20)
For simultaneous recurrent networks, the regularizer of equation 2.18 can be deduced in the limit τ → 0, giving equation 2.21.
If the network is feedforward, then W = 0 and τ = 0, and equations 2.18 and 2.21 become

ρ(Φ) = γ ‖U‖ ‖V‖   (2.22)
Moreover, if there is no hidden layer and the inputs are directly connected to the outputs via U, the network is an ordinary linear model, and we obtain

ρ(Φ) = ‖U‖   (2.23)
which is standard quadratic weight decay (Plaut et al. 1986) as is used in ridge regression (see Hoerl and Kennard 1970a,b). The regularizer (equation 2.22 for feedforward networks and equation 2.18 for recurrent networks) was obtained by requiring smoothness of the network output to perturbations of the data. We therefore refer to it as

⁸The Euclidean norm of a real M × N matrix A is defined via AᵀA, where Aᵀ is the transpose of A and a_ij is the element of A.
⁹For instance, f′(x) = [1 − f(x)] f(x) if f(x) = 1/(1 + e^{−x}). In this case, γ = max |f′(x)| = 1/4, attained at x = 0. If |x| is much larger than 0, then f(x) operates in its asymptotic region, and |f′(x)| will be far less than 1/4. In fact, γ is exponentially small in this case.
a smoothing regularizer. Several approaches can be applied to estimate the regularization parameter λ, as in Eubank (1988), Hastie and Tibshirani (1990), and Wahba (1990). We will not discuss this subject in this paper. After including the regularization term in training, the weight update equation becomes
ΔΦ = −η ∇_Φ D = −η { ∇_Φ D_P + λ ∇_Φ[ρ_τ²(Φ)] }   (2.24)
where η is a learning rate. With our smoothing regularizer, the components of ∇_Φ[ρ_τ²(Φ)] are computed as in equations 2.25-2.27,
where u_ij, v_ij, and w_ij are the elements of U, V, and W, respectively. For simultaneous recurrent networks, equation 2.27 becomes equation 2.28.

When standard weight decay is used, the regularizer for equation 2.15 is given by equation 2.29, and the corresponding update equations for this case are

∂ρ²(Φ)/∂u_ij = 2 u_ij   (2.30)
∂ρ²(Φ)/∂v_ij = 2 v_ij   (2.31)
∂ρ²(Φ)/∂w_ij = 2 w_ij   (2.32)
In contrast to our smoothing regularizer, quadratic weight decay treats all network weights identically: it makes no distinction between recurrent and input/output weights, and it takes into account no interactions between weight values. In the next section, we evaluate the new regularizer in a number of tests. In each case, we compare the networks trained with the smoothing regularizer to those trained with standard weight decay.
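The contrast between the two penalties can be made concrete with the feedforward special case, equation 2.22. Two assumptions are flagged in the code: the Euclidean matrix norm is taken here as the spectral norm √λmax(AᵀA), one reading of footnote 8, and γ = 1/4 is the logistic bound from footnote 9; the weight matrices themselves are arbitrary examples.

```python
import numpy as np

def euclidean_norm(A):
    # Spectral norm sqrt(lambda_max(A^T A)); an assumed reading of the
    # paper's "Euclidean matrix norm" defined via A^T A.
    return float(np.linalg.norm(A, 2))

def smoothing_penalty_ff(U, V, gamma=0.25):
    """Feedforward special case of the smoothing regularizer (squared):
    rho(Phi) = gamma * ||U|| * ||V||   (equation 2.22).
    gamma = 1/4 is the maximal slope of the logistic activation."""
    return (gamma * euclidean_norm(U) * euclidean_norm(V)) ** 2

def weight_decay_penalty(U, V, W=None):
    # Standard quadratic weight decay treats every weight identically.
    total = np.sum(U ** 2) + np.sum(V ** 2)
    if W is not None:
        total += np.sum(W ** 2)
    return float(total)

rng = np.random.default_rng(0)
U = rng.normal(size=(1, 2))   # output weights (example values)
V = rng.normal(size=(2, 9))   # input weights (example values)
print(smoothing_penalty_ff(U, V), weight_decay_penalty(U, V))
```

Note the qualitative difference: the smoothing penalty depends on the product of the norms of U and V, so rescaling U by c and V by 1/c leaves it unchanged, whereas quadratic weight decay penalizes every weight independently.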
3 Empirical Tests
In this section, we demonstrate the efficacy of our smoothing regularizer via three empirical tests. The first two tests are on regression with synthetic data, and the third test is on predicting the monthly U.S. Index of Industrial Production.

3.1 Regression with Feedforward Networks. We form a set of data generated by a predefined function G. The data are contaminated by some degree of zero-mean gaussian noise before being used for training. Our task is to train the networks so that they estimate G. We first study function estimation with feedforward networks and then extend it to the case with recurrent networks.
3.1.1 Data. The data in this test are synthesized with the function of equation 3.1, where x is uniformly distributed within [−10, 10], ε is normally distributed with zero mean and variance σ², and a and b are two constants. In our test, we set a = 1 and b = 5.
3.1.2 Model. The model we have used for the above data is a two-hidden-unit, feedforward network with sigmoidal functions at the hidden units and a linear function at a single output unit, described by equation 3.2. The model overall has 7 weight parameters.

3.1.3 Performance Measure. The criterion used to evaluate the model performance is the true mean squared error (MSE) minus the noise variance σ²:
E = ∫ {Z(x) − G(x)}² p(x) dx   (3.3)

where G(x) is the noiseless source function given by the first part of equation 3.1, Z(x) is the network response function as given by equation 3.2, and p(x) is the probability density of x. In this experiment, p(x) is uniform within [−x₀, x₀] and x₀ = 10.
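The measure of equation 3.3 can be estimated by Monte Carlo over the uniform density p(x). Since equation 3.1 is not legible in this copy, the source function G and the fitted "network" below are stand-ins, not the ones used in the experiments.

```python
import numpy as np

def true_mse(net, G, x0=10.0, n=100_000, seed=0):
    """Monte Carlo estimate of  integral (Z(x) - G(x))^2 p(x) dx
    with p(x) uniform on [-x0, x0], i.e., equation 3.3."""
    x = np.random.default_rng(seed).uniform(-x0, x0, n)
    return float(np.mean((net(x) - G(x)) ** 2))

# Stand-in source function and a crude "network" fit (both hypothetical).
G = np.sin
net = lambda x: 0.9 * np.sin(x)
err = true_mse(net, G)
print(err)   # roughly 0.01 * mean(sin^2 x) over [-10, 10], about 0.0048
```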
Table 1: Comparison of the Performances (as Measured by Equation 3.3) between the Feedforward Networks Trained with the Smoothing Regularizer and Those Trained with Standard Weight Decay for the Function Estimation.ᵃ

Number of           Noise      With standard      With smoothing
training patterns   variance   weight decay       regularizer
11                  0.1        0.037 ± 0.011      0.020 ± 0.003
                    0.5        0.137 ± 0.003      0.076 ± 0.028
                    1.0        0.151 ± 0.000      0.117 ± 0.011
21                  0.1        0.014 ± 0.003      0.010 ± 0.000
                    0.5        0.061 ± 0.004      0.048 ± 0.042
                    1.0        0.097 ± 0.005      0.068 ± 0.009
41                  0.1        0.011 ± 0.000      0.008 ± 0.000
                    0.5        0.038 ± 0.001      0.028 ± 0.000
                    1.0        0.066 ± 0.001      0.050 ± 0.000

ᵃThe results shown are the mean and the standard deviation over 10 models with different initial weights.
3.1.4 Results. Comparisons between the smoothing regularizer and standard weight decay are listed in Table 1. The performance comparisons are made for a number of cases: the numbers of training patterns are 11, 21, and 41, and the noise variances are 0.1, 0.5, and 1.0. To observe the effect of the regularization parameters, we did not use their estimated values. Instead, the regularization parameters for both the smoothing regularizer and standard weight decay are varied from 0 to 0.1 with step-size 0.001. Figure 1 shows the downsampled training and test errors versus the regularization parameters. The performances in Table 1 are the optimal results over all these regularization parameters; this gives the best potential result each network can obtain. Unlike our other tests on real-world applications, neither early stopping nor validation was applied in this test. Each network was trained over 5000 epochs; it was found that for all networks, the training error did not decrease significantly after 5000 training epochs. Under these conditions, the task of preventing the network from overtraining or overfitting depends entirely on the regularization. We believe that such results more directly reflect, and more precisely compare, the efficacy of different regularizers. Table 1 shows that the potential predictive errors with the smoothing regularizer are smaller than those with standard weight decay. Figure 2 gives an example and compares the approximation functions obtained with standard weight decay and with our smoothing regularizer to the true function. The function obtained with our smoothing regularizer is clearly closer to the true function than that obtained with standard weight decay.
Figure 1: Training (upper panel) and test (lower panel) errors versus regularization parameters. Networks trained with ordinary weight decay are plotted by "+," and those trained with smoothing regularizers are plotted by "*." Ten different networks are shown for each case. The curves are the median errors of these 10 networks.
Figure 2: Comparing the estimated function obtained with our smoothing regularizer (dashed curve) and that with standard weight decay (dotted curve) to the true function (solid curve). The "+" symbols plot 21 training patterns that are uniformly distributed along the x axis and normally distributed along the f(x) direction with noise variance 1. The model is a two-node, feedforward network.
3.2 Regression with Recurrent Networks.

3.2.1 Data. For recurrent network modeling, we synthesized a data sequence of N samples with the dynamic function given by equations 3.4-3.7, where t = 0, 1, . . . , N − 1, ε(t) is normally distributed with zero mean and variance σ², and a = 1 and b = 5 are two constants. Two dummy variables, y₁(t) and y₂(t), evolve from their previous values. In our test, we initialize y₁(t) = y₂(t) = 0.
Table 2: Comparison between the Recurrent Networks Trained with the Smoothing Regularizer and Those Trained with Standard Weight Decay for the Function Estimation Task.ᵃ

Number of           Noise      With standard      With smoothing
training patterns   variance   weight decay       regularizer
11                  0.1        0.035 ± 0.006      0.020 ± 0.002
                    0.5        0.123 ± 0.008      0.067 ± 0.007
                    1.0        0.151 ± 0.000      0.111 ± 0.015
21                  0.1        0.014 ± 0.001      0.009 ± 0.000
                    0.5        0.058 ± 0.002      0.037 ± 0.001
                    1.0        0.095 ± 0.004      0.071 ± 0.001
41                  0.1        0.009 ± 0.000      0.006 ± 0.000
                    0.5        0.032 ± 0.001      0.024 ± 0.005
                    1.0        0.057 ± 0.001      0.039 ± 0.016

ᵃThe results shown are averaged over 10 different initial weights.
3.2.2 Model. The model we have used for the above data is a two-hidden-unit, recurrent network with sigmoidal functions at the hidden units and a linear function at a single output unit. The output of each hidden unit is time-delayed and fed back to the input of the other hidden unit. It can be described by equations 3.8-3.10, where y₁(t) and y₂(t) correspond to the two hidden-unit outputs. The model overall has 9 weight parameters.
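Equations 3.8-3.10 are not legible in this copy, but the architecture described in the text (the general form of equation 2.15 with τ = 1, two cross-coupled hidden units, and 2 + 2 + 2 + 3 = 9 parameters once biases are counted) can be sketched as follows; the parameter values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_recurrent(x, V, W, U, b_h, b_o):
    """Two-hidden-unit time-lagged recurrent net (equation 2.15, tau = 1):
    y(t) = f[W y(t-1) + V x(t) + b_h],  z(t) = U y(t) + b_o.
    With cross-coupled hidden units W has a zero diagonal, giving
    2 (V) + 2 (W) + 2 (U) + 3 (biases) = 9 weight parameters."""
    y = np.zeros(2)                    # y1(0) = y2(0) = 0 as in the text
    z = []
    for xt in x:
        y = sigmoid(W @ y + V * xt + b_h)
        z.append(U @ y + b_o)
    return np.array(z)

# Hypothetical parameter values, for illustration only.
V = np.array([0.5, -0.3])              # input weights
W = np.array([[0.0, 0.4],              # each hidden unit feeds the other
              [-0.2, 0.0]])
U = np.array([1.0, -1.0])              # linear output weights
b_h, b_o = np.array([0.1, -0.1]), 0.05

x = np.sin(0.3 * np.arange(20))        # a short synthetic input sequence
z = run_recurrent(x, V, W, U, b_h, b_o)
print(z.shape)                         # one scalar output per time step
```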
3.2.3 Results. The performance measure is the same as equation 3.3 in the case of feedforward networks. Table 2 lists the performances of the recurrent networks trained with standard weight decay and those trained with our smoothing regularizer; the table shows the results for the best values of the regularization parameters. It again shows that, in all cases, the smoothing regularizer outperforms standard weight decay. For all networks listed in Table 2, the numbers of training patterns are 11, 21, and 41, and the noise variances are 0.1, 0.5, and 1.0. The regularization parameters for both the smoothing regularizer and standard weight decay are varied from 0 to 0.1 with step-size 0.001.
3.3 Predicting the U.S. Index of Industrial Production.
3.3.1 Data. The Index of Industrial Production (IP) is one of the key measures of economic activity; it is computed and published monthly. Our task is to predict the 1-month rate of change of the index from January 1980 to December 1989 for models trained from January 1950 to December 1979. The exogenous inputs we have used include 8 time series, such as the index of leading indicators, housing starts, the money supply M2, and the S&P 500 Index; these 8 series are also recorded monthly. In previous studies by Moody et al. (1993), with the same training and test data sets, the normalized prediction errors of the 1-month rate of change were 0.81 with the neuz neural network simulator and 0.75 with the proj neural network simulator.¹⁰
3.3.2 Model. We have simulated feedforward and recurrent neural network models. Both models consist of two layers. There are 9 input units in the recurrent model, which receive the 8 exogenous series and the previous month IP index change. We set the time-delayed length in the recurrent connections 7 = 1. The feedforward model is constructed with 36 input units, which receive 4 time-delayed versions of each input series. The time-delay lengths are 1, 3, 6, and 12, respectively. The activation functions of hidden units in both feedforward and recurrent models are tai7l7 functions. The number of hidden units varies from 2 to 6. Each model has one linear output unit. 3.3.3 Trnrnrrzg. We have divided the data from January 1950 to December 1979 into four nonoverlapping subsets. One subset consists of 70% of the original data and each of the other three subsets consists of 10% of the original data. The larger subset is used as training data and the three smaller subsets are used as validation data. These three validation data sets are, respectively, used for determination of early stopped training, selecting the regularization parameter, and selecting the number of hidden units. We have formed 10 random training-validation partitions. For each training-validation partition, three networks with different initial weight parameters are trained. Therefore, our prediction committee is formed by 30 networks. The committee error is the average of the errors of all committee members. All networks in the committee are trained simultaneously and "'The neuz networks were trained using stochastic gradient descent, early stopping via a validation set, and the PCP regularization method proposed by Levin e t a / . (1994). The proj networks were trained using the Levenburg-Marquardt algorithm, and network pruning after training was accomplished via the methods described in Moody and Utans (1992). 
The internal layer nonlinearities for the neuz networks were sigmoidal, while some of the proj networks included quadratic nonlinearities as described in Moody and Yarvin (1992).
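The construction of the feedforward model's 36 inputs (4 lags of each of the 9 monthly series) can be sketched as follows. This is our illustration only; the series names and lag convention are assumptions, not the authors' code.

```python
def lagged_inputs(series_values, t, lags=(1, 3, 6, 12)):
    """series_values: dict mapping series name -> list of monthly values.
    Returns the feedforward input vector at month t: 9 series x 4 lags = 36 inputs."""
    return [series_values[name][t - lag]
            for name in sorted(series_values)
            for lag in lags]

# Nine hypothetical monthly series (8 exogenous plus the previous IP change).
series = {f"s{i}": list(range(100)) for i in range(9)}
x = lagged_inputs(series, t=50)
assert len(x) == 36
assert x[:4] == [49, 47, 44, 38]  # lags 1, 3, 6, 12 of the first series
```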
Smoothing Regularizer for Recurrent Networks
Table 3: Normalized Prediction Errors for the 1-Month Rate of Return on the U.S. Index of Industrial Production (Jan. 1980-Dec. 1989).ᵃ

Model / Regularizer      Mean ± Std      Median   Max     Min     Committee
Recurrent networks
  Smoothing              0.646 ± 0.008   0.647    0.657   0.632   0.639
  Weight decay           0.734 ± 0.018   0.737    0.767   0.704   0.734
Feedforward networks
  Smoothing              0.700 ± 0.023   0.707    0.729   0.654   0.693
  Weight decay           0.745 ± 0.043   0.748    0.805   0.676   0.731

ᵃEach result is based on 30 networks.
stopped at the same time based on the committee error of a validation set. The value of the regularization parameter and the number of hidden units are determined by minimizing the committee error on separate validation sets.
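The committee construction described above can be sketched as follows. This is a minimal illustration; the function and data names are ours, not the authors' implementation.

```python
import random

def committee_prediction(member_preds):
    """The committee prediction at each time step is the simple average
    of the member networks' predictions at that step."""
    n = len(member_preds)
    return [sum(p[t] for p in member_preds) / n
            for t in range(len(member_preds[0]))]

# 30 members = 10 random train/validation partitions x 3 weight initializations.
random.seed(0)
members = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(30)]
avg = committee_prediction(members)
assert len(avg) == 5
```

Averaging over members with independent initializations and data partitions reduces the variance of the individual predictors; the committee error used for early stopping is computed from this averaged prediction.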
3.3.4 Results. Table 3 lists the results over the test data set. The performance measure is the normalized prediction error as used in Moody et al. (1993), which is defined as

    E = Σ_{t∈Ω} [S(t) − Ŝ(t)]² / Σ_{t∈Ω} [S(t) − S̄]²    (3.11)

where S(t) stands for the observations, Ŝ(t) for the predictions, Ω represents the test data set, and S̄ is the mean of S(t) over the training data set. This measure evaluates prediction accuracy by comparison to a trivial predictor that uses the mean of the training data as its prediction. Table 3 also compares the out-of-sample performance of recurrent networks and feedforward networks trained with our smoothing regularizer to that of networks trained with standard weight decay. The results are based on 30 networks. As shown, the smoothing regularizer again outperforms standard weight decay with 95% confidence (under a t-test) for both recurrent networks and feedforward networks. We also list the median, maximal, and minimal prediction errors over the 30 predictors. The last column gives the committee results, which are based on the simple average of the 30 network predictions. We see that the median, maximal, and minimal values and the committee results obtained with the smoothing regularizer are all smaller than those obtained with standard weight decay, in both the recurrent and feedforward network models. Figure 3 plots the change of the prediction errors with the regularization parameter in recurrent neural network modeling. As shown by the figure, the prediction error over the training data set increases with the regularization parameter, while the prediction errors over the validation and test data
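The normalized prediction error of equation 3.11 can be computed as in the following sketch (function and variable names are ours; the data are hypothetical).

```python
def normalized_error(observations, predictions, train_mean):
    """Squared prediction error normalized by the error of a trivial
    predictor that always outputs the training-set mean."""
    num = sum((s - p) ** 2 for s, p in zip(observations, predictions))
    den = sum((s - train_mean) ** 2 for s in observations)
    return num / den

# A predictor that always outputs the training mean scores 1.0 by construction,
# so values below 1.0 (as in Table 3) indicate genuine predictive power.
obs = [0.2, -0.1, 0.4, 0.0]
train_mean = 0.1
assert abs(normalized_error(obs, [train_mean] * 4, train_mean) - 1.0) < 1e-12
```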
Lizhong Wu and John Moody
sets first decrease and then increase with the regularization parameter. The optimal regularization parameter with the least validation error is 0.8 with our smoothing regularizer and 0.03 with standard weight decay. In all cases, we have found that the regularization parameter should be larger than zero to achieve optimal prediction performance. This confirms the necessity of regularization during training, in addition to early stopping of training. We have observed and compared the weight histograms of the networks trained with our smoothing regularizer and those trained with standard weight decay. As demonstrated in Figure 4, although the distribution has heavy tails, most weight parameters in the networks with the smoothing regularizer are concentrated on small values. The distribution is more like a symmetric α-stable (SαS)¹¹ distribution than a gaussian distribution. This is also consistent with the soft weight-sharing approach proposed by Nowlan and Hinton (1992), in which a gaussian mixture is used to model the weight distribution. We thus believe that our smoothing regularizer provides a more effective means of differentiating "essential" large weights from "irrelevant" small weights than does standard weight decay.
3.3.5 With and Without Early Stopping of Training. The results shown in Table 3 and Figures 3 and 4 are for networks trained with both the regularization and early stopping techniques. From Figure 3, we see that the prediction performance is far worse than the optimum if the network is trained with just early stopping and no regularization (λ = 0). Another case is that the network is trained with regularization but without early stopping. We compare the performance of networks trained with regularization and early stopping against networks trained with regularization but without early stopping in Table 4. For the latter networks, the 10% of the data originally used for early stopping are now used in training. All other training conditions are the same for both cases. From the table, we see that the performance of the networks without early stopping is slightly worse than that of the networks trained with regularization and early stopping simultaneously. However, the difference is small in terms of the median or committee errors, even though the deviation of the prediction errors has increased.

3.3.6 Stability Analysis. In Section 2, we found that equation 2.20 (i.e., γ‖W‖ < 1) must hold to ensure stability, so that the effects of small input perturbations are damped out. Figure 5 shows the value of γ‖W‖ for the trained networks. The networks trained with the optimal regularization parameter satisfy the inequality, whereas those networks trained with regularization parameters much larger or smaller than the optimal value do not satisfy the stability requirement.

¹¹See, for example, Shao and Nikias (1993).
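The stability diagnostic γ‖W‖ < 1 of equation 2.20 can be sketched numerically. We assume here that ‖·‖ is the spectral norm (largest singular value), estimated by power iteration, and that γ = max |f′| = 1 for tanh hidden units; the weight matrix is an arbitrary example of ours.

```python
def spectral_norm(W, iters=200):
    """Estimate the largest singular value of matrix W by power iteration
    on the right singular vector (v <- W^T W v, normalized)."""
    n = len(W[0])
    v = [1.0] * n
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(n)) for i in range(len(W))]
        v = [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    u = [sum(W[i][j] * v[j] for j in range(n)) for i in range(len(W))]
    return sum(x * x for x in u) ** 0.5

gamma = 1.0  # max |tanh'| = 1
W = [[0.3, -0.2], [0.1, 0.4]]  # an example recurrent weight matrix
assert gamma * spectral_norm(W) < 1.0  # this network passes the stability check
```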
Figure 3: Regularization parameter vs. normalized prediction errors for the task of predicting the U.S. Index of Industrial Production. The example given is for a recurrent network trained with either the smoothing regularizer (upper panel) or standard weight decay (lower panel). For the smoothing regularizer, the optimal regularization parameter that leads to the least validation error is 0.8 corresponding to a test error of 0.646. For standard weight decay, the optimal regularization parameter is 0.03 corresponding to a test error of 0.734. The new regularizer thus yields a 12% reduction of test error relative to that obtained using quadratic weight decay.
Figure 4: Comparison of the weight histograms of the recurrent networks trained with our smoothing regularizer and those trained with standard weight decay. Each histogram summarizes 30 networks trained on the IP task. The smoothing regularizer yields a symmetric α-stable (or leptokurtic) distribution of weights (large peak near zero and long tails), whereas quadratic weight decay produces a distribution that is closer to a gaussian. The smoothing regularizer thus distinguishes more sharply between "essential" (large) weights and "nonessential" (near-zero-valued) weights. The near-zero-valued weights can be pruned.
4 Concluding Remarks and Discussion
Regularization in learning can prevent a network from overtraining. Several techniques have been developed in recent years, but all of these are specialized for feedforward networks. To the best of our knowledge, a regularizer for recurrent networks has not been reported previously. We have developed a smoothing regularizer for recurrent neural networks that captures the dependencies of the input, output, and feedback weight values on each other. The regularizer covers both simultaneous
Table 4: Comparison of Prediction Performance of the Networks Trained with and without Early Stopping of Training.ᵃ

Training                  Mean ± Std      Median   Max     Min     Committee
With early stopping       0.646 ± 0.008   0.647    0.657   0.632   0.639
Without early stopping    0.681 ± 0.057   0.664    0.938   0.643   0.657

ᵃResults given in the table are the normalized prediction errors for the IP task, as in Table 3. All results are based on 30 recurrent networks. Whether trained with early stopping or not, the networks are trained with the smoothing regularizer.
Figure 5: Regularization parameter vs. γ‖W‖ of trained recurrent networks. The networks are trained to predict the U.S. Index of Industrial Production. For each regularization parameter, 30 networks have been trained. Each network has four hidden units. The smoothing regularizer and early stopping are both used during learning. From Figure 3, we know that the optimal regularization parameter for these networks is 0.8. This figure plots the mean values of γ‖W‖ of these 30 networks, with the error bars indicating the maximal and minimal values. As shown, the networks with the optimal regularization parameter have γ‖W‖ < 1. This confirms the networks' stability, in the sense that the network response to any input perturbation will be smooth.
and time-lagged recurrent networks, with feedforward networks and single-layer, linear networks as special cases. Our smoothing regularizer for linear networks has the same form as standard weight decay. The regularizer developed depends only on the network parameters and can easily be used. A series of empirical tests has demonstrated the efficacy of this regularizer and its superior performance relative to standard quadratic weight decay. Empirical results show that the smoothing regularizer yields a symmetric α-stable (SαS) weight distribution, whereas standard quadratic weight decay produces a normal distribution. We therefore believe that our smoothing regularizer provides a more reasonable constraint during training than standard weight decay does. Our regularizer keeps "essential" weights large as needed and, at the same time, makes "nonessential" weights assume values near zero.

We conclude with several additional comments. As described in equation 2.19, to bound the first derivatives of the activation functions in the hidden units, we have used their maximal value, without considering different nonlinearities for different nodes and ignoring their changes with time. We have extended our smoothing regularizer to take these factors into account. Due to the page limit, we cannot include these extensions in this paper. In the simulations conducted in this paper, we have fully searched over the regularization parameter values by using a validation data set. This helps us observe the effect of the regularization parameter, but it is time consuming if the network and the training data set are large. Stability is another big issue for recurrent neural networks. There is a large literature on this topic, for example, Hirsch (1989) and Kuan et al. (1994). In our derivation of the regularizer, we have found that γ‖W‖ < 1 must hold to ensure that the effects of small input perturbations are damped out.
This inequality can be used for diagnosing the stability of trained networks, as shown in Figure 5. It can also be appended to our training criterion, equation 2.17, as an additional constraint. Werbos and Titus (1978) proposed the following cost function for their pure robust model
(4.1)
In the model, w was, in fact, a weight parameter in a feedback connection from the output to the input, but it was predefined in the range 0 < w < 1 and kept fixed after being defined. Werbos and Titus's new cost function actually had a similar effect to our smoothing regularizer. As they claimed, the biggest advantage of their new cost function was its ability to shift smoothly in different environments.
Appendix A: Neglecting the Cross-Covariance

We neglect the cross-covariance term in equation 2.8,

    (2/N) Σ_{t=1}^{N} {Z(t) − F[Φ_P, I(t)]} {F[Φ_P, I(t)] − F[Φ_P, Ĩ(t)]}    (A.1)
for two reasons. First, its expectation value will be small, and second, its value can be rigorously bounded with no qualitative change in our proposed training objective function in equation 2.9. Noting that the target noise is uncorrelated with the input perturbations, and assuming that model bias can be neglected, the expectation value of equation A.1 taken over possible training sets will be small.
Note here that Φ_P are the weights obtained after training. In addition, the expectation value of equation A.1 taken over the input perturbations will be zero to first order in the perturbations (A.3).
Of course, many nonlinear dynamic systems have positive Lyapunov exponents, and so the second-order and higher-order effects in these cases cannot be ignored. Although its expectation value will be small, the cross-covariance term in equation 2.8 can be rigorously bounded. Using the inequality 2ab ≤ a² + b², we obtain
    (2/N) Σ_{t=1}^{N} {Z(t) − F[Φ_P, I(t)]} {F[Φ_P, I(t)] − F[Φ_P, Ĩ(t)]}
        ≤ (1/N) Σ_{t=1}^{N} {Z(t) − F[Φ_P, I(t)]}² + (1/N) Σ_{t=1}^{N} {F[Φ_P, I(t)] − F[Φ_P, Ĩ(t)]}²
        = D_P + (1/N) Σ_{t=1}^{N} {F[Φ_P, I(t)] − F[Φ_P, Ĩ(t)]}²    (A.4)
Minimizing the first term D_P and the second term (1/N) Σ_{t=1}^{N} {F[Φ_P, I(t)] − F[Φ_P, Ĩ(t)]}² in equation 2.8 during training will thus automatically decrease the effect of the cross-covariance term. Using this bound, instead of the small-expectation-value approximation, will in effect multiply the first two terms in equation 2.8 by a factor of 2. However, this amounts to an irrelevant scaling factor and can be dropped. Thus, our proposed training objective function, equation 2.9, will remain unchanged.
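The elementary inequality 2ab ≤ a² + b² that drives the bound in equation A.4 holds termwise, so it also holds for sample averages. The following numeric check is our own illustration.

```python
import random

random.seed(1)
a_vals = [random.gauss(0, 1) for _ in range(1000)]
b_vals = [random.gauss(0, 1) for _ in range(1000)]

# Average of the cross term 2ab versus the average of a^2 + b^2.
lhs = 2 * sum(a * b for a, b in zip(a_vals, b_vals)) / len(a_vals)
rhs = (sum(a * a for a in a_vals) + sum(b * b for b in b_vals)) / len(a_vals)
assert lhs <= rhs  # 2ab <= a^2 + b^2 termwise, since (a - b)^2 >= 0
```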
Appendix B: Output Sensitivity of a Trained Network to Its Input Perturbation

For a recurrent network of the form given by equation 2.15:

    Y(t) = f[W Y(t − τ) + V X(t)],    Z(t) = U Y(t)    (B.1)
this appendix studies the output perturbation¹²

    σ_o²(t) = ‖Z̃(t) − Z(t)‖²    (B.2)

in response to an input perturbation

    σ_x²(t) = ‖X̃(t) − X(t)‖²    (B.3)
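The recurrent map of equation B.1 can be sketched as follows, assuming τ = 1 and f = tanh; the tiny dimensions and weight values are illustrative assumptions of ours.

```python
import math

def step(W, V, U, y_prev, x):
    """One step of Y(t) = f[W Y(t-1) + V X(t)], Z(t) = U Y(t) with f = tanh."""
    o = [sum(W[i][j] * y_prev[j] for j in range(len(y_prev)))
         + sum(V[i][k] * x[k] for k in range(len(x)))
         for i in range(len(W))]          # hidden-unit inputs O(t)
    y = [math.tanh(oi) for oi in o]       # hidden states Y(t)
    z = [sum(U[m][i] * y[i] for i in range(len(y))) for m in range(len(U))]
    return y, z

W = [[0.2, -0.1], [0.0, 0.3]]   # hidden-to-hidden (N_h x N_h)
V = [[0.5], [-0.4]]             # input-to-hidden  (N_h x N_i)
U = [[1.0, 1.0]]                # hidden-to-output (N_o x N_h)
y, z = step(W, V, U, [0.0, 0.0], [1.0])
assert len(y) == 2 and len(z) == 1
```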
The output perturbation will depend on the weight parameter matrices U, V, and W. The sizes of U, V, and W are N_o × N_h, N_h × N_i, and N_h × N_h, respectively, where N_o, N_h, and N_i are the numbers of output, hidden, and input units. By expressing the inputs to the hidden units as an N_h-dimensional column vector

    O(t) = [o_1(t), …, o_{N_h}(t)]ᵀ = W Y(t − τ) + V X(t)    (B.4)
and using the mean value theorem,¹³ we get

    f[Õ(t)] − f[O(t)] = f′[O*(t)] [Õ(t) − O(t)]    (B.5)

where f[O(t)] = (f_1[o_1(t)], …, f_{N_h}[o_{N_h}(t)])ᵀ, f[Õ(t)] = (f_1[õ_1(t)], …, f_{N_h}[õ_{N_h}(t)])ᵀ, and f′[O*(t)] is a diagonal matrix with elements {f′[O*(t)]}_{jj} = f_j′[o_j*(t)]. Here f_j′(·) is the first derivative of f_j(·), and min[õ_j(t), o_j(t)] ≤ o_j*(t) ≤ max[õ_j(t), o_j(t)].
With Schwarz's inequality, the output disturbance can be expressed as

    σ_o²(t) = ‖U f[Õ(t)] − U f[O(t)]‖² ≤ γ² ‖U‖² ‖Õ(t) − O(t)‖²    (B.6)
For a feedforward network, O(t) = V X(t), so

    ‖Õ(t) − O(t)‖² ≤ ‖V‖² σ_x²(t)    (B.7)

and we obtain

    σ_o²(t) ≤ (γ‖U‖‖V‖)² σ_x²(t)    (B.8)
¹²The time-varying quantities in equations B.2 and B.3 should not be confused with their associated ensemble averages σ_o² and σ_x² defined in Section 2.
¹³See, for example, Korn and Korn (1968). Also, assume that all f_j(·) are continuous and continuously differentiable.
We now consider the case of recurrent networks. A recurrent network usually satisfies the following dynamic function:¹⁴

    τ dO(t)/dt = W f[O(t)] − O(t) + V X(t + τ)    (B.9)

where dO(t)/dt = [do_1(t)/dt, …, do_{N_h}(t)/dt]ᵀ. If we define

    σ_O²(t) = ‖Õ(t) − O(t)‖²    (B.10)
then

    dσ_O²(t)/dt = 2 σ_O(t) dσ_O(t)/dt    (B.11)

With equation B.9, by assuming σ_O(t) > 0 and that the perturbed trajectory Õ(t) obeys the same dynamics, we have

    dÕ(t)/dt − dO(t)/dt = (1/τ){W [f[Õ(t)] − f[O(t)]] − [Õ(t) − O(t)] + V [X̃(t + τ) − X(t + τ)]}    (B.12)
We get

    dσ_O²(t)/dt = (2/τ){[Õ(t) − O(t)]ᵀ W [f[Õ(t)] − f[O(t)]] − ‖Õ(t) − O(t)‖² + [Õ(t) − O(t)]ᵀ V [X̃(t + τ) − X(t + τ)]}    (B.13)
Using the mean value theorem and Schwarz's inequality again, we obtain the following equations:

    ‖[Õ(t) − O(t)]ᵀ W {f[Õ(t)] − f[O(t)]}‖ ≤ γ‖W‖ σ_O²(t)    (B.14)

for the first term on the right-hand side of equation B.13, and

    ‖[Õ(t) − O(t)]ᵀ V [X̃(t + τ) − X(t + τ)]‖ ≤ σ_O(t) ‖V‖ σ_x(t + τ)    (B.15)
for the third term of equation B.13. During the evolution process of the network, the input perturbation σ_x(t) is assumed to be constant or to change more slowly than σ_O(t). This is true when τ is small.¹⁵ Therefore, σ_x(t) is replaced by σ_x in

¹⁴We can obtain O(t + τ) = W Y(t) + V X(t + τ) and Y(t) = f[O(t)] by substituting the approximation dO(t)/dt ≈ [O(t + τ) − O(t)]/τ into equation B.9. Here, we assume that τ is small. Note that such a dynamic function has also been used to describe the evolution process of recurrent networks by other researchers, for example, Pineda (1988) and Pearlmutter (1989).
¹⁵See footnote 6 for justification.
the following derivation. With equations B.14 and B.15, equation B.13 becomes

    dσ_O²(t)/dt ≤ (2/τ)[(γ‖W‖ − 1) σ_O²(t) + σ_O(t) ‖V‖ σ_x]    (B.16)

or

    dσ_O(t)/dt ≤ (1/τ)[(γ‖W‖ − 1) σ_O(t) + ‖V‖ σ_x]    (B.17)

due to σ_O(t) > 0. For notational clarity, define

    a = γ‖W‖ − 1,    and    b = ‖V‖    (B.18)

so that equation B.17 becomes

    dσ_O(t)/dt ≤ [a σ_O(t) + b σ_x]/τ    (B.19)
Integration of equation B.19 from t − 1 to t yields the solutions

    σ_O(t) ≤ e^{a/τ} σ_O(t − 1) + (b/a)(e^{a/τ} − 1) σ_x    (B.20)
    σ_O(t) ≤ e^{at/τ} σ_O(0) + (b/a)(e^{at/τ} − 1) σ_x    (B.21)

One sees that σ_O(t) depends on the current input perturbation σ_x as well as on its previous value σ_O(t − 1); σ_O(t − 1) again depends on its previous values, so the current σ_O(t) depends on all previous input perturbations and on the whole evolution process of the network. One also sees that, to ensure stability and that the effects of small input perturbations are damped out, the following inequality is required:

    a < 0,  or equivalently  γ‖W‖ < 1    (B.22)

By replacing σ_x(t) back, we can rewrite equation B.20 as

    σ_O(t) ≤ e^{a/τ} σ_O(t − 1) + (b/a)(e^{a/τ} − 1) σ_x(t)    (B.23)
The first term on the right-hand side is the zero-input response [when σ_x(t) = 0]; the second term is the zero-memory response [when σ_O(t − 1) = 0]. If we can minimize the zero-memory response at every time step t, then σ_O(0), σ_O(1), …, σ_O(t − 1) will all be small. Moreover, due to its monotonically decreasing response function, the zero-input response will damp out. Therefore, the zero-input response can be ignored and we can focus only on how to minimize the zero-memory response of σ_O(t). The zero-memory responses of σ_O(t) in equations B.21 and B.23 become

    σ_O(t) ≤ (b/a)(e^{at/τ} − 1) σ_x    (B.24)
    σ_O(t) ≤ (b/a)(e^{a/τ} − 1) σ_x(t)    (B.25)
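The damping required by equation B.22 can be illustrated by iterating the zero-input part of the recursion above; this is our numeric sketch, and the parameter values are arbitrary.

```python
import math

gamma_W, tau = 0.5, 1.0            # assumed values of gamma*||W|| and tau
a = gamma_W - 1.0                  # a < 0 exactly when gamma*||W|| < 1
decay = math.exp(a / tau)          # per-step factor on sigma_O in B.23

sigma = 1.0                        # initial perturbation, no further input
history = []
for _ in range(20):
    sigma = decay * sigma          # zero-input response: geometric decay
    history.append(sigma)

assert history[-1] < 1e-3          # perturbation damped out, as B.22 requires
```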
Figure 6: Change of G_τ(γ‖W‖) as a function of γ‖W‖ and τ. G_τ(γ‖W‖) is defined by equation B.28.
Substituting equations B.24 and B.25 into equation B.6, along with the definitions of a and b in equation B.18, we obtain

    σ_o²(t) ≤ [γ‖U‖ (b/a)(e^{at/τ} − 1)]² σ_x²    (B.26)
    σ_o²(t) ≤ [γ‖U‖ (b/a)(e^{a/τ} − 1)]² σ_x²(t)    (B.27)

Figure 6 plots the function

    G_τ(γ‖W‖) = (e^{(γ‖W‖ − 1)/τ} − 1)/(γ‖W‖ − 1)    (B.28)

with γ‖W‖ < 1 and τ = 0.1, 0.5, 1, 2. The figure depicts the effect of γ‖W‖ and τ on the regularizer. As shown, the regularizer becomes more and more sensitive to the change of the recurrent weights as the time delay τ decreases. When τ ≈ 0, σ_o²(t) is bounded by

    σ_o²(t) ≤ [γ‖U‖‖V‖ / (1 − γ‖W‖)]² σ_x²(t)    (B.29)
When W = 0 and τ = 0, the model becomes a feedforward network, and the reduced form of equations B.26 and B.29 with W = 0 and τ = 0 is the same as equation B.8. Therefore, the forms for feedforward networks and for single-layer, linear networks can also be expressed by equations B.26 and B.29 as special cases. By defining

    ρ_τ(Φ) = γ‖U‖‖V‖ (1 − e^{−(1 − γ‖W‖)/τ}) / (1 − γ‖W‖)    (B.30)
    ρ_0(Φ) = γ‖U‖‖V‖ / (1 − γ‖W‖)    (B.31)

we obtain

    σ_o²(t) ≤ ρ_τ²(Φ) σ_x²(t)    (B.32)
    σ_o²(t) ≤ ρ_0²(Φ) σ_x²(t)    (B.33)

Therefore, the network output disturbance σ_o²(t) due to its input perturbation σ_x²(t) can be approximated by equations B.32 and B.33. This concludes the derivation of equations 2.17, 2.18, and 2.21.
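A sketch of the two bounding factors derived above (the names rho_tau and rho_0 are ours; γ = 1 is assumed for tanh hidden units):

```python
import math

def rho_tau(gamma, nU, nV, nW, tau):
    """Finite-tau bounding factor: gamma*||U||*||V|| scaled by the
    geometric-decay term of the recurrent dynamics."""
    assert gamma * nW < 1.0, "stability requires gamma*||W|| < 1"
    return gamma * nU * nV * (1.0 - math.exp(-(1.0 - gamma * nW) / tau)) \
        / (1.0 - gamma * nW)

def rho_0(gamma, nU, nV, nW):
    """tau -> 0 limit: the factor of the B.29 bound."""
    return gamma * nU * nV / (1.0 - gamma * nW)

# As tau -> 0, the finite-tau factor approaches the tau = 0 form.
assert abs(rho_tau(1.0, 2.0, 3.0, 0.5, 1e-3) - rho_0(1.0, 2.0, 3.0, 0.5)) < 1e-6
```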
Acknowledgments We would like to thank T. Leen and the reviewers for their valuable comments on our first manuscript, T. Rognvaldsson and S. Rehfuss for proofreading our revised manuscript, and F. Pineda of Johns Hopkins University for discussing equation B.9 with us. We also thank the other members of the Neural Network Research Group at OGI for their various suggestions.
References

Abu-Mostafa, Y. 1995. Hints. Neural Comp. 7(4), 639-671.
Amari, S. 1972. Characteristics of random nets of analog neural-like elements. IEEE Trans. Systems, Man Cybern. SMC-2, 643-653.
Bishop, C. 1993. Curvature-driven smoothing: A learning algorithm for feedforward networks. IEEE Trans. Neural Networks 4(5), 882-884.
Bishop, C. 1995. Training with noise is equivalent to Tikhonov regularization. Neural Comp. 7(1), 108-116.
Chauvin, Y. 1990. Dynamic behavior of constrained back-propagation networks. In Advances in Neural Information Processing Systems 2, D. Touretzky, ed., pp. 642-649. Morgan Kaufmann, San Mateo, CA.
Elman, J. 1990. Finding structure in time. Cog. Sci. 14, 179-211.
Eubank, R. L. 1988. Spline Smoothing and Nonparametric Regression. Marcel Dekker, New York.
Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Comp. 4(1), 1-58.
Girosi, F., Jones, M., and Poggio, T. 1995. Regularization theory and neural networks architectures. Neural Comp. 7, 219-269.
Grossberg, S. 1969. On learning and energy-entropy dependence in recurrent and nonrecurrent signed networks. J. Stat. Phys. 1, 319-350.
Hastie, T. J., and Tibshirani, R. J. 1990. Generalized Additive Models, Vol. 43 of Monographs on Statistics and Applied Probability. Chapman and Hall, London.
Hirsch, M. 1989. Convergent activation dynamics in continuous time networks. Neural Networks 2(5), 331-349.
Hoerl, A., and Kennard, R. 1970a. Ridge regression: Applications to nonorthogonal problems. Technometrics 12, 69-82.
Hoerl, A., and Kennard, R. 1970b. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12, 55-67.
Hopfield, J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Korn, G., and Korn, T., eds. 1968. Mathematical Handbook for Scientists and Engineers. McGraw-Hill, New York.
Kuan, C., Hornik, K., and White, H. 1994. A convergence result for learning in recurrent neural networks. Neural Comp. 6(3), 420-440.
Leen, T. 1995. From data distributions to regularization in invariant learning. Neural Comp. 7(5), 974-981.
Levin, A. U., Leen, T. K., and Moody, J. E. 1994. Fast pruning using principal components. In Advances in Neural Information Processing Systems 6, J. Cowan, G. Tesauro, and J. Alspector, eds., pp. 35-42. Morgan Kaufmann, San Mateo, CA.
Marcus, C., and Westervelt, R. 1989. Dynamics of analog neural networks with time delay. In Advances in Neural Information Processing Systems 1, D. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Moody, J., and Rognvaldsson, T. 1995. Smoothing regularizers for feed-forward neural networks. Tech. Rep., Oregon Graduate Institute.
Moody, J. E., and Utans, J. 1992. Principled architecture selection for neural networks: Application to corporate bond rating prediction. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds., pp. 683-690. Morgan Kaufmann, San Mateo, CA.
Moody, J. E., and Yarvin, N. 1992. Networks with learned unit response functions. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds., pp. 1048-1055. Morgan Kaufmann, San Mateo, CA.
Moody, J., Levin, U., and Rehfuss, S. 1993. Predicting the U.S. index of industrial production. In Proceedings of the 1993 Parallel Applications in Statistics and Economics Conference, Zeist, The Netherlands. Special issue of Neural Network World 3(6), 791-794.
Nowlan, S., and Hinton, G. 1992. Simplifying neural networks by soft weight-sharing. Neural Comp. 4(4), 473-493.
Pearlmutter, B. 1989. Learning state space trajectories in recurrent neural networks. Neural Comp. 1(2), 261-269.
Pineda, F. 1988. Dynamics and architecture for neural computation. J. Complexity 4, 216-245.
Pineda, F. 1989. Recurrent backpropagation and the dynamical approach to adaptive neural computation. Neural Comp. 1(2), 161-172.
Plaut, D., Nowlan, S., and Hinton, G. 1986. Experiments on Learning by Back Propagation. Tech. Rep. CMU-CS-86-126, Carnegie Mellon University, Pittsburgh, PA.
Poggio, T., and Girosi, F. 1990. Networks for approximation and learning. IEEE Proc. 78(9), 1481-1497.
Powell, M. 1987. Radial basis functions for multivariable interpolation: A review. In Algorithms for Approximation, J. Mason and M. Cox, eds. Clarendon Press, Oxford.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. Rumelhart and J. McClelland, eds., Chap. 8, pp. 319-362. MIT Press, Cambridge, MA.
Scalettar, R., and Zee, A. 1988. Emergence of grandmother memory in feedforward networks: Learning with noise and forgetfulness. In Connectionist Models and Their Implications: Readings from Cognitive Science, D. Waltz and J. Feldman, eds. Ablex.
Sejnowski, T. 1977. Storing covariance with nonlinearly interacting neurons. J. Math. Biol. 4, 303-321.
Shao, M., and Nikias, C. 1993. Signal processing with fractional lower order moments: Stable processes and their applications. IEEE Proc. 81(7), 986-1010.
Sjöberg, J., and Ljung, L. 1992. Overtraining, regularization and searching for minimum in neural nets. In Preprint 4th IFAC Symposium on Adaptive Systems in Control and Signal Processing, pp. 669-674.
Sjöberg, J., and Ljung, L. 1995. Overtraining, regularization and searching for minimum with application to neural nets. International J. Control (submitted).
Tikhonov, A. N., and Arsenin, V. I. 1977. Solutions of Ill-Posed Problems. Winston, New York. Distributed solely by Halsted Press. Scripta Series in Mathematics. Translation editor, Fritz John.
Tresp, V., Hollatz, J., and Ahmad, S. 1993. Network structuring and training using rule-based knowledge. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 871-878. Morgan Kaufmann, San Mateo, CA.
Wahba, G. 1990. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics.
Weigend, A., Rumelhart, D., and Huberman, B. 1990. Back-propagation, weight-elimination and time series prediction. In Proceedings of the Connectionist Models Summer School, T. Sejnowski, G. Hinton, and D. Touretzky, eds., pp. 105-116. Morgan Kaufmann, San Mateo, CA.
Werbos, P. 1992. Neurocontrol and supervised learning: An overview and evaluation. In Handbook of Intelligent Control, D. White and D. Sofge, eds. Van Nostrand Reinhold, New York.
Werbos, P., and Titus, J. 1978. An empirical test of new forecasting methods derived from a theory of intelligence: The prediction of conflict in Latin America. IEEE Trans. Syst. Man Cybern. SMC-8(9), 657-666.
Received June 2, 1994; accepted May 30, 1995.
Communicated by Richard Lippmann
NOTE
Note on the Maxnet Dynamics

John P. F. Sum, Peter K. S. Tam
Department of Electronic Engineering, Hong Kong Polytechnic University, Hung Hom, Kowloon
A simple method is presented to derive the complete solution of the Maxnet network dynamics. In addition, the exact response time of the network is deduced.

1 Introduction
Since the beginning of neural network research, the winner-take-all network has played a very important role in the design of learning algorithms, in particular most of the unsupervised learning algorithms (Pao 1989), such as competitive learning, the self-organizing map, and the adaptive resonance theory model. Conventionally, an N-neuron winner-take-all (WTA) network is defined as follows:

    lim_{t→∞} h_i[v_i(t)] = { 1  if v_i(0) > v_j(0) for all j ≠ i
                            { 0  otherwise    (1.1)
Many researchers have attempted to design and realize the WTA. In Lippman (1987), a discrete-time algorithm called Maxnet is proposed. Maxnet is a fully connected neural network. Each neuron's output is positively fed back to its own input and negatively fed back to the other neurons' inputs. Suppose there are N neurons. We denote their state variables and outputs by v_i(t) and h_i, where i = 1, 2, …, N, respectively. The dynamics of the Maxnet are given by
    v_i(t + 1) = h_i[v_i(t) − ε Σ_{j≠i} v_j(t)]    (1.2)

where

    0 < ε < 1/(N − 1)    (1.3)

and h_i(x) is a function of x defined as follows. For all i = 1, 2, …, N,

    h_i(x) = { x  if x > 0
             { 0  if x ≤ 0    (1.4)

Neural Computation 8, 491-499 (1996)    © 1996 Massachusetts Institute of Technology
In compact form, equation 1.2 can be written as v(t + 1) = h [Av(t)]
(1.5)
where v(t) = [v_1(t), v_2(t), …, v_N(t)]ᵀ and

    A = (  1   −ε   …   −ε
          −ε    1   …   −ε
           ⋮    ⋮   ⋱    ⋮
          −ε   −ε   …    1 )    (1.6)
Without loss of generality, we introduce the indices π_1, π_2, …, π_N to represent the neurons with initial states in ascending order, i.e.,

    0 < v_{π_1}(0) < v_{π_2}(0) < ⋯ < v_{π_N}(0)
We assume that no two neurons have the same initial state. The convergence property of the Maxnet was first introduced by Lippman (1987). This may be stated as: the convergence of the Maxnet is assured if 0 < ε < 1/N. Recently, Floreen (1991) and Koutroumbas and Kalouptsidis (1994) studied this intensively. In Koutroumbas and Kalouptsidis (1994), a general updating mode that allows partially parallel operation is also discussed. In both of their papers, worst-case bounds on the response time of the network are derived. However, neither of them has derived the complete solution of the dynamics of the network. This paper provides a complete solution and a geometric interpretation of the network dynamics, which illuminates network behavior. We present the explicit solution of the network, based on an appropriate partitioning of the state space into regions in which the network behaves linearly. Similar techniques have been employed to obtain partial results for continuous-time WTA networks, such as Perfetti (1990) and Dempsey and McVey (1993). Furthermore, the piecewise linear approach was used to approximate the dynamic behavior of the Hopfield network in some applications, such as in Gee et al. (1993). In addition, based on linear modeling techniques, interesting properties have been derived for the Grossberg shunting network (Kosko 1992, pp. 94-99). A formula for calculating the exact response time of the network is deduced in the remainder of this paper. In the next section, some recent results on the properties of the Maxnet will be stated. Then the explicit solution of the network dynamics, based on eigensubspace analysis, is deduced in Section 3. Based on this explicit solution, the exact response time is also derived. Section 4 presents the conclusions.

2 Preliminary
In Floreen (1991) and Koutroumbas and Kalouptsidis (1994), some properties of the Maxnet have been studied and proven. Here, we state them without proof.
Maxnet Dynamics
Theorem 1.¹ The ascending order of the state variables is preserved by the dynamics: v_{π_i}(t) ≤ v_{π_j}(t) for all i < j and all t ≥ 0.

Theorem 2 (Lemma 3 of Koutroumbas and Kalouptsidis (1994)). If v_{π_N}(0) = v_{π_j}(0) for some j = 1, 2, ..., N − 1, then lim_{t→∞} v_{π_N}(t) = lim_{t→∞} v_{π_j}(t) = 0.

Theorem 3 (Theorem 1 of Koutroumbas and Kalouptsidis (1994)). lim_{t→∞} v_{π_i}(t) = 0 for all i = 1, 2, ..., N − 1.
Theorem 4.²
a. lim_{t→∞} v_{π_N}(t) exists and lim_{t→∞} v_{π_N}(t) ≥ 0. Equality holds when the condition in Theorem 2 is satisfied.
b. The limit of the network is attainable after a finite number of steps if v_{π_i}(0) ≠ v_{π_N}(0) for all i ≠ N.

For simplicity, let us define the terms neuron settling time and network response time.

Definition 1.
a. For all i = 1, 2, ..., N − 1, T_i is called the ith neuron settling time if v_{π_i}(t) = 0 for all t ≥ T_i.
b. T_rt is called the network response time if, for all i = 1, 2, ..., N − 1, v_{π_i}(t) = 0, and v_{π_N}(t + 1) = v_{π_N}(t), for all t ≥ T_rt.
According to Theorem 4, we obtain that (1) T_rt < ∞, and (2) v_{π_i}(t + 1) = v_{π_i}(t) for all i = 1, 2, ..., N and for all t ≥ T_rt. To visualize the overall picture given by the above theorems and definitions, a simple example is presented below.

Example. Suppose that a Maxnet consists of 3 neurons and their initial state variables are v_1(0) = 7, v_2(0) = 5, and v_3(0) = 9. Hence π_1 = 2, π_2 = 1, and π_3 = 3. We set ε = 0.25. Figure 1 shows the values of v_1, v_2, and v_3 as functions of t. From the graphs, we observe that the settling

¹Lemma 1 of Floreen (1991) and Lemma 2 of Koutroumbas and Kalouptsidis (1994).
²Theorem 1 of Floreen (1991) and Theorem 3 of Koutroumbas and Kalouptsidis (1994).
John P. F. Sum and Peter K. S. Tam
Figure 1: Changes of v_1 (solid line), v_2 (dot-dash line), and v_3 (dash line) against time.
times of the neurons are 2 and 4 steps, and the network response time is 4 steps. Therefore, T_1 = 2, T_2 = 4, and T_rt = 4.
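This example can be checked with a short simulation. The sketch below assumes the standard fully parallel Maxnet update, v_i(t+1) = max(0, v_i(t) − ε Σ_{j≠i} v_j(t)); the function names are illustrative, not from the paper.

```python
def maxnet_step(v, eps):
    """One fully parallel Maxnet update: each unit subtracts eps times
    the summed activation of all other units, clipped at zero."""
    total = sum(v)
    return [max(0.0, x - eps * (total - x)) for x in v]

def settling_and_response(v, eps, max_steps=1000):
    """Return ([T_1, ..., T_{N-1}] in pi-order, network response time)."""
    order = sorted(range(len(v)), key=lambda i: v[i])   # pi-ordering
    settle = {}
    for t in range(1, max_steps + 1):
        v = maxnet_step(v, eps)
        for i, x in enumerate(v):
            if x == 0.0 and i not in settle:
                settle[i] = t
        if len(settle) == len(v) - 1:   # all losing units have settled
            break
    T = [settle[i] for i in order[:-1]]
    return T, T[-1]                     # T_rt equals T_{N-1}

# the 3-neuron example: v(0) = (7, 5, 9), eps = 0.25
T, T_rt = settling_and_response([7.0, 5.0, 9.0], 0.25)
assert T == [2, 4] and T_rt == 4
```

The winner's value is constant once every loser has been clipped to zero, which is why the response time coincides with the last settling time.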
To proceed to our main result, we need the following corollary.

Corollary 1.
a. If v_{π_i}(0) ≠ v_{π_j}(0) for all i ≠ j, then 0 < T_1 < T_2 < ... < T_{N−1} = T_rt < ∞.
b. If v_{π_i}(0) = v_{π_{i+1}}(0) for some i ≠ N − 1, then 0 < T_1 < ... < T_i = T_{i+1} < ... < T_{N−1} = T_rt < ∞.

Proof of Corollary 1. (a) The proof is by contradiction. Suppose that T_{i+1} < T_i. Then we can establish that v_{π_i}(t) > v_{π_{i+1}}(t) = 0 at T_{i+1} ≤ t < T_i. But this contradicts Theorem 1. As a result, T_i < T_{i+1} for all i = 1, 2, ..., N − 2. Together with Theorem 4, we conclude that 0 < T_1 < T_2 < ... < T_{N−1} = T_rt < ∞. (b) Since v_{π_{i+1}}(t + 1) − v_{π_i}(t + 1) = (1 + ε)^{t+1}[v_{π_{i+1}}(0) − v_{π_i}(0)] for all t ≥ 0, v_{π_i}(T_i) = 0 implies that v_{π_{i+1}}(T_i) = 0. By definition, T_i = T_{i+1}. Hence the proof is completed. □
In the next section, we proceed to derive the solution of the Maxnet using the above corollary.
3 Network Dynamics
The dynamics of the network during 0 ≤ t < T_1 is given by

v_N(t + 1) = A_N v_N(t)    (3.1)

where v_N(t) = [v_{π_1}(t), v_{π_2}(t), ..., v_{π_N}(t)]^T and A_N is an N × N matrix with diagonal elements equal to 1 and off-diagonal elements equal to −ε. During 0 ≤ t < T_1, the network may therefore be regarded as a linear discrete-time, time-invariant system. The solution of this system can then be obtained by evaluating the eigenvalues and eigensubspaces of A_N.
Lemma 1. The eigenvalues of A_N are [1 − (N − 1)ε] and (1 + ε). The corresponding eigensubspaces of [1 − (N − 1)ε] and (1 + ε) are M_N and M_N^⊥, respectively, where

M_N = {v ∈ R^N | v = c e_{1N}, c ∈ R},   e_{1N} = (1/√N, 1/√N, ..., 1/√N)^T

and

M_N^⊥ = {v ∈ R^N | v^T e_{1N} = 0}

Proof. Let x_i = 1/√N for all i, i.e., x ∈ M_N. Then

(A_N x)_i = x_i − ε Σ_{j≠i} x_j = [1 − (N − 1)ε] x_i

so that [1 − (N − 1)ε] is an eigenvalue of A_N. Next consider

w = v − (v^T e_{1N}) e_{1N},   w ∈ M_N^⊥    (3.3)

Since Σ_j w_j = 0, we have (A_N w)_i = w_i − ε Σ_{j≠i} w_j = (1 + ε) w_i, i.e., A_N w = (1 + ε)w. Hence the other eigenvalue is (1 + ε). Moreover, w can be replaced by any of the N − 1 independent zero-sum vectors of the form (1, −1, 0, ..., 0)^T, (1, 0, −1, ..., 0)^T, ..., which span M_N^⊥.
Therefore, it can be concluded that [1 − (N − 1)ε] and (1 + ε) are the only eigenvalues of A_N, because dim(M_N) + dim(M_N^⊥) = N. □
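Lemma 1 is easy to check numerically. The sketch below builds A_N as plain nested lists (no external libraries) and verifies both eigenvalue claims; the helper names are illustrative.

```python
def make_A(N, eps):
    """A_N: diagonal entries 1, off-diagonal entries -eps."""
    return [[1.0 if i == j else -eps for j in range(N)] for i in range(N)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

N, eps = 5, 0.1
A = make_A(N, eps)

# direction of e_1N (all components equal): eigenvalue 1 - (N-1)*eps
e = [1.0] * N
lam = 1.0 - (N - 1) * eps
assert all(abs(y - lam * x) < 1e-12 for x, y in zip(e, matvec(A, e)))

# any zero-sum vector lies in M_N-perp: eigenvalue 1 + eps
w = [1.0, -1.0, 0.0, 0.0, 0.0]
assert all(abs(y - (1.0 + eps) * x) < 1e-12 for x, y in zip(w, matvec(A, w)))
```

The check works for any zero-sum w because A_N = (1 + ε)I − ε·11^T, and the all-ones rank-one term annihilates vectors orthogonal to e_{1N}.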
With the aid of Lemma 1, the solution of the dynamics equation 3.1 can be written as

v_N(t + 1) = [1 − (N − 1)ε][v_N^T(t) e_{1N}] e_{1N} + (1 + ε){v_N(t) − [v_N^T(t) e_{1N}] e_{1N}}    (3.4)

That is to say, for all i = 1, 2, ..., N,

v_{π_i}(t) = [1 − (N − 1)ε]^t ⟨v_N(0)⟩ + (1 + ε)^t [v_{π_i}(0) − ⟨v_N(0)⟩]    (3.5)

for all 0 ≤ t ≤ T_1, where ⟨v_N(0)⟩ = (1/N) Σ_{j=1}^{N} v_{π_j}(0) is the mean of the components of v_N(0). This is the exact solution of the Maxnet in the time interval 0 ≤ t ≤ T_1. Furthermore, the settling time of the π_1th neuron is given by

T_1 = ⌈ log(⟨v_N(0)⟩ / (⟨v_N(0)⟩ − v_{π_1}(0))) / log((1 + ε) / (1 − (N − 1)ε)) ⌉    (3.6)
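The closed-form solution can be cross-checked against direct iteration of the linear map 3.1 (valid while no component has been clipped at zero); the function names below are ours, not the paper's.

```python
def closed_form(v0, eps, t):
    """Eq. 3.5: the mean component is scaled by [1-(N-1)*eps]^t,
    the deviation from the mean by (1+eps)^t."""
    N = len(v0)
    m = sum(v0) / N
    return [(1.0 - (N - 1) * eps) ** t * m + (1.0 + eps) ** t * (x - m)
            for x in v0]

def linear_step(v, eps):
    """One step of v(t+1) = A_N v(t), without the clipping nonlinearity."""
    total = sum(v)
    return [x - eps * (total - x) for x in v]

v0, eps = [7.0, 5.0, 9.0], 0.25
u = list(v0)
for t in range(1, 3):          # iterate up to t = T_1 = 2 for this example
    u = linear_step(u, eps)
    cf = closed_form(v0, eps, t)
    assert all(abs(a - b) < 1e-9 for a, b in zip(u, cf))
```

The agreement is exact (up to rounding) because the decomposition into mean and zero-sum parts diagonalizes A_N, as Lemma 1 states.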
Once v_{π_1} reaches zero, the corresponding output will also be zero. After T_1, the network dynamics can be modeled in a lower dimensional space. There are two cases to be considered: (1) v_{π_2}(T_1) = 0 and (2) v_{π_2}(T_1) > 0. For case (1), we can simply skip the time interval T_1 ≤ t < T_2 and proceed to consider the dynamics of the network in the time interval T_2 ≤ t < T_3. In case (2), we can denote

v_{N−1}(t) = [v_{π_2}(t), v_{π_3}(t), ..., v_{π_N}(t)]^T

and consider the dynamics as

v_{N−1}(t + 1) = A_{N−1} v_{N−1}(t)
Since A_{N−1} is defined in the same way as A_N except that the dimension is N − 1, we can follow the same principle applied to the derivation of equation 3.5 and Lemma 1 to deduce that

v_{π_i}(t) = [1 − (N − 2)ε]^{t−T_1} ⟨v_{N−1}(T_1)⟩ + (1 + ε)^{t−T_1} [v_{π_i}(T_1) − ⟨v_{N−1}(T_1)⟩]    (3.7)

for all i = 2, 3, ..., N and
for all T_1 ≤ t < T_2. Here, ⟨v_{N−1}(T_1)⟩ = (1/(N − 1)) Σ_{i=2}^{N} v_{π_i}(T_1) is the mean of the remaining state variables.
Repeating the same procedure, we can derive the general solution of the Maxnet for all time t ≥ 0. Denoting

v_{N−k}(t) = [v_{π_{k+1}}(t), v_{π_{k+2}}(t), ..., v_{π_N}(t)]^T

the general solution of the network at time T_k ≤ t < T_{k+1} is given by

v_{π_i}(t) = [1 − (N − k − 1)ε]^{t−T_k} ⟨v_{N−k}(T_k)⟩ + (1 + ε)^{t−T_k} [v_{π_i}(T_k) − ⟨v_{N−k}(T_k)⟩]    (3.8)

for all i = k + 1, k + 2, ..., N, and

v_{π_i}(t) = 0    (3.9)

for all i = 1, 2, ..., k, where

⟨v_{N−k}(T_k)⟩ = (1/(N − k)) Σ_{i=k+1}^{N} v_{π_i}(T_k)
Besides, the settling times for the π_2, π_3, ..., π_{N−1} neurons can be obtained recursively by

T_{k+1} = T_k + ⌈ log(⟨v_{N−k}(T_k)⟩ / (⟨v_{N−k}(T_k)⟩ − v_{π_{k+1}}(T_k))) / log((1 + ε) / (1 − (N − k − 1)ε)) ⌉

Since v_{π_N}(t + 1) = v_{π_N}(t) whenever t ≥ T_{N−1}, the network response time is given by

T_rt = T_{N−1}

where T_0 = 0 and ⌈x⌉ is the smallest integer that is just greater than x.

4 Geometric Interpretation
Koutroumbas and Kalouptsidis (1994) present a brief geometric interpretation for two dynamic properties of the Maxnet: (1) once the initial state v_N is located on the hyperplane that bisects the angles between the reference axes, the limit vector will be the null vector, and (2) otherwise, the
Figure 2: The geometric interpretation of the dynamics of the Maxnet. x, y, and z correspond to three initial conditions that are located in three regions: A, B, and the line along e.

limit point will be on the axis that corresponds to the node π_N. Essentially, these properties can be easily visualized from equations 3.8, 3.9, and Lemma 1. To simplify the discussion, we describe the case of two neurons, but the interpretation can be extended to N neurons. From Lemma 1, it is observed that the component of v_2 that is parallel to e_{12} will decrease at a rate (1 − ε), while the component perpendicular to e_{12} will increase at a rate of (1 + ε). Figure 2 shows three situations, indicated by x, y, and z. x_1 and y_1 are the components of x and y that are parallel to e, whereas x_2 and y_2 are the components perpendicular to e. The lengths of the arrows indicate the corresponding rates of change. Consider x in region A: the magnitude of the change of x along e is εx_1, which is larger than εx_2. The resultant change of x points toward the axis v_2. Similarly, the resultant change of x will point toward the v_1 axis if x is located in the other A region. In region B, y_1 ≥ y_2, where equality holds only when y lies on the axis v_1. Therefore, the change of y along the direction of e is also larger than that along the direction perpendicular to e, and the resultant change of y is again toward one of the axes. Consider z, which is on the line of e; its resultant change points toward (0, 0). In summary, if v_1(0) > v_2(0) [v_2(0) > v_1(0)], then the limit point will be on the v_1 (v_2) axis. If v_1(0) = v_2(0), then the limit point will be (0, 0).
5 Conclusion

In this paper, we have derived the complete solution of the Maxnet. This solution provides an alternative approach to understanding the properties of the Maxnet. Besides, the exact response time is also deduced, as long as v_{π_N}(0) ≠ v_{π_{N−1}}(0). Since our derivation of the solution is based on the method of eigensubspace analysis, the geometric interpretation of the network dynamics can be described rigorously. Such a technique can be readily adapted to the analysis of similar WTA networks such as Imax (Yen and Chang 1992), Gemnet (Yang et al. 1995), and Selectron (Yen et al. 1994).
Acknowledgment We would like to thank an anonymous referee for valuable comments.
References

Dempsey, G. L., and McVey, E. S. 1993. Circuit implementation of a peak detector neural network. IEEE Transact. Circuits Systems-II 40(9), 585-591.
Floreen, P. 1991. The convergence of Hamming memory networks. IEEE Transact. Neural Networks 2(4), 449-457.
Gee, A. H., et al. 1993. An analytical framework for optimizing neural networks. Neural Networks 6(1), 79-97.
Kosko, B. 1992. Neural Networks and Fuzzy Systems. Prentice-Hall, Englewood Cliffs, NJ.
Koutroumbas, K., and Kalouptsidis, N. 1994. Qualitative analysis of the parallel and asynchronous modes of the Hamming network. IEEE Transact. Neural Networks 5(3), 380-391.
Lippmann, R. 1987. An introduction to computing with neural nets. IEEE ASSP Mag. 4, 4-22.
Pao, Y. 1989. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, MA.
Perfetti, R. 1990. Winner-take-all circuit for neurocomputing applications. IEE Proc. Part G 137(5), 353-359.
Yang, J., et al. 1995. A general mean-based iterative winner-take-all neural network. IEEE Transact. Neural Networks 6(1), 14-24.
Yen, J.-C., and Chang, S. 1992. Improved winner-take-all neural network. Electronics Lett. 28(7), 662-664.
Yen, J.-C., et al. 1994. A new winners-take-all architecture in artificial neural networks. IEEE Transact. Neural Networks 5(5), 838-843.
Received January 3, 1995; accepted July 10, 1995.
Communicated by Alain Destexhe
NOTE
Optimizing Synaptic Conductance Calculation for Network Simulations

William W. Lytton
Department of Neurology, University of Wisconsin, Wm. S. Middleton VA Hospital, 1300 University Ave., MSC 1720, Madison, WI 53706 USA

High computational requirements in realistic neuronal network simulations have led to attempts to realize implementation efficiencies while maintaining as much realism as possible. Since the number of synapses in a network will generally far exceed the number of neurons, simulation of synaptic activation may be a large proportion of total processing time. We present a consolidating algorithm based on a recent biophysically inspired simplified Markov model of the synapse. Use of a single lumped state variable to represent a large number of converging synaptic inputs results in substantial speed-ups.

1 Introduction
The computational demands of a single synapse in realistic neural simulations can equal the cost of several neuronal units in an artificial neural network. In particular, Markov models of synaptic activation are dynamic systems that may have 10 or more state variables. An alternative, the classical "alpha function" model (Rall 1967), is computationally cheap but lacks obvious biophysical correlates at the channel level (Destexhe, Mainen, and Sejnowski 1994b). Recently, a middle ground has been developed that preserves some major aspects of a biophysically realistic full Markov model at considerably less computational cost (Wang and Rinzel 1992; Destexhe et al. 1994a,b). Destexhe and co-workers demonstrated a minimal two-state model with a fundamental biophysical verisimilitude that used a simple implementation practical for network use. We will call this the "DMS model" after the authors' initials. Individual synapses in neuronal networks are generally represented as distinct entities that alter a conductance in the postsynaptic neuron after detecting some signal, typically voltage or calcium crossing a threshold, in the presynaptic neuron. This representation is widely used in the synaptic packages available with the major realistic neural simulators. All of the synapses of a single type are doing identical, potentially redundant, calculations, albeit at slightly different times. Srinivasan and Chiel (1993) previously demonstrated how multiple alpha functions could be consolidated by representing their summation in an iterated closed form. Neural Computation 8, 501-509 (1996)
© 1996 Massachusetts Institute of Technology
We present a similar consolidating algorithm that allows an efficient implementation of large numbers of DMS synapses. Rather than treating each synapse individually, we will lump all of the synapses of a given type (e.g., GABA_A or AMPA) converging onto a single compartment of one model neuron. These will then be represented by consolidated state variables and a single synaptic conductance and synaptic current. In the following, "single synapse" or "individual synapse" is used to describe the basic two-state DMS model. "Complex synapse" describes the lumped synapse model. Lower-case state variables and conductance (r, g) will be used for the former and upper-case (R, G) for the latter.

2 The DMS Algorithm
The first-order kinetic scheme was introduced by Destexhe et al. (1994a). The notation has been slightly modified for simplicity of description of the subsequent algorithms:

r_closed ⇌ r_open, with forward rate α and backward rate β    (2.1)

This model of a ligand-gated ion channel is comparable to the standard Hodgkin-Huxley parameterization for voltage-sensitive ion channels. The difference is that the α and β parameters are not functions of voltage. Instead, α is taken as a simple function of the transmitter concentration C:

α = ᾱ C    (2.2)

Transmitter concentration C is assumed to be given by a square wave of amplitude 1 and duration Cdur; β is a constant. Following the Hodgkin-Huxley notation, the kinetic scheme can be expressed as a first-order differential equation that solves for r in terms of R_∞ and τ_R in the usual way:

τ_R (dr/dt) = R_∞ − r    (2.3a)
R_∞ = α / (α + β),   τ_R = 1 / (α + β)    (2.3b)
The update rule derived from the analytic solution for a single time step Δt is

r ← R_∞(1 − e^{−Δt/τ_R}) + r e^{−Δt/τ_R}    (2.4)
(Note that this rule defines r in terms of itself, connoting the update step that would be used in a software implementation.) The full rule is needed only when transmitter is present, since when C = 0, R_∞ = 0 and τ_R = 1/β. Equation 2.4 can be split into two update rules, using r = r_on or r = r_off depending on the presence or absence of transmitter C:

r_on ← R_∞(1 − e^{−Δt/τ_R}) + r_on e^{−Δt/τ_R}    (C > 0)    (2.5a)
r_off ← r_off e^{−βΔt}    (C = 0)    (2.5b)

Note that r_on and r_off are not the same as r_open and r_closed from equation 2.1 but are both components of r_open. Synaptic conductance g_syn and current i_syn are defined in the usual way:

g_syn = ḡ r    (2.6a)
i_syn = g_syn (V − E_syn)    (2.6b)
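As a concrete sketch of the two-regime update of equations 2.4-2.5 (the rate constants below are placeholders, not values from the paper):

```python
import math

def dms_update(r, dt, alpha, beta, transmitter_on):
    """Exact one-step update of tau_R * dr/dt = R_inf - r (eqs. 2.4-2.5).
    Transmitter present (C = 1): relax toward R_inf = alpha/(alpha+beta)
    with tau_R = 1/(alpha+beta).  Transmitter absent: R_inf = 0 and
    tau_R = 1/beta, i.e., plain exponential decay."""
    if transmitter_on:
        r_inf = alpha / (alpha + beta)
        q = math.exp(-dt * (alpha + beta))
        return r_inf * (1.0 - q) + r * q
    return r * math.exp(-beta * dt)

# a 1-ms transmitter pulse (Cdur) followed by 1 ms of decay
r, dt = 0.0, 0.025
for _ in range(40):
    r = dms_update(r, dt, alpha=2.0, beta=0.5, transmitter_on=True)
peak = r
for _ in range(40):
    r = dms_update(r, dt, alpha=2.0, beta=0.5, transmitter_on=False)
assert 0.0 < r < peak < 1.0

# because the update uses the analytic solution, many small steps
# compose to exactly one large step
assert abs(peak - dms_update(0.0, 1.0, 2.0, 0.5, True)) < 1e-12
```

This composition property is what later allows per-synapse state to be "projected out in time" only at on/off transitions rather than on every time step.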
3 Summing DMS Synapses
Summing synaptic activations makes it possible to maintain and update two rather than N state variables for the N synapses of a single type converging onto a given cell. An added advantage of this method is that it permits us to maintain a single queue of spike arrival times (now more accurately a heap) instead of N queues. The former improvement results in saving CPU time and the latter in saving both time and memory.

3.1 Separate Summations Required for R_on and R_off. We cannot use the single update rule of equation 2.4, since this would require summing over different values of R_∞ and τ_R depending on the presence or absence of transmitter at the ith synapse. However, the two rules 2.5a and 2.5b have known factors that can be precalculated and brought out from under the summation. Again splitting r into r_on and r_off, we then simply sum across the N_on and N_off synapses, respectively, where the N total synapses have been partitioned depending on their status:

Σ_{i=1}^{N_on} r_on,i ← Σ_{i=1}^{N_on} R_∞(1 − e^{−Δt/τ_R}) + (Σ_{i=1}^{N_on} r_on,i) e^{−Δt/τ_R}    (3.1a)
Σ_{i=1}^{N_off} r_off,i ← (Σ_{i=1}^{N_off} r_off,i) e^{−βΔt}    (3.1b)

Using R_on = Σ_{i=1}^{N_on} r_on,i and R_off = Σ_{i=1}^{N_off} r_off,i, and noting that Σ_{i=1}^{N_on} R_∞ = N_on R_∞, we can simplify 3.1a and 3.1b to produce update rules for the unsubscripted Rs:

R_on ← N_on R_∞(1 − e^{−Δt/τ_R}) + R_on e^{−Δt/τ_R}    (3.2a)
R_off ← R_off e^{−βΔt}    (3.2b)
This form is identical to the single synapse update rules (2.5a and 2.5b) except that the forcing function for R_on has been multiplied by N_on, the number of active synapses. These two update rules 3.2a and 3.2b form a compact two-step inner loop that the complex synapse executes at every time step.

3.2 Modifying R_on and R_off When Individual Synapses Go On or Off. Updating the summed synaptic state variables on each time step saves the computational cost associated with updating individual r_i's. However, since R_on and R_off are complementary state variables that follow the respective rise and decay of multiple single synapses, these single synapse r_i's are still needed. When a single synapse changes state from off to on, R_on must be augmented by the corresponding r_i and R_off decremented. Conversely, when an individual synapse turns from on to off, R_on must be decremented and R_off augmented by the appropriate amount. In addition, the value of N_on in equation 3.2a must be incremented by 1 (off→on) or decremented by 1 (on→off).

Keeping track of individual r_i values is easily done. Since these state variables are independent of voltage, they can be projected out in time from the last (opposite direction) transition (Destexhe et al. 1994a):

r_i ← R_∞(1 − e^{−Cdur/τ_R}) + r_i e^{−Cdur/τ_R}    (turning off)    (3.3a)
r_i ← r_i e^{−β·ISI}    (turning on)    (3.3b)
These are identical to equations 2.5a and 2.5b except that the time interval Δt has been replaced by Cdur (the duration of transmitter release) in the first case and by ISI, the interspike interval, in the second. Note that while Cdur will remain constant during a simulation, ISI = t − (t_0 + Cdur), where t_0 is the time of the last activation. Thus, ISI is the interval from the end of synaptic activation to the beginning of the next activation for the same individual synapse.

3.3 Handling Different Maximal Conductances. We now have a way of calculating the state variable R = R_on + R_off. We can calculate a conductance G from this as we did in equation 2.6a, or else calculate the components of G = G_on + G_off individually:

G = ḡ R    (3.4a)
G_on = ḡ R_on    (3.4b)
G_off = ḡ R_off    (3.4c)

The foregoing analysis assumes that the individual synapses all have identical conductances. This will generally not be the case. To handle
varying g_i's, we need to expand 3.4b and 3.4c in the same manner as previously (cf. equations 3.1a and 3.1b):

G_on = Σ_{i=1}^{N_on} g_i r_on,i
G_off = Σ_{i=1}^{N_off} g_i r_off,i

We divide through by ḡ and change variables to create a new state variable r'_i = (g_i/ḡ) r_i. If we now redefine R_on = Σ_{i=1}^{N_on} r'_on,i, R_off = Σ_{i=1}^{N_off} r'_off,i, and N_on = Σ_i (g_i/ḡ), we arrive back at equations 3.2a and 3.2b. The change in the definition of N_on is the only one that affects the implementation. Previously, N_on = Σ_i 1.0, since each individual synapse had identical magnitude 1. Now, instead of incrementing or decrementing by 1 when turning an individual synapse on or off, we simply add or subtract the appropriate g_i/ḡ.

Making our new state variable a proportion rather than a conductance is done for convenience and to maintain the Hodgkin-Huxley tradition of dimensionless state variables. The new state variable is described by a slight variation of equation 2.1. In the usual convention, r_closed = 1 − r_open. With this modification, r'_closed = (g_i/ḡ) − r'_open. The dimensionless state variable is also useful for managing simulations. With ḡ treated as a simulation-wide global parameter, equation 3.4a gives the user the ability to globally alter the strength of a particular neurotransmitter by reducing the corresponding ḡ. This is analogous to the common experimental practice of dumping transmitter antagonists into the bath in vitro or giving antagonists systemically in vivo.
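The bookkeeping of Sections 3.1-3.3 can be sketched as a small class (illustrative names; shared α and β across the synapse type, weighted variables r'_i = (g_i/ḡ)r_i). Because every update uses the same analytic exponentials, a brute-force integration of each individual synapse should reproduce the lumped conductance to machine precision.

```python
import math

class LumpedSynapses:
    """Consolidated DMS synapses of one type (eqs. 3.2-3.3).
    Per-synapse r'_i values are touched only at on/off transitions,
    where they are projected forward from the previous transition."""
    def __init__(self, weights, alpha, beta, gbar=1.0):
        self.w = [g / gbar for g in weights]   # g_i / gbar
        self.r = [0.0] * len(weights)          # r'_i at its last transition
        self.beta = beta
        self.r_inf = alpha / (alpha + beta)
        self.tau = 1.0 / (alpha + beta)
        self.Ron = self.Roff = self.Non = 0.0

    def step(self, dt):
        """Eqs. 3.2a/3.2b: the two-step inner loop, run every time step."""
        q = math.exp(-dt / self.tau)
        self.Ron = self.Non * self.r_inf * (1.0 - q) + self.Ron * q
        self.Roff *= math.exp(-self.beta * dt)

    def turn_on(self, i, isi):
        """Eq. 3.3b: project r'_i over the off interval, move it to R_on."""
        self.r[i] *= math.exp(-self.beta * isi)
        self.Roff -= self.r[i]
        self.Ron += self.r[i]
        self.Non += self.w[i]                  # add g_i/gbar, not 1

    def turn_off(self, i, cdur):
        """Eq. 3.3a: project r'_i over the on interval, move it to R_off."""
        q = math.exp(-cdur / self.tau)
        self.r[i] = self.w[i] * self.r_inf * (1.0 - q) + self.r[i] * q
        self.Ron -= self.r[i]
        self.Roff += self.r[i]
        self.Non -= self.w[i]

    def conductance(self, gbar=1.0):
        """Eq. 3.4a: G = gbar * (R_on + R_off)."""
        return gbar * (self.Ron + self.Roff)

# drive one of three synapses for 1 ms, then release
lump = LumpedSynapses([0.5, 1.0, 2.0], alpha=2.0, beta=0.5)
lump.turn_on(0, 0.0)
for _ in range(10):
    lump.step(0.1)
g_peak = lump.conductance()
lump.turn_off(0, 1.0)          # it was on for 10 * 0.1 = 1.0 ms
lump.step(0.1)
assert 0.0 < lump.conductance() < g_peak
```

Regardless of how many synapses converge, each time step costs only the two multiplications and two exponential factors of `step`, with transition work paid only when a spike arrives or transmitter release ends.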
4 Maintaining a Single Queue
Simulating delay is necessary because most simulations do not include axons. Therefore the delay encompasses both the time taken for an action potential to proceed down the axon (axonal delay) as well as the typically shorter time required for transmitter to diffuse across the synaptic cleft and bind. Handling delays requires maintenance of a queue, a data structure that always disgorges its oldest element (first-in, first-out). Typically, the time of a presynaptic activation is added to the appropriate delay and then stored on a queue. When this time is reached in the simulation, the item is removed from the queue, and the postsynaptic element is activated. Since many individual synapses are now maintained as a single complex synapse, it is natural to consider maintaining a single queue instead of N queues. The queue must now store not only the times of synaptic activation but also an index indicating the specific individual synapse.
4.1 Managing the Queue from the Presynaptic Side. In the direct object-oriented approach to the synapse, the synapse manages its own initiation by constantly checking the presynaptic cell for a signal, generally the passage of voltage above a predetermined threshold. The consolidation of signal management in a single queue would require an array of such presynaptic pointers. The alternative, maintaining a presynaptic array of postsynaptic pointers, is far more efficient: access across the pointer is required only when spikes occur instead of on each time step (Bower and Beeman 1994). Such forward pointers are particularly important in implementations on multiprocessor computers, where pointer access between different CPUs is relatively slow (Niebur et al. 1991). Using forward pointers, the queue receives its input from a structure associated with the presynaptic neuron. When triggered, this structure writes a time stamp equal to current time plus the appropriate delay. Presynaptic identity is also written in the form of an index. The queue is read postsynaptically when time reaches the value of the next queue time. The postsynaptic mechanism is then altered by moving the corresponding r_i from R_off to R_on. Because a complex synapse has a single Cdur, the queue can serve double duty and signal not only the initiation of the synapse but also its termination. For this reason, the queue time is not removed but is instead incremented by Cdur. The queue is implemented with two heads: the first head gives the time for initiating another synapse while the second head gives the time for terminating one. Each time is associated with an index that indicates exactly which individual synapse is being started or stopped.
4.2 A Heap Qua Queue Handles Differing Synaptic Delays. Individual synapses may have different delays. If these synapses share the same queue, an individual synapse with a relatively long delay could activate presynaptically shortly before one with a relatively short delay. This would put a later time on the queue in front of an earlier time. The synapse associated with the earlier time would be activated only after the later time was removed from the queue. To avoid this problem, a heap is used in lieu of a queue. Items in a heap are maintained in numerical order. A traditional heap implementation involves a binary tree (Knuth 1973). In the present case, items arrive out of order relatively rarely and are usually not very far out of order, making a binary tree unnecessary. Instead, the item is checked when it arrives, with the appropriate heap location readily found by forward search when needed. Further consolidation is possible by creating a single master heap for all synapses of a given type. Each heap entry must then include not only an index for the presynaptic mechanism but also one for the postsynaptic mechanism. The algorithm must take account of postsynaptic mechanism number as well as time in maintaining heap order.
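One way to sketch the double-duty event structure is with Python's stdlib heapq (a binary-tree heap; as noted above, a simple forward-search insert also suffices given near-sorted arrivals). All names here are illustrative.

```python
import heapq

class SynapticEventHeap:
    """Spike delivery times for possibly differently delayed synapses,
    kept in time order.  Each entry is (time, synapse_index, phase);
    after an 'on' event fires, the same entry is pushed back Cdur later
    as an 'off' event, so one structure signals both initiation and
    termination of each individual synapse."""
    def __init__(self, cdur):
        self.cdur = cdur
        self.heap = []

    def presynaptic_spike(self, now, delay, index):
        heapq.heappush(self.heap, (now + delay, index, 'on'))

    def pop_due(self, now):
        """Return (index, phase) for every event whose time has arrived."""
        due = []
        while self.heap and self.heap[0][0] <= now:
            t, index, phase = heapq.heappop(self.heap)
            if phase == 'on':   # double duty: schedule the matching 'off'
                heapq.heappush(self.heap, (t + self.cdur, index, 'off'))
            due.append((index, phase))
        return due

q = SynapticEventHeap(cdur=1.0)
q.presynaptic_spike(now=0.0, delay=3.0, index=7)   # long axonal delay
q.presynaptic_spike(now=0.5, delay=1.0, index=2)   # later spike, shorter delay
assert q.pop_due(1.5) == [(2, 'on')]               # short-delay spike fires first
assert q.pop_due(3.0) == [(2, 'off'), (7, 'on')]
```

The second `presynaptic_spike` arrives later in simulated time but is delivered first, which is exactly the out-of-order case that rules out a plain FIFO queue.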
5 Simulation Results
Benchmark simulations were run in NEURON (Hines 1993) on Sun SPARC 10s under SunOS 4.1.3 and on Intel Pentiums under Linux. Figure 1 shows results comparing individual single-synapse evaluation with the complex summated synapse. In Figure 1B the summation of DMS state variables (Σ r_i, dashed line) is compared to the single R calculated using the present algorithm (solid line). The lines do not coincide because the complex synapse algorithm includes weighting for the different g_i's. Figure 1C compares conductances for the two schemes. Benchmarking demonstrated a 3-fold speed-up with the current algorithm. Extending the simulation to 200 fully interconnected, mutually excitatory neurons receiving similar input and spiking at approximately 12 Hz demonstrated a 45-fold speed-up. A more complex simulation with 225 excitatory and inhibitory neurons was also benchmarked. Individual neurons had two compartments and 8-9 voltage- and/or calcium-sensitive conductances. Connectivity was nearly complete, with a total of 50,400 synapses, and the average firing rate was approximately 20 Hz. No attempt was made to optimize either the old or the new synapse model by determining ideal queue lengths; instead, conservative values were used. With the synapse model presented here, core memory usage was reduced 38%, from 8 to 5 Mb. CPU time was reduced by 96%, from 38 hr 50 min to 1 hr 35 min.

6 Discussion
Simulations of neuronal networks quickly fall victim to the perils of combinatorics. While the calculation time required for simulating individual neurons increases proportionally with the number of neurons n, the number of synapses can increase up to n² depending on convergence. Specifically, the number of synapses S can be given either by the product of convergence and number of postsynaptic cells, S = C · Post, or by the product of divergence and number of presynaptic cells, S = D · Pre. Percent convergence, C/Pre, expressed in terms of number of synapses, is S/(Post · Pre). This is equal to percent divergence: D/Post = S/(Pre · Post) (Traub et al. 1987). Calling this term percent connectivity (p_ij), a network with N cell types and n_j cells of each type will have a number of synapses S given by

S = Σ_{i=1}^{N} Σ_{j=1}^{N} p_ij n_i n_j
Depending on the complexity of the single neuron model chosen, time spent in synaptic computations can readily outrun the time spent modeling the neurons themselves. This will be particularly true in parallel implementations if pointers are not managed carefully, as noted above.
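The synapse count above can be computed directly from the connectivity fractions; the numbers below are illustrative (a fully connected version of the 225-cell benchmark, which the text describes as only nearly complete).

```python
def synapse_count(p, n):
    """S = sum_{i,j} p[i][j] * n[i] * n[j]: percent connectivity p[i][j]
    from presynaptic type i to postsynaptic type j, n[k] cells of type k."""
    return sum(p[i][j] * n[i] * n[j]
               for i in range(len(n)) for j in range(len(n)))

# e.g., 200 excitatory and 25 inhibitory cells with full connectivity
# within and between populations would give 225^2 = 50,625 synapses
n = [200, 25]
p_full = [[1.0, 1.0], [1.0, 1.0]]
assert synapse_count(p_full, n) == 225 ** 2
```

The quadratic growth in cell number is immediate from the double sum, which is the point of the combinatorics argument.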
Figure 1: Comparison of the DMS model synapse (dashed line) with the complex synapse (solid line). Randomized synaptic input was used to drive both synapse models. Individual ḡ values ranged systematically from 0.1 to 5 µS, while delays ranged from 0 to 25 msec. (A) Six of the 50 presynaptic spike trains used as input to the two synapse models. The bottom 5 traces are the most rapidly spiking and the top trace the least rapidly spiking cell. Spike trains were produced with a Poisson generator using the gen.mod presynaptic spike generator written by Zach Mainen. (B) Comparison of summed state variables for the two models: Σ r_i (dashed line) vs. R (solid line). The former is dimensionless while the latter is in µS. (C) Comparison of summed conductance (in µS) for the two models: Σ r_i g_i (dashed line) vs. Rḡ (solid line). The curve for the complex synapse is identical to that in B since ḡ = 1. Although not apparent here, the superposition is imperfect due to time-step round-off differences between the two implementations.
The consolidated implementation presented here extends the value of the original DMS synapses in reducing this computational load.
Acknowledgments
I would like to thank Jack Wathey, Mike Hines, and Alain Destexhe for many helpful virtual discussions and the two anonymous reviewers for useful comments and corrections. Scott Deyo ran some of the benchmarks. This work was done using the NEURON simulator with support from NINDS and the Veterans Administration.
References

Bower, J., and Beeman, D. 1994. The Book of GENESIS. Springer-Verlag, New York.
Destexhe, A., Mainen, Z. F., and Sejnowski, T. J. 1994a. An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Comp. 6, 14-18.
Destexhe, A., Mainen, Z. F., and Sejnowski, T. J. 1994b. Synthesis of models for excitable membranes, synaptic transmission and neuromodulation using a common kinetic formalism. J. Comp. Neurosci. 1, 195-230.
Hines, M. 1993. NEURON: A program for simulation of nerve equations. In Neural Systems: Analysis and Modeling, F. H. Eeckman, ed., pp. 127-136. Kluwer Academic Press, Boston, MA.
Knuth, D. 1973. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley, New York.
Niebur, E., Kammen, D. M., Koch, C., Ruderman, D., and Schuster, H. G. 1991. Phase coupling in two-dimensional networks of interacting oscillators. In Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 123-129. Morgan Kaufmann, San Mateo, CA.
Rall, W. 1967. Distinguishing theoretical synaptic potentials computed for different somadendritic distributions of synaptic inputs. J. Neurophys. 30, 1138-1168.
Srinivasan, R., and Chiel, H. J. 1993. Fast calculation of synaptic conductances. Neural Comp. 5, 200-204.
Traub, R. D., Miles, R., and Wong, R. K. S. 1987. Models of synchronized hippocampal bursts in the presence of inhibition. I. Single population events. J. Neurophys. 58, 739-751.
Wang, X. J., and Rinzel, J. 1992. Alternating and synchronous rhythms in reciprocally inhibitory model neurons. Neural Comp. 4, 84-97.
Received May 26, 1995; accepted August 15, 1995
Communicated by Laurence Abbott
Parameter Extraction from Population Codes: A Critical Assessment
Herman P. Snippe*
University of Stirling, Department of Psychology, Stirling FK9 4LA, Scotland, U.K.
In perceptual systems, a stimulus parameter can be extracted by determining the center-of-gravity of the response profile of a population of neural sensors. Likewise, at the motor end of a neural system, center-of-gravity decoding, also known as vector decoding, generates a movement direction from the neural activation profile. We evaluate these schemes from a statistical perspective, by comparing their statistical variance with the minimum variance possible for an unbiased parameter extraction from the noisy neuronal ensemble activation profile. Center-of-gravity decoding can be statistically optimal. This is the case for regular arrays of sensors with gaussian tuning profiles that have an output described by Poisson statistics, and for arrays of sensors with a sinusoidal tuning profile for the (angular) parameter estimated. However, there are also many cases in which center-of-gravity decoding is highly inefficient. This includes the important case where sensor positions are very irregular. Finally, we study the robustness of center-of-gravity decoding against response nonlinearities at different stages of an information processing hierarchy. We conclude that, in neural systems, instead of representing a parameter explicitly, it is safer to leave the parameter coded implicitly in a neuronal ensemble activation profile.

1 Introduction
Structure can be coded in many ways. We briefly discuss three coding systems for a parameter a. More possibilities, e.g., based on the temporal characteristics of the neural response, certainly exist (e.g., Geisler et al. 1991; Hopfield 1995; Konig et al. 1995; Middlebrooks et al. 1994; Oram and Perrett 1992). 1. Make a sensor that has a response strength R that grows monotonically with a: R = R(a). In the simplest case the function R(a)

*Present address: University of Groningen, Department of Biophysics, Nijenborgh 4, 9747 AG Groningen, The Netherlands.
Neural Computation 8, 511-529 (1996)
© 1996 Massachusetts Institute of Technology
is the identity: R = a. We call this intensity coding for the parameter a (Fig. 1a). 2. Divide parameter space into a large number of discrete cells, and make a sensor for each of these cells. Then a is coded by the identity of the responding sensor (Fig. 1b). This labeled line coding is ubiquitous in computers, e.g., in the pixellation of an image. 3. Use sensors with graded and overlapping sensitivity profiles. The coarse value of the parameter a is now carried by the identity of the responding sensors, but for a precise determination of a we have to invoke the relative intensity with which each of these sensors responds. This is called a population coding for the parameter a (Fig. 1c). Population codes are ubiquitous in biological systems. For instance, in the visual system, local pattern spatial frequency and orientation, motion speed and direction, and binocular disparity are all strongly believed to be population coded according to our definition. Many more examples exist in other modalities, for instance the coding of the location of a sound source using neurons selectively tuned for different values of interaural time and intensity differences (e.g., Konishi 1991, 1993), the coding of target distance and speed in bat echolocation using neurons tuned for echo delay and Doppler shift (e.g., Altes 1989; Olsen 1992; Olsen and Suga 1991a,b), and the coding of olfactory and gustatory stimuli (e.g., Rolls 1989). Population coding is not restricted to sensory modalities. It has been shown that neurons in motor cortex have graded and overlapping activity profiles for (intended) movement direction (Georgopoulos et al. 1988; see Lee et al. 1988; McIlwain 1991 for examples in the superior colliculus). Also it is conceivable that population coding is used in cognition (Poggio and Girosi 1990; Young and Yamane 1992). Because of the ubiquity of population coding in neural systems, it is important to know its capacities.
What is the precision with which the parameter a is present in the neural ensemble response profile? Are there simple ways to extract a from the neuronal activation profile? This extraction of a is not trivial since a is implicit only in the ensemble of sensor responses. A solution to the parameter extraction problem is to use the center of gravity of the neuronal activation profile as the estimate of a (Baldi and Heiligenberg 1988; Lee et al. 1988; Zohary 1992). Denoting the parameter tuning of the nth neuron in the ensemble by a_n, and the actual response of this neuron by R_n, the center-of-gravity (CG) estimate â_CG of a is

â_CG = Σ_n a_n R_n / Σ_n R_n    (1.1)
Note that equation 1.1 makes intuitive sense. Each neuron promotes its tuning label, and is allowed to do so in proportion to its response to the actual stimulus. The denominator in 1.1 is a normalization factor that
Figure 1: Three possibilities to code the value of an environmental parameter a. (a) Intensity coding; the intensity of response of a single sensor suffices to code the value of a. (b) Labeled line coding; the identity of the responding sensor codes the value of a. For a precise evaluation of a many sensors are needed. (c) Population coding by sensors with graded and overlapping sensitivity profiles. This represents a balance between the extremes of intensity coding and labeled line coding. A value of a indicated by the arrow would yield a strong response from sensor 4, a moderate response from sensors 3 and 5, and no response from the other sensors in the array. Thus, the value of a is coded partly by the identity of the responding sensors, and partly by the intensity of their response.
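The center-of-gravity readout of equation 1.1 is simple enough to state in a few lines of code. The following minimal sketch (assuming numpy; the array size, tuning width, and stimulus value are illustrative choices, not taken from the text) recovers a stimulus parameter from a noise-free gaussian activation profile:

```python
import numpy as np

def cg_estimate(labels, responses):
    """Center-of-gravity estimate of equation 1.1: each sensor 'votes' for
    its tuning label a_n with a weight given by its response R_n."""
    labels = np.asarray(labels, dtype=float)
    responses = np.asarray(responses, dtype=float)
    return float(np.dot(labels, responses) / responses.sum())

# Noise-free responses of a regular array with gaussian tuning (sigma = 1)
# to a stimulus at a = 3.6; all values are illustrative.
labels = np.arange(-10, 11)
a_true = 3.6
responses = np.exp(-(a_true - labels) ** 2 / 2.0)
print(cg_estimate(labels, responses))  # recovers a value very close to 3.6
```

With noise-free gaussian profiles on a regular array, the readout is accurate far beyond the unit sensor spacing, which is exactly the appeal of the scheme.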
deconfounds influences due to stimulus intensity. Center-of-gravity decoding can be easily extended to a multidimensional parameter a (Georgopoulos et al. 1988; Lee et al. 1988; McIlwain 1991; Salinas and Abbott 1994; Sanger 1994; Zohary 1992), or reformulated as an extraction of a population vector (Seung and Sompolinsky 1993; Vogels 1990). For clarity we restrict our discussion to the one-dimensional form of equation 1.1. Previous studies have looked in detail into systematic deviations of â_CG from the true value a (e.g., Baldi and Heiligenberg 1988; Sanger 1994). These deviations are very small provided that the sensors sample the parameter space sufficiently densely (Baldi and Heiligenberg 1988), and that the distribution of the sensor locations a_n is sufficiently homogeneous (see Sanger 1994 for a more precise statement). Thus for many systems systematic errors in the center-of-gravity estimate will tend to be small. However, in real neural systems the sensor responses are noisy. Hence the sensor responses R_n in equation 1.1 should be treated as random variables. From statistics we know that the fidelity of an estimate as in equation 1.1 is measured not only by its systematic deviations (bias), but also by its random deviations (variance). Of the possible unbiased estimators of a, the one that has the lowest variance is to be preferred. In this paper we compare the variance of the center-of-gravity estimate with that of the statistically optimal, minimum variance unbiased estimate based upon the same channel-coded data. Our analytical results complement simulation studies by Salinas and Abbott (1994). For additional analytical results see Seung and Sompolinsky (1993). Equation 1.1 has usually been studied under the assumption of system linearity. However, nonlinearities in neural information processing abound (e.g., Abbott 1994; Douglas and Martin 1991).
Thus it is crucial that a proposed parameter extraction scheme is robust under nonlinearities at different stages in the information processing hierarchy. In section 5 we evaluate this robustness for the CG estimate 1.1.

2 Efficiency of CG Estimation Is Low for Sharply Tuned Sensors Perturbed by Background Noise
In this section we study the model of Baldi and Heiligenberg (1988): a regular array of 2M + 1 sensors with unit spacing between consecutive sensors. The parameter tuning profiles of the sensors are gaussian with width σ; for the nth sensor

Q_n(a) = exp[-(a - n)² / (2σ²)]    (2.1)
Contrary to the treatment by Baldi and Heiligenberg, we assume that the actual sensor outputs R_n are noisy:

R_n = Q_n(a) + W_n    (2.2)
Throughout this paper, we assume that the noise W_n is uncorrelated between sensors (see Snippe and Koenderink 1992b for a treatment of noise correlations). In this section we also assume that the noise is gaussian, and that the noise variance W² is independent of the (expected) sensor response Q_n. We call this situation background noise. Although the statistics of real neural noise are closer to Poisson (Softky and Koch 1992), a study of the effects of background noise is nevertheless relevant when sensor response (2.2) is a modulation superimposed on a spontaneous (noisy) neural activity level, as is the case for retinal ganglion cells (Robson and Troy 1987). The extraction of the stimulus parameter a from the ensemble response {R_n} can be formulated as a problem in statistical estimation theory (Deutsch 1965): generate an estimator f that operates on the ensemble response {R_n} to yield an estimate â of the actual value of the parameter a:

â = f(R_-M, ..., R_M)    (2.3)
Generally, the quality of an estimator is measured using two quantities: 1. Its bias, representing any systematic deviations of â from a. 2. Its variance, representing the random errors in â.
Baldi and Heiligenberg (1988) show that the center-of-gravity estimation scheme 1.1 is virtually bias free (when sensor tuning width is not much smaller than the spacing between consecutive sensors). However, to fully assess the statistical quality of the estimate 1.1, we have to compare its variance with that of the statistically optimal, unbiased minimum variance estimator. In statistics there is a well-known lower bound on the variance Var(â) of any unbiased estimate of a. This Cramér-Rao bound is given by (Deutsch 1965; Paradiso 1988)

Var(â) ≥ 1 / E[(∂ ln p({R_n} | a) / ∂a)²]    (2.4)
in which E[...] is the statistical expectation operation, and p({R_n} | a) is the probability distribution function of the neuronal ensemble response. Using the gaussian nature of the noise statistics W_n in 2.2, one easily calculates

ln p({R_n} | a) = const - Σ_n [R_n - Q_n(a)]² / (2W²)    (2.5)

and

E[(∂ ln p({R_n} | a) / ∂a)²] = Σ_n [Q'_n(a)]² / W²    (2.6)
The prime on Q'_n(a) indicates a differentiation with respect to a. Thus the Cramér-Rao bound 2.4 is

Var(â) ≥ W² / Σ_n [Q'_n(a)]²    (2.7)

Substituting the explicit expression 2.1 for Q_n(a) in 2.7, and replacing the summation by an integration (an excellent approximation for σ ≥ 1, Baldi and Heiligenberg 1988), it is easy to evaluate

Σ_n [Q'_n(a)]² ≈ ∫ [(x - a)²/σ⁴] exp[-(x - a)²/σ²] dx = √π / (2σ)    (2.8)

Hence the Cramér-Rao bound for our model system is

Var(â) ≥ (2/√π) σ W²    (2.9)
Note that the minimum attainable variance increases with the sensor tuning width σ (Seung and Sompolinsky 1993; Snippe and Koenderink 1992a). Thus to attain a precise estimate of a, it is preferable to have sensors with small σ, i.e., with sharp tuning. For the present model system, the Cramér-Rao bound is attained for the maximum likelihood estimator (Snippe and Koenderink 1992a; Snippe 1996). Thus we use the right-hand side of equation 2.9 as a benchmark against which to compare the variance of the center-of-gravity estimate 1.1. We now proceed to calculate this variance. We concentrate on the variance produced by the numerator Σ_n a_n R_n in 1.1, and neglect the contribution to the variance due to the normalization Σ_n R_n, which we treat as noise free. It can be shown that this does not affect our conclusions. Thus

Var(â_CG) = Var(Σ_n a_n R_n) / [Σ_n Q_n(a)]²    (2.10)

For our regular sensor array a_n = n; thus the numerator in 2.10 equals

Var(Σ_n n R_n) = W² Σ_n n² ≈ (2/3) M³ W²    (2.11)

Assuming that σ ≥ 1, and that a is well within the sensor tuning range [-M ... M], the denominator in 2.10 equals

[Σ_n Q_n(a)]² ≈ [∫ exp(-(x - a)²/(2σ²)) dx]² = 2πσ²    (2.12)

Thus equation 2.10 yields

Var(â_CG) ≈ M³ W² / (3πσ²)    (2.13)
Contrary to the variance of the maximum likelihood estimator, the variance of the center-of-gravity estimate decreases as a function of the sensor tuning width σ. Hence, for background noise, it is advantageous to use broadly tuned sensors in a center-of-gravity estimation scheme (Seung and Sompolinsky 1993). Baldi and Heiligenberg (1988) arrived at a similar conclusion for the systematic deviations (bias) of the center-of-gravity estimate. Using the Cramér-Rao bound as a benchmark, the center-of-gravity estimate has an efficiency

ε = [(2/√π) σ W²] / Var(â_CG) ≈ 6√π (σ/M)³    (2.14)
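A small Monte Carlo experiment reproduces this efficiency loss. The sketch below (assuming numpy; all parameter values are illustrative) simulates a regular gaussian-tuned array with background noise and M = 10σ, and compares the simulated variance of the CG estimate with the Cramér-Rao bound of equation 2.9:

```python
import numpy as np

# Monte Carlo check of equation 2.14: with additive ("background") noise
# and M = 10*sigma, CG decoding should reach only about 1% efficiency.
rng = np.random.default_rng(0)
sigma, M, W, a = 2.0, 20, 0.01, 0.0   # illustrative values, M = 10*sigma
n = np.arange(-M, M + 1)
Q = np.exp(-(a - n) ** 2 / (2 * sigma**2))   # equation 2.1

trials = 20000
R = Q + W * rng.standard_normal((trials, n.size))   # equation 2.2
a_cg = (R @ n) / R.sum(axis=1)                      # equation 1.1, a_n = n
var_cg = a_cg.var()

cr_bound = 2 * sigma * W**2 / np.sqrt(np.pi)        # equation 2.9
efficiency = cr_bound / var_cg
print(efficiency, 6 * np.sqrt(np.pi) * (sigma / M) ** 3)  # both near 0.01
```

The simulated efficiency and the closed form 6√π(σ/M)³ agree to within the discretization of the array, both landing near the 0.01 quoted in the text.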
If the sensor array extent 2M is large relative to the sensor tuning width, the statistical efficiency of the center-of-gravity estimate 1.1 is low. For instance, if M = 10σ, equation 2.14 yields an efficiency of about 0.01, i.e., the variance of the center-of-gravity estimate is about 100 times the variance of the statistically most efficient estimate. Why is center-of-gravity estimation so dramatically inefficient under the conditions studied in this section? The reason is the combination of two circumstances: 1. The sensor array extent is large relative to the sensor tuning width. Thus most sensors in the array are not at all responsive to a stimulus with a specific value for the parameter a. 2. Sensor noise is response independent. Thus sensors that are not at all stimulated do generate noise. The inefficiency of the center-of-gravity (CG) estimate is due to the fact that in equation 1.1 contributions of all sensors are summed (in proportion to their position label in the array). This includes the contributions of sensors that show little response to the stimulus, but that do contribute noise. Note that these sensors also contribute in the OLE (optimum linear estimation) scheme of Salinas and Abbott (1994), since for the regular
sensor array studied in this section, OLE is essentially identical to the CG estimate 1.1 (with the denominator in 1.1 replaced by a suitable normalization constant). This behavior of CG/OLE decoding is opposed to that of the maximum likelihood (ML) estimator. In an ML estimation scheme the sensor outputs enter the decision with a weight equal to their differential sensitivity to small variations of the parameter a around its actual value (Snippe and Koenderink 1992a; Snippe 1996). Since sensors with tuning labels a_n that are far removed from a remain unresponsive to the stimulus under small variations of a around its actual value, these sensors do not enter the ML decision variable. Hence, contrary to center-of-gravity (and OLE) decoding, ML estimation is not affected by noise in the unresponsive sensors.

3 Efficiency of CG Estimation Is High for Poisson Noise or Broadly Tuned Sensors
In the previous section we showed that center-of-gravity estimation is highly inefficient for ensembles of sharply tuned sensors that are perturbed by response-independent noise. Here we show that if either of these circumstances is relaxed, the center-of-gravity estimate 1.1 can actually be ideal, i.e., have 100% efficiency.

3.1 Poisson Noise. First we study a case in which the sensor response noise variance vanishes for unstimulated sensors. This is true if sensor noise is governed by the Poisson distribution:

P(R_n | a) = exp[-Q_n(a)] Q_n(a)^{R_n} / R_n!    (3.1)
In this expression we assume that the actual sensor response R_n for the nth sensor is a whole number, e.g., the number of spikes generated by the nth sensor in the observation interval. The Poisson distribution is relevant, since its variance equals its mean Q_n(a), which is approximately true for cortical neurons that show little spontaneous activity (Dean 1981; Softky and Koch 1992; Tolhurst et al. 1983; Vogels et al. 1989). We now calculate the ML estimate for the parameter a when sensor response noise is governed by 3.1. As is well known, the ML estimate can be obtained by differentiating the logarithm of the response probability distribution function with respect to the parameter of interest:

∂/∂a ln p({R_n} | a) = ∂/∂a Σ_n [-ln R_n! + R_n ln Q_n(a) - Q_n(a)] = Σ_n [R_n Q'_n(a)/Q_n(a) - Q'_n(a)]    (3.2)
Again we assume that the sensor tuning profiles are gaussian, and that the sensor array is regular (we will study nonregular arrays in the next section). Then the ratio Q'_n(a)/Q_n(a) equals (n - a)/σ². When the sensor array is sufficiently dense, Σ_n Q'_n(a) equals zero (Baldi and Heiligenberg 1988), and equation 3.2 reduces to

∂/∂a ln p({R_n} | a) = (1/σ²) Σ_n R_n (n - a)    (3.3)

Setting this expression to zero yields the ML estimator â_ML of a:

â_ML = Σ_n n R_n / Σ_n R_n    (3.4)
This result is identical to the center-of-gravity estimate 1.1 for a regular sensor array. Since the ML estimator 3.4 is essentially unbiased (Baldi and Heiligenberg 1988), it is the minimum variance estimate (Deutsch 1965). Thus under the circumstances studied in this subsection:

- a regular array,
- of sensors with gaussian tuning functions,
- with output noise governed by Poisson statistics,
center-of-gravity estimation is optimal from a statistical point of view. How closely does center-of-gravity estimation still approximate this ideal when these conditions are relaxed?
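The optimality claimed above can be checked by simulation. In this sketch (assuming numpy; R_max = 25, σ = 2, and the array size are illustrative values), Poisson responses are drawn from a regular gaussian-tuned array, and the variance of the CG estimate is compared with the Cramér-Rao bound, which for this model works out to σ/(R_max √(2π)) (our evaluation of 1/Σ_n [Q'_n(a)]²/Q_n(a) for Poisson noise, replacing the sum by an integral):

```python
import numpy as np

# Monte Carlo check for section 3.1: with Poisson noise on a regular
# gaussian-tuned array, the CG estimate attains the Cramer-Rao bound.
rng = np.random.default_rng(1)
sigma, R_max, a, M = 2.0, 25.0, 0.0, 20   # illustrative values
n = np.arange(-M, M + 1)
Q = R_max * np.exp(-(a - n) ** 2 / (2 * sigma**2))

trials = 20000
R = rng.poisson(Q, size=(trials, n.size)).astype(float)
a_cg = (R @ n) / R.sum(axis=1)            # equations 1.1 and 3.4 coincide

cr_bound = sigma / (R_max * np.sqrt(2 * np.pi))
print(a_cg.var(), cr_bound)  # nearly equal: CG is statistically optimal here
```

Unlike the background-noise case, the unresponsive sensors here contribute neither signal nor noise, which is why the simple CG sum loses nothing.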
1. Provided that Σ_n Q'_n(a) is still zero, the regularity of the sensor array is not essential, since the ML estimate, as evaluated from equation 3.2, is identical to equation 1.1. In general, however, as we show in the next section, sensor array irregularities will lead to a reduction of the efficiency of the center-of-gravity estimate. 2. Small amounts of background noise injections can have a large influence on the structure of the ML estimator (Geisler 1984). However, if the Poisson component of the noise is still the dominant contribution to the variance of the center-of-gravity estimate, it will remain close to ideal. How much background noise can be tolerated depends on the size of the sensor array, i.e., the parameter M/σ from equation 2.14. Large arrays tolerate less background noise. 3. The derivation presented works only for gaussian tuning functions. The impairment in the efficiency of the center-of-gravity estimate depends critically on the behavior of the tuning function in its skirts, with shallow skirts yielding large impairments. For instance, for Cauchy tuning functions, Q_n(a) = 1/[(a - n)² + b²], which have very shallow skirts, we calculate that the efficiency of the center-of-gravity estimator approaches zero for a large sensor array (M/b >> 1).
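The Cauchy example of item 3 can be made quantitative. The sketch below (assuming numpy; b = 5 and the array sizes are illustrative choices) compares an analytic expression for the CG variance under Poisson noise, Σ_n a_n² Q_n/(Σ_n Q_n)² at a = 0, against the Poisson Cramér-Rao bound 1/Σ_n [Q'_n]²/Q_n, and shows the efficiency collapsing as M/b grows:

```python
import numpy as np

def cg_efficiency_cauchy(M, b):
    """Efficiency of CG decoding for Cauchy tuning Q_n(a) = 1/((a-n)^2 + b^2)
    with Poisson noise, stimulus at a = 0, regular array n = -M..M."""
    n = np.arange(-M, M + 1, dtype=float)
    Q = 1.0 / (n**2 + b**2)
    dQ = 2 * n / (n**2 + b**2) ** 2              # dQ_n/da evaluated at a = 0
    var_cg = np.sum(n**2 * Q) / np.sum(Q) ** 2   # CG variance, Poisson noise
    cr_bound = 1.0 / np.sum(dQ**2 / Q)           # Cramer-Rao bound
    return cr_bound / var_cg

print(cg_efficiency_cauchy(100, 5), cg_efficiency_cauchy(10000, 5))
# efficiency shrinks toward zero as the array grows (M/b >> 1)
```

The shallow Cauchy skirts make Σ_n n² Q_n grow linearly with M while the Cramér-Rao bound saturates, so the ratio falls without limit.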
3.2 Broadly Tuned Sensors. In Section 2 we showed that for sensors perturbed by background noise, the statistical efficiency of center-of-gravity estimation is low when the sensors are sharply tuned, i.e., when M/σ >> 1. A careful evaluation of the integrals in Section 2 shows that when the sensors are broadly tuned (M/σ < 1), the efficiency of population coding approximates 1. This is a general result that is not restricted to gaussian tuning functions (Seung and Sompolinsky 1993). For instance, if the tuning functions are cosinusoidal, i.e., if sensor output is described by a projection of a stimulus vector on a sensor sensitivity vector, vector decoding (a variant of the center-of-gravity scheme 1.1) can be shown to be statistically optimal for a sufficiently homogeneous distribution of sensor sensitivity vectors (Salinas and Abbott 1994; Sanger 1994).

4 Sensor Position Irregularities: Another Noise Source for Center-of-Gravity Estimation

Up to now we have studied regular sensor arrays. Although near-perfect regular arrays exist in biological systems (e.g., the human foveal photoreceptor array, Hirsch and Miller 1987; the cricket cercal system, Miller et al. 1991), deviations from regularity often occur. How do such irregularities affect the performance of the center-of-gravity estimation scheme? To study this question, we analyze a system in which the actual sensor tuning locations a_n are perturbed from the positions a_n = n they would have in a regular array. We describe the perturbations using a gaussian probability density function:

Λ(a_n) = (2πs²)^{-1/2} exp[-(a_n - n)² / (2s²)]    (4.1)

The (root-mean-square) perturbation size s describes the degree of irregularity of the sensor array. For s small relative to the average spacing between consecutive sensors (s << 1) the sensor array is nearly regular; for s >> 1 the array is very irregular. We assume that the sensors are identical with respect to tuning width, response strength, and noise variance. This is not very realistic biologically, but such variations between sensors would actually have effects very similar to the effects of the location perturbations studied here. See Vogels (1990) and Zohary (1992) for simulation studies of these effects. We evaluate the impairment of the population coding estimate 1.1 due to the sensor location perturbations for sensors with outputs perturbed by Poisson noise (equation 3.1). Note that by using the actual (perturbed) sensor locations a_n in 1.1 instead of their mean n, we assume that the estimator fully knows the sensor array irregularities (Bossomaier et al. 1985; Theunissen and Miller 1991). See Ahumada (1991) for learning models that accomplish this.
We rewrite equation 1.1 as

â_CG - a = Σ_n (a_n - a) R_n / Σ_n R_n    (4.2)

This shows that the actual value a of the environmental parameter is not critical for our results; hence we lose little generality by assuming a = 0, which we do for convenience. For reasons of symmetry it is then obvious that the numerator and denominator in the right-hand side of equation 4.2 are statistically independent, and that the statistical expectation of the numerator equals zero. Thus, the variance of the center-of-gravity estimate 4.2 is

Var(â_CG) = E[(Σ_n a_n R_n)²] / (E[Σ_n R_n])²    (4.3)

Note that the expectation operations E[...] are to be evaluated over the sensor response distribution P(R_n | a) of equation 3.1 and over the sensor position label distribution Λ(a_n) of equation 4.1. Schematically:

E[...] = E_Λ[E_P[... | {a_n}]]    (4.4)
In the Appendix we show that the variance factor in 4.3 can be rewritten as

E[(Σ_n a_n R_n)²] = E_Λ[Σ_n a_n² Q_n(a)] + E_Λ[(Σ_n a_n Q_n(a))²]    (4.5)

The first term in the right-hand side of equation 4.5 generates the part of the variance of â_CG due to the variance of the sensor responses. This is easy to see if we evaluate the variance operation in equation 4.3 while keeping the sensors fixed at positions a_n, instead of describing them using a probability model as in equation 4.1. For such fixed (albeit irregular) sensor positions, equation 4.3 yields

Var(â_CG) = Σ_n a_n² Q_n(a) / (Σ_n Q_n(a))²    (4.6)

Thus only the contribution of the first term in the right-hand side of equation 4.5 occurs if we fix the sensor array irregularity and take into account only the variance due to the noise in the sensor outputs. Therefore we interpret this as the contribution to the variance of the center-of-gravity estimate due to the noise in the sensor outputs. What, then, is the second term in equation 4.5? Though a fixed, irregular array generates only the variance of the first term of equation 4.5, it will, in general, also generate a bias in the center-of-gravity estimate of a. The second term in equation 4.5 is simply the mean squared bias in the estimate of a due to the
irregularity of the sensor array. In this section we interpret this term as a variance due to the randomness in the sensor positions. We assume, as in equation 2.1, that the sensor tuning profiles Q_n are gaussian, but with an extra free parameter R_max that represents the sensor sensitivity to the stimulus; thus

Q_n(a) = R_max exp[-(a - a_n)² / (2σ²)]    (4.7)

For this Q_n the remaining expectation operations over sensor position in equation 4.5 can be calculated explicitly. When replacing the summations over n by integrations, the calculations are straightforward (though lengthy), and yield

Var(â_CG) = [σ/(4√π)] (2^{3/2} R_max^{-1} + {1 - (1 + s²/σ²)^{-3/2}})    (4.8)
The term 2^{3/2} R_max^{-1} is the contribution to the variance due to the noise in the sensor responses; the remaining term is due to the randomness in the sensor positions. To judge the relative contributions of these two terms, we need a realistic estimate of R_max. An estimate R_max ≈ 25 follows from observed spiking frequencies in well-stimulated cortical neurons (circa 100 spikes/sec) and a reaction/integration time of 250 msec (Vogels 1990). A similar estimate for R_max would follow from results on the actual noise variance of cortical neurons when we realize that R_max = R²_max/R_max is the (quadratic) signal-to-noise ratio of the most vigorously responding neurons in our model (Tolhurst et al. 1983; Vogels et al. 1989). For a perfectly regular sensor array (s = 0), the factor in curly brackets in 4.8 equals 0, and the sole contribution to the error in â_CG is due to the sensors' response noise. For s > 0, there is a contribution due to sensor array irregularities, quadratic in s for s << σ (Theunissen and Miller 1991). Both variance terms in 4.8 yield equal contributions to Var(â_CG) if {1 - (1 + s²/σ²)^{-3/2}} = 2^{3/2} R_max^{-1}, which works out as s ≈ 0.30σ for our estimate R_max ≈ 25. Note that this critical sensor array irregularity is determined by the value of s relative to σ (the sensor tuning width), and not by the value of s relative to the (average) sensor spacing (unity in our model). Also note that the results above do not depend on the density with which the neuronal ensemble samples the parameter space. For a denser sampling, both variance terms in 4.8 decrease (in proportion to the sampling density), but their ratio remains constant. This result is not confined to the model system studied in this section. For instance, we calculated that also in the system studied by Sanger (1994) (sensors
having cosinusoidal tuning profiles with background output noise), the ratio of the contributions to the variance in the center-of-gravity estimate due to sensor output noise and to sensor position noise does not depend on the sensor density. For a very irregular sensor array (s → ∞), the factor in curly brackets in 4.8 equals 1, and the ratio of the variance due to the sensor position randomness and that due to the sensor response noise equals 2^{-3/2} R_max. Note that this ratio is finite, even for a highly irregular sensor array. Nevertheless, the ratio is large; for R_max = 25 it equals nearly 10. This means that for a typical cortical sensor ensemble, if the sensor labels are highly random (though perfectly known to the estimator at a next level), by far the largest contribution to the error in the center-of-gravity estimate 1.1 is due to the sensor array irregularities, and not to the sensors' response noise. Would the statistically optimal estimator suffer to the same extent from irregularities in the sensor positions? Actually, no. The maximum likelihood estimator, as derived from equation 3.2, would take into account irregularities in the sensor array through the function Σ_n Q'_n(a). Certainly the variance of the ML estimate will increase in regions of the sensor array where, by chance, the sensor density is somewhat low, and decrease in regions where the sensor density happens to be relatively high (Bossomaier et al. 1985). However, when averaged over the whole array these fluctuations in the variance of the ML estimates will tend to cancel, provided that the average sampling density is sufficiently high so that "gaps" in the sensor distribution where an evaluation of a would be impossible do not occur. Thus center-of-gravity estimation (and also the related vector decoding scheme) suffers from sensor array irregularities in a way that ML estimation does not.
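The decomposition of the CG error into response noise and position noise can be checked by simulation. The sketch below (assuming numpy; all parameter values are illustrative, and the closed form it compares against is our reconstruction of equation 4.8 at unit sensor spacing) draws a fresh jittered array on every trial, generates Poisson responses, and lets the estimator use the exact perturbed labels:

```python
import numpy as np

# Monte Carlo sketch for section 4: labels a_n = n + gaussian jitter of
# r.m.s. size s; Poisson responses; CG estimate with known perturbed labels.
rng = np.random.default_rng(2)
sigma, R_max, s, a, M = 2.0, 25.0, 0.6, 0.0, 30   # s = 0.30*sigma (critical)
n = np.arange(-M, M + 1)

trials = 20000
labels = n + s * rng.standard_normal((trials, n.size))  # irregular positions
Q = R_max * np.exp(-(a - labels) ** 2 / (2 * sigma**2)) # equation 4.7
R = rng.poisson(Q).astype(float)
a_cg = (labels * R).sum(axis=1) / R.sum(axis=1)         # equation 1.1

# Closed form (our reconstruction of equation 4.8, unit spacing):
pred = (sigma / (4 * np.sqrt(np.pi))) * (
    2**1.5 / R_max + 1 - (1 + s**2 / sigma**2) ** -1.5)
print(a_cg.var(), pred)  # close; at s = 0.30*sigma both terms contribute equally
```

At s = 0.30σ and R_max = 25, response noise and position randomness contribute about equally, as stated in the text.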
Recently, linear estimators have been proposed that do not suffer from such irregularities (Gaal 1993; Salinas and Abbott 1994; Sanger 1994). Although linear, these estimators do come at a price. An attractive feature of center-of-gravity estimation is that the estimator only needs to know the tuning values a_n of the sensors. In these more general linear schemes, however, one needs the overlap (correlation) of the sensor tuning functions to invert a covariance matrix. Thus at least a rough idea of the form of the sensor tuning functions is needed. However, this information can alternatively be used to implement (an approximation of) an ML estimation scheme. In Snippe (1996) I discuss a two-stage implementation of such an ML scheme, both stages being linear in the sensor responses (for approximately gaussian noise statistics). The first stage consists of a parallel parameter likelihood evaluation using a discrete grid of likelihood detectors. The second stage is a likelihood interpolation using weights determined by the first-stage output. Thus I avoid the matrix inversions needed in the above-mentioned linear schemes by using different linear weights for different parts of the parameter range (Seung
and Sompolinsky 1993), instead of the global linear weights used in these fully linear schemes.
5 System Nonlinearities: Consequences for the CG Estimate
Although perception has been profitably modeled using linear systems analysis (Westheimer 1986), nonlinearities abound in biological systems (e.g., Abbott 1994; Douglas and Martin 1991). Thus it is important to assess the robustness of parameter extraction against nonlinearities at various stages in a proposed estimation scheme. In this section we study this robustness for the center-of-gravity (CG) estimation scheme 1.1. First, suppose that the individual sensor outputs R_n pass through a nonlinearity k: R_n → K_n = k(R_n). Suppose further that the system is described by the regular array studied in Section 2, so that equation 1.1 yields an unbiased estimate of a when evaluated using the untransformed outputs R_n (Baldi and Heiligenberg 1988):

E[â_CG] = E[Σ_n a_n R_n / Σ_n R_n] = a    (5.1)
We rewrite this as E[Σ_n (a_n - a) R_n] = Σ_n (a_n - a) Q_n(a) = 0. Now, when the actual, nonlinearly transformed sensor outputs K_n = k(R_n) are used by the system evaluating the center-of-gravity estimate, this will still yield an unbiased estimate of a if E[Σ_n (a_n - a) K_n] = Σ_n (a_n - a) H_n(a) = 0, in which H_n(a) is defined as E[k(R_n)]. If Q_n(a) is symmetric around a_n - a = 0, H_n(a) also will be symmetric around a_n - a = 0, and thus the sum Σ_n (a_n - a) H_n(a), when evaluated as an integral, will yield zero. Thus under these conditions the center-of-gravity scheme, when applied to the nonlinearly transformed sensor outputs K_n, still yields an unbiased estimate of a. Deviations from this result due to tuning functions that are not perfectly symmetric, finite sensor sampling, etc., are modest; hence we conclude that center-of-gravity estimation is robust against nonlinearities in the outputs of individual sensors in the array. Now, however, suppose that the summations in equation 1.1 pass through a nonlinearity F. This results in an estimate

â = F(Σ_n a_n R_n) / F(Σ_n R_n)    (5.2)

that can be very different from the original estimate 1.1. This can be shown by rearranging equation 1.1 as â_CG Σ_n R_n = Σ_n a_n R_n; thus

â = F(â_CG Σ_n R_n) / F(Σ_n R_n)    (5.3)
Parameter Extraction from Population Codes
A simple example of the inequality occurs if the nonlinearity F is a power function, F(x) = x^γ. Then

â_F = (Σ_n a_n R_n)^γ / (Σ_n R_n)^γ = â^γ
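This power-law bias is easy to check numerically. The following sketch is not from the article: the uniform grid of tuning values a_n, the gaussian tuning curves, and the stimulus value are all illustrative assumptions. It applies F(x) = x^γ after the two summations of equation 1.1 and confirms that the result equals â^γ:

```python
import math

# Hypothetical regular sensor array: tuning values a_n on a uniform grid,
# with noise-free gaussian tuning curves (illustrative choices).
tuning_values = [0.1 * n for n in range(-50, 51)]
stimulus = 0.237  # true parameter value a

def response(a_n, a, width=0.5):
    # noise-free gaussian tuning curve centered on a_n
    return math.exp(-((a_n - a) ** 2) / (2 * width ** 2))

R = [response(a_n, stimulus) for a_n in tuning_values]

# Center-of-gravity estimate (equation 1.1): a_hat = sum(a_n R_n) / sum(R_n)
num = sum(a_n * r for a_n, r in zip(tuning_values, R))
den = sum(R)
a_hat = num / den

# Now let both summations pass through F(x) = x**gamma (equation 5.2)
gamma = 2.0
a_hat_F = num ** gamma / den ** gamma

print(a_hat)                     # close to the true stimulus value
print(a_hat_F, a_hat ** gamma)   # the biased estimate equals a_hat**gamma
```

Applying the inverse F⁻¹(y) = y^(1/γ) to `a_hat_F` recovers the unbiased estimate, in line with the correction discussed below.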
In this special case the deviation could easily be corrected by inserting the inverse function F⁻¹(y) = y^(1/γ) at the system's output end. Also for a general nonlinearity F, a suitable inverse operation on the center-of-gravity estimate 5.2 would again lead to a bias-free estimate. However, it is not at all clear how this could be accomplished when confounding parameters are present, say when the stimulus contains an (unknown) intensity parameter I that affects the sensor responses. In this case, for a general nonlinearity F, the form of the inverse operation depends on the (unknown) value of the confounding parameter I. Thus we conclude that, in general, nonlinearities in the summations in 1.1 generate estimates â that deviate from the physical value a in ways that cannot be easily corrected.

Why do these deviations arise? Because the center-of-gravity evaluation 1.1 is basically a transformation from a population coding for the parameter a to an intensity coding for this parameter. These coding systems have been defined in the Introduction. Intensity-coded information is much more prone to uncorrected nonlinearities than population-coded information. This argues against the explicit evaluation of quantitative parameters in a neural system, since such parameters are much better protected against the inevitable nonlinearities when they remain in an implicit, population-coded form. See Lehky and Sejnowski (1990), Fetz (1992), and Salinas and Abbott (1995) for similar conclusions.

6 Conclusions
Although the center-of-gravity extraction scheme 1.1 is simple and elegant, our present analysis casts doubt on its applicability in neural systems. First, we showed that quite often the statistical efficiency of the center-of-gravity extraction scheme will be low. It could be argued that this is not a problem, since parameter estimation thresholds, as observed in experimental psychophysics, are consistent with a center-of-gravity scheme using realistic numbers of sensors (Vogels 1990; Zohary 1992). Thus perhaps biological sensory systems do not have to operate near the 100% efficiency level demanded by the Cramér-Rao bound for their owners to survive and replicate. However, empirical determinations of efficiency on an absolute scale (using externally generated noise) are very high in human vision (Burgess and Colborne 1988; Parish and Sperling 1991; see also Pelli 1990 for the case that photon noise is the sole external noise source). Thus perceptual systems certainly can be excellent statisticians, and a simple center-of-gravity extraction scheme may be insufficient to explain such high efficiencies.

Another problem for center-of-gravity estimation (or indeed for any intensity coding scheme) is its sensitivity to system nonlinearities. This may help explain the ubiquity of population coding in neural systems, since, as we showed, such distributed representations are much more robust against nonlinearities. Given these problems, we doubt that center-of-gravity estimation is a realistic description of information extraction in biological systems. On the other hand, center-of-gravity estimation has been suggested as a method for experimental neuroscientists to decode neural activity from relatively few recorded neurons (Salinas and Abbott 1994). Our analysis should help experimentalists judge when the center-of-gravity decoding scheme 1.1 is a realistic way to analyze their data, and when it may be necessary to use more sophisticated methods, such as maximum likelihood or Bayesian estimation, to gauge the information present in a neural system.
Appendix: Proof of Equation 4.5

Remember that, for our choice, a = 0; by symmetry, E[Σ_n a_n R_n] = 0, which slightly simplifies the proof. The step from A.1 to A.2 follows from the Poisson nature of the sensor response noise. The last step follows from the assumption of statistical independence for the sensor location perturbations in equation 4.1.
Acknowledgments
This research was supported by a grant from the EC Human Capital Mobility Fund.

References

Abbott, L. F. 1994. Decoding neuronal firing and modelling neural networks. Quart. Rev. Biophys. 27, 291-331.
Ahumada, A. J. 1991. Learning receptor positions. In Computational Models of Visual Processing, M. S. Landy and J. A. Movshon, eds., pp. 23-34. MIT Press, Cambridge, MA.
Altes, R. A. 1989. An interpretation of cortical maps in echolocating bats. J. Acoust. Soc. Am. 85(2), 934-942.
Baldi, P., and Heiligenberg, W. 1988. How sensory maps could enhance resolution through ordered arrangements of broadly tuned receivers. Biol. Cybern. 59, 313-318.
Bossomaier, T. R. J., Snyder, A. W., and Hughes, A. 1985. Irregularity and aliasing: Solution? Vision Res. 25, 145-147.
Burgess, A. E., and Colborne, B. 1988. Visual signal detection. IV. Observer inconsistency. J. Opt. Soc. Am. A5, 617-627.
Dean, A. F. 1981. The variability of discharge of simple cells in the cat striate cortex. Exp. Brain Res. 44, 437-440.
Deutsch, R. 1965. Estimation Theory. Prentice-Hall, Englewood Cliffs, NJ.
Douglas, R. J., and Martin, K. A. C. 1991. Opening the grey box. Trends Neurosci. 14, 286-293.
Fetz, E. E. 1992. Are movement parameters recognizably coded in the activity of single neurons? Behav. Brain Sci. 15, 679-690.
Gaal, G. 1993. Population coding by simultaneous activities of neurons in intrinsic coordinate systems defined by their receptive field weighting functions. Neural Networks 6, 499-515.
Geisler, W. S. 1984. Physical limits of acuity and hyperacuity. J. Opt. Soc. Am. A1, 775-782.
Geisler, W. S., Albrecht, D. G., Salvi, R. J., and Saunders, S. S. 1991. Discrimination performance of single neurons: Rate and temporal-pattern information. J. Neurophysiol. 66, 334-362.
Georgopoulos, A. P., Kettner, R. E., and Schwartz, A. B. 1988. Primate motor cortex and free arm movements to visual targets in three-dimensional space. II. Coding of the direction of movement by a neuronal population. J. Neurosci. 8(8), 2928-2937.
Hirsch, J., and Miller, W. H. 1987. Does cone positional disorder limit resolution? J. Opt. Soc. Am. A4, 1481-1492.
Hopfield, J. J. 1995. Pattern recognition computation using action potential timing for stimulus representation. Nature (London) 376, 33-36.
Konig, P., Engel, A. K., Roelfsema, P. R., and Singer, W. 1995. How precise is neuronal synchronization? Neural Comp. 7, 469-485.
Konishi, M. 1991. Deciphering the brain's codes. Neural Comp. 3, 1-18.
Konishi, M. 1993. Neuroethology of sound localization in the owl. J. Comp. Physiol. A173, 3-7.
Lee, C., Rohrer, W. H., and Sparks, D. L. 1988. Population coding of saccadic eye movements by neurons in the superior colliculus. Nature (London) 332, 357-360.
Lehky, S. R., and Sejnowski, T. J. 1990. Neural network model of visual cortex for determining surface curvature from images of shaded surfaces. Proc. R. Soc. Lond. B 240, 251-278.
McIlwain, J. T. 1991. Distributed spatial coding in the superior colliculus: A review. Vis. Neurosci. 6, 3-13.
Middlebrooks, J. C., Clock, A. E., Xu, L., and Green, D. M. 1994. A panoramic code for sound location by cortical neurons. Science 264, 842-844.
Miller, J. P., Jacobs, G. A., and Theunissen, F. E. 1991. Representation of sensory information in the cricket cercal sensory system. I. Response properties of the primary interneurons. J. Neurophysiol. 66, 1680-1689.
Olsen, J. F. 1992. High-order auditory filters. Current Opinion Neurobiol. 2, 489-497.
Olsen, J. F., and Suga, N. 1991a. Combination-sensitive neurons in the medial geniculate body of the mustached bat: Encoding of relative velocity information. J. Neurophysiol. 65, 1254-1274.
Olsen, J. F., and Suga, N. 1991b. Combination-sensitive neurons in the medial geniculate body of the mustached bat: Encoding of target range information. J. Neurophysiol. 65, 1275-1295.
Oram, M. W., and Perrett, D. I. 1992. Time course of neural responses discriminating different views of the face and head. J. Neurophysiol. 68, 70-84.
Paradiso, M. A. 1988. A theory for the use of visual orientation information which exploits the columnar structure of striate cortex. Biol. Cybern. 58, 35-49.
Parish, D. H., and Sperling, G. 1991. Object spatial frequencies, retinal spatial frequencies, noise, and the efficiency of letter discrimination. Vision Res. 31, 1399-1415.
Pelli, D. 1990. The quantum efficiency of vision. In Vision: Coding and Efficiency, C. Blakemore, ed., pp. 3-24. Cambridge University Press, Cambridge.
Poggio, T., and Girosi, F. 1990. Networks for approximation and learning. Proc. IEEE 78, 1481-1497.
Robson, J. G., and Troy, J. B. 1987. Nature of the maintained discharge of Q, X, and Y retinal ganglion cells of the cat. J. Opt. Soc. Am. A4, 2301-2307.
Rolls, E. T. 1989. Information processing in the taste system of primates. J. Exp. Biol. 146, 141-164.
Salinas, E., and Abbott, L. F. 1994. Vector reconstruction from firing rates. J. Comp. Neurosci. 1, 89-107.
Salinas, E., and Abbott, L. F. 1995. Transfer of coded information from sensory to motor networks. J. Neurosci. (in press).
Sanger, T. D. 1994. Theoretical considerations for the analysis of population coding in motor cortex. Neural Comp. 6, 29-37.
Seung, H. S., and Sompolinsky, H. 1993. Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. U.S.A. 90, 10749-10753.
Snippe, H. P. 1996. Maximum likelihood estimation of stimulus parameters from sensor ensemble response profiles. Manuscript in preparation.
Snippe, H. P., and Koenderink, J. J. 1992a. Discrimination thresholds for channel-coded systems. Biol. Cybern. 66, 543-551.
Snippe, H. P., and Koenderink, J. J. 1992b. Information in channel-coded systems: Correlated receivers. Biol. Cybern. 67, 183-190.
Softky, W. R., and Koch, C. 1992. Cortical cells should fire regularly, but do not. Neural Comp. 4, 643-646.
Theunissen, F. E., and Miller, J. P. 1991. Representation of sensory information in the cricket cercal system. II. Information theoretic calculation of system accuracy and optimal tuning-curve width of four primary interneurons. J. Neurophysiol. 66, 1690-1703.
Tolhurst, D. J., Movshon, J. A., and Dean, A. F. 1983. The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Res. 23, 775-785.
Vogels, R. 1990. Population coding of stimulus orientation by striate cortical cells. Biol. Cybern. 64, 25-31.
Vogels, R., Spileers, W., and Orban, G. A. 1989. The response variability of striate cortical neurons in the behaving monkey. Exp. Brain Res. 77, 432-436.
Westheimer, G. 1986. Systems analysis of spatial vision. Vision Res. 26, 1-5.
Young, M. P., and Yamane, S. 1992. Sparse population coding of faces in the inferotemporal cortex. Science 256, 1327-1331.
Zohary, E. 1992. Population coding of visual stimuli by cortical neurons tuned to more than one dimension. Biol. Cybern. 66, 265-272.
Received December 7, 1994; accepted August 15, 1995.
Communicated by Anthony Zador
Energy Efficient Neural Codes

William B. Levy and Robert A. Baxter
Department of Neurosurgery, University of Virginia, Charlottesville, VA 22908 USA
In 1969 Barlow introduced the phrase "economy of impulses" to express the tendency for successive neural systems to use lower and lower levels of cell firings to produce equivalent encodings. From this viewpoint, the ultimate economy of impulses is a neural code of minimal redundancy. The hypothesis motivating our research is that energy expenditures, e.g., the metabolic cost of recovering from an action potential relative to the cost of inactivity, should also be factored into the economy of impulses. In fact, coding schemes with the largest representational capacity are not, in general, optimal when energy expenditures are taken into account. We show that for both binary and analog neurons, increased energy expenditure per neuron implies a decrease in average firing rate if energy efficient information transmission is to be maintained.

1 Introduction

A theory of neuronal encoding (Barlow 1959) hypothesizes that minimizing the representational entropy is highly desirable and, therefore, is a driving force in the evolution of neuronal codes. Clearly many neural computations can be interpreted from this perspective. However, from an engineering perspective, natural selection might have to make some compromises. For example, the brain is one of the metabolically most active organs of the body. In children, when a large number of neuronal codes are developing, the brain accounts for up to 50% of the resting total body oxygen consumption (Sokoloff 1989). Because the brain uses so much energy, there is the problem of energy efficient use of neurons. Thus, neuronal representations may be optimal neither for the information they carry nor for the energy they use; rather there might be some sort of optimal compromise. Here we will examine this combination of concerns. One of the major goals of neuroscience is to understand the codes used by the brain for making sense of the endless stream of signals encountered as it interacts with the environment.
Neural Computation 8, 531-543 (1996) © 1996 Massachusetts Institute of Technology

One pivotal design consideration for any information processing system is the representational capacity of the coding scheme, i.e., the number of distinct patterns that can be represented. Ever since Barlow's introduction of the economy of impulses (Barlow 1969), researchers have tended to concentrate on codes that minimize redundant information and maximize representational capacity (Atick 1992; Adelsberger-Mangan and Levy 1992; Foldiak 1990; Redlich 1993). We (Levy 1985) certainly agree with Barlow that low-redundancy codes are desirable, but pure coding considerations (e.g., redundancy or statistical dependence) may not be the sole parameter driving the evolution of neural codes (Barlow 1959). Here, we promote the hypothesis that energy expenditures should also be factored into the economy of impulses.¹ Consideration of the relative metabolic cost of generating an action potential versus inactivity over an equivalent time period leads to a unique maximization of the ratio of representational capacity to energy expended. This maximization thus sets the spike frequency. We consider this maximization for both binary and analog neurons. The binary case is the simplest because each neuron has only two possible states: on or off. For this reason, we first discuss the binary case in Section 2 and leave the analog case for Section 3.

2 Case 1: Binary Neurons
Under the constraint that a fraction of the cells is active on average for any given stimulus, we develop an expression for the representational capacity of a neuronal population composed of n neurons. Next we develop an expression for the energy expended to support such a neural code. Finally, maximization of the ratio of the representational capacity to the energy expended determines how many cells should be active on average (i.e., the firing probabilities).

2.1 Representational Capacity. We first obtain an expression for the representational capacity of a population of neurons with a fraction of the cells firing. This result generalizes to the case in which any number of cells is firing, provided p represents the average fraction of cells firing or the firing probability of any cell in the neuronal population. In distributed coding schemes with a fixed number of cells active for any given stimulus, a critical question is: How many coding cells should be active relative to the number of cells available? For example, the so-called "grandmother" coding scheme allows only one cell to be active for any given stimulus and has a very limited representational capacity: the number of cells available is the number of distinct patterns that can be represented. In contrast, if half of all available cells are active on average, then the representational capacity is maximized, assuming the neurons are binary signaling devices. Thus, for a coding scheme with a fixed number of cells active, the optimum coding strategy (in the sense of capacity) requires half of all available cells to fire for any given stimulus.

To obtain an expression for the representational capacity of a population of neurons with a specified fraction of the cells firing on average, we begin with two simplifying assumptions. First, we assume that each cell in a population of neurons acts like a binary signaling device. Second, we assume that the number of neurons is large. The maximum representational capacity of a population of n cells with binary outputs is 2^n codewords, or n bits, when the number of active cells is unrestricted. The representational capacity of a population of n cells with only np cells active over a given time interval τ is n!/[(np)!(n − np)!] codewords, which is simply the expression for the number of all possible codes of length n with exactly np cells active, where p is any number on the closed interval [0, 1] such that np is an integer. The representational capacity expressed in bits per unit time is obtained by taking the base 2 logarithm of the number of codewords:

C(n, np) = log2 {n!/[(np)!(n − np)!]} = nH(p) − log2(σ√(2π)) ≈ nH(p)    (2.1)

where H(p) is Shannon's entropy function of a binary event with probability p, H(p) = −p log2 p − (1 − p) log2(1 − p), and σ is the standard deviation of the binomial distribution, i.e., σ = √(np(1 − p)). The simplification C(n, np) ≈ nH(p) is based on two approximations. First, we used Stirling's formula (Mathews and Walker 1970) to compute the factorials, e.g., n! is approximately √(2π) n^(n+1/2) e^(−n). For n ≥ 100, this approximation yields less than 0.1% error. Second, since n is quite large in neuronal networks, the term log2(σ√(2π)) is dropped because, as n becomes increasingly large, this term grows only logarithmically in n whereas nH(p) grows linearly in n. If the code word length is allowed to vary as a binomial distribution around a mean value, the same result is obtained.

¹We thank the reviewers for pointing out that Softky and Kammen (1991) mentioned this hypothesis previously.
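These counting claims are easy to check numerically. A minimal sketch (the values of n and p are illustrative, not from the article) compares the grandmother code with a half-active code, and compares the approximation of equation 2.1 with the exact base 2 logarithm of the binomial coefficient:

```python
import math

def H(p):
    # Shannon entropy of a binary event, in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n = 1000

# Grandmother code: one active cell per stimulus -> only n distinct patterns.
# Half-active code: n!/[(n/2)!(n/2)!] patterns, astronomically more.
grandmother = n
half_active = math.comb(n, n // 2)

# Equation 2.1: log2(number of codewords) is close to nH(p) for large n,
# and very close once the log2(sigma * sqrt(2*pi)) correction is kept.
p = 0.1
exact = math.log2(math.comb(n, int(n * p)))
approx = n * H(p)
sigma = math.sqrt(n * p * (1 - p))
corrected = approx - math.log2(sigma * math.sqrt(2 * math.pi))

print(grandmother, half_active)
print(exact, approx, corrected)
```

With n = 1000 the half-active code has roughly 10^299 codewords versus 1000 for the grandmother code, and the corrected Stirling estimate matches the exact capacity to well under a bit.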
2.2 Energy Expenditure. Next we need an expression for the energy expended per unit time when a code uses an average of np neurons active with the remaining n(1 − p) neurons inactive. Assume that a resting neuron expends one energy unit in one time unit τ, and that it expends r units of energy when it fires in a time interval τ. Then the average amount of energy expended per time unit for a code with np cells active on average is

E = n(1 − p) + npr = n[1 + p(r − 1)]    (2.2)
We have left r as a parameter since the value of r will vary from one neuronal type to another. However, existing data can be used to estimate a reasonable range of values. Sokoloff's laboratory has estimated energy consumption in the rat by measuring glucose utilization in the superior cervical ganglion as a function of stimulation frequency in peripheral sympathetic nerves (Yarowsky et al. 1985; Yarowsky et al. 1983) and in the spinal cord (Kadekaro et al. 1985). Because of the poor frequency-following characteristic of the sympathetic nervous system, we have used only the low-frequency portion of their curves to estimate the value of r. In the case of the carotid nerve, the value of r is approximately 40; for the spinal cord, r is much higher, approximately 160. Earlier experiments by Ritchie and Straub (1980), in which oxygen consumption and phosphate efflux were measured in sensory nerve fibers, yield estimates of r of approximately 30 and 75, depending upon which of the two measures is used. Therefore, it seems reasonable to hypothesize that values of r range from 10 to 200.

2.3 Maximizing C/E. We now form the ratio of representational capacity to energy expended, expressed in bits per energy unit, as
C/E = nH(p) / {n[1 + p(r − 1)]} = H(p) / [1 + p(r − 1)]    (2.3)

and determine what values of p maximize this ratio. We refer to networks that maximize this ratio as energy efficient. Note that the time interval assumed for both C and E cancels and, because n also cancels, that the approximation for this ratio is independent of n. The lack of dependence on n implies that the maximization of C/E can occur locally (i.e., on a neuron-by-neuron basis). To see that this ratio makes sense, consider the improbable case in which the metabolic cost of generating an action potential is the same as the cost of staying at rest, i.e., r = 1. For this case, the maximum of C/E occurs at p = 0.5, i.e., when half of the neurons are active on average. This is the value of p that maximizes the representational capacity exclusively. As r increases, however, the optimal value of p decreases. The value of C/E is graphed as a function of p for r = 1, 10, and 100 in Figure 1. The maximum C/E occurs at p = 0.16 for r = 10, p = 0.03 for r = 100, and p = 0.005 for r = 1000. Figure 2 shows how the value of p that maximizes C/E decreases as a function of r. Since our calculation from the energy consumption data implies that r ranges from 10 to 200, the value of p that maximizes C/E should range from 0.16 to 0.02. For many cells in the brain, 400 Hz can be used as a reasonable estimate of the maximum firing frequency. Then, with an assumed maximal frequency of 400 Hz, the implied average activity levels range from 64 down to 8 Hz. We note that this range of activity levels is quite close to experimental observations (Sharp and Green 1994).
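The optimization above can be reproduced with a simple grid search over p (the grid resolution below is an arbitrary choice, not from the article), using C/E = H(p)/[1 + p(r − 1)] from equations 2.1-2.3:

```python
import math

def H(p):
    # binary entropy in bits
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def capacity_per_energy(p, r):
    # equation 2.3: C/E = H(p) / [1 + p(r - 1)]
    return H(p) / (1 + p * (r - 1))

def optimal_p(r, steps=100000):
    # grid search for the firing probability p in (0, 0.5] maximizing C/E
    grid = [0.5 * i / steps for i in range(1, steps + 1)]
    return max(grid, key=lambda p: capacity_per_energy(p, r))

for r in (1, 10, 100):
    print(r, round(optimal_p(r), 3))
```

This search should reproduce the optima quoted in the text: p = 0.5 for r = 1, and roughly 0.16 and 0.03 for r = 10 and r = 100.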
Figure 1: The ratio of representational capacity to energy expenditure (C/E) is a concave function of the average firing probability (p). Here we plot three such curves, corresponding to three different values of r, the relative energy expended for an action potential compared to remaining at rest. Note how the maxima across curves decrease as action potentials become more energetically expensive.

3 Case 2: Analog Neurons
In the binary case, we assumed there were n cells with np cells active. We then obtained expressions for the representational (information) capacity of the n neurons and their energy expenditure. From this result, we could simply generalize the notion of "active cells" to cells with firing frequencies greater than some threshold frequency (perhaps corresponding to the spontaneous firing frequency). However, this extension is trivial since the code remains binary. A much more interesting and useful extension is to consider neurons with firing frequencies that span a known range. The firing frequencies of such analog neurons are thought of as their activity levels. Therefore, in the case of analog neurons, we seek to maximize the efficiency given the mean firing frequency.

Figure 2: As the relative energy expended for an action potential compared to remaining at rest (r) increases, the optimal firing probabilities for maximizing the ratio of representational capacity to energy expended (C/E) decrease. This decrease is approximately linear on logarithmic scales. The filled circles in this graph represent maxima of C/E for different values of r. These maxima were determined using the bisection method to solve for the roots of ∂(C/E)/∂p = 0.

3.1 Representational Capacity. We start by specifying a time interval T over which the spike frequency is determined. Each of the n cells has a firing frequency f_i determined by j/T, where j is the spike count over the time interval T. The representational capacity is infinite if any number of spikes can be generated within the interval T, regardless of the value of T, provided T is nonzero. However, for biological neurons we know that j has an upper bound N, which results in an upper bound on the f_i's,
say f_N. Therefore, we must determine the representational capacity of n neurons, each of which can take on only the discrete firing frequencies f_i \in \{0, 1/T, 2/T, \ldots, N/T\}. Let the probability that a neuron emits j spikes in the interval T be denoted by p_j. Then the representational capacity per time interval T of neuron i is given by

C_i = -\sum_{j=0}^{N} p_j \log_2 p_j \quad (3.1)

Using arguments similar to those in Section 2, the representational capacity of n such neurons is C \approx nC_i. Note that C has an upper bound of n \log_2(N+1) bits. We must now express C_i as a function of the average firing frequency of the n neurons. Let the average spike count in the interval T be denoted by \mu, i.e., \mu = E[j]. We need the discrete probability distribution p_j. The principle of maximum entropy provides the most appropriate distribution given the available information (Jaynes 1957). Because we have the mean, the distribution that maximizes the entropy with no additional information about p_j is the geometric (see the Appendix for a proof). The geometric distribution can be written as a function of its mean, \mu, as

p_j = \frac{\mu^j}{(1+\mu)^{j+1}} \quad (3.2)

Substituting the geometric distribution into equation 3.1 yields

C_i = -\mu \log_2 \mu + (1+\mu) \log_2(1+\mu) \quad (3.3)

where we have let N \to \infty. Note that the functional forms of C_i for the binary and analog cases (equations 2.1 and 3.3) are similar except for the signs in the second term. We can relate \mu to the mean firing frequency in Hz, f, by the expression \mu = fT, which allows C_i to be expressed in terms of the mean firing frequency and the averaging time period as

C_i(f, T) = -fT \log_2(fT) + (1 + fT) \log_2(1 + fT) \quad (3.4)
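The capacity formula above is easy to check numerically. The short sketch below (the function names are ours, not the paper's) evaluates equation 3.4 and confirms that it agrees with the entropy of a geometric spike-count distribution with mean \mu = fT:

```python
import math

def capacity_per_neuron(f, T):
    """Representational capacity C_i(f, T) of one analog neuron (equation 3.4),
    i.e., the entropy of a geometric spike-count distribution with mean mu = f*T."""
    mu = f * T
    if mu == 0:
        return 0.0
    return -mu * math.log2(mu) + (1 + mu) * math.log2(1 + mu)

def geometric_entropy(mu, n_terms=500):
    """Direct entropy of the geometric distribution p_j = mu^j / (1+mu)^(j+1)
    (equation 3.2), truncated at n_terms (the tail is negligible)."""
    h = 0.0
    for j in range(n_terms):
        p = (mu ** j) / ((1 + mu) ** (j + 1))
        if p > 0:
            h -= p * math.log2(p)
    return h

# Example: f = 40 Hz, T = 25 msec, so mu = 1 expected spike per interval.
f, T = 40.0, 0.025
print(capacity_per_neuron(f, T))   # 2.0 bits per interval
print(geometric_entropy(f * T))    # matches equation 3.4
```

With \mu = 1 the geometric distribution is p_j = 1/2^{j+1}, whose entropy is exactly 2 bits, matching equation 3.4.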
3.2 Energy Expenditure. Because it takes energy to repolarize the membrane after an action potential, a neuron uses more energy when firing action potentials than when inactive. For the purposes of our first-order estimate here, and to be consistent with the binary case, we will assume that the energy needed by a neuron increases linearly with its frequency of firing. Specific energy consumption values are not needed. In this case, consider the energy expended at rest relative to the energy expended after an action potential occurs. That is, assume that the energy consumed by a quiescent neuron is one unit and that energy consumption increases linearly with mean firing rate,
William B Levy and Robert A. Baxter
538
Under this assumption, the energy expended by neuron i can be expressed as E_i = 1 + kf_i. Assuming that the mean firing rate of each neuron is equivalent to the mean firing rate of the population of neurons, the energy expended by all neurons is

E = \sum_{i=1}^{n} (1 + kf_i) = n(1 + kf) \quad (3.5)

where, as in equation 3.3, f denotes the mean firing frequency of all cells. Dimensional analysis and a comparison of equations 2.2 and 3.5 reveals that k = (r-1)/f_N. Using the same reasoning as in Section 2.2, experimental values of r should range from 10 to 200. As will become apparent in the next subsection, the value of r affects the optimization calculations in a manner similar to the binary case.

3.3 Maximizing C/E. The expression for C/E in the analog neuron case follows from equations 3.4 and 3.5 of the previous two subsections and the approximation C \approx nC_i,

C/E \approx C_i(f, T)/(1 + kf)

and is independent of the number of neurons, as in the binary case. Note that the corresponding equation for the binary case, equation 2.3, is similar in structure. Curves of C/E as a function of f with T = 25 msec for various values of r are shown in Figure 3 and have shapes similar to those of the binary case shown in Figure 1. The curves in Figure 3 show that a 10-fold increase in the value of r results in more than a 5-fold decrease in the optimal mean frequency. If we set the maximum firing frequency to 400 Hz and let T = 25 msec, then N = 10. For r = 10, the value of f that maximizes C/E is 43 Hz; for r = 200, the optimal frequency is 6 Hz. These values span a narrower range than the 8-64 Hz range cited in the binary case. Even so, the overlap between the ranges is remarkable, implying that this theory is relatively independent of the type of neural code. From a physiological viewpoint, we can associate the averaging time with the time constant of the neurons. Because the time constant of neurons is sure to vary and is still a matter of conjecture for animals in their natural environments, we calculated the optimal mean frequencies, as a function of r, for averaging times of 10, 25, and 50 msec. It is apparent from Figure 4 that the averaging time has little effect on the result. Shorter averaging times simply raise the optimal mean frequencies.
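The optimization just described can be reproduced with a simple numerical sketch. The choices below follow the text (f_N = 400 Hz, k = (r-1)/f_N, and the N \to \infty capacity of equation 3.4); a fine grid scan stands in for the bisection on \partial(C/E)/\partial f used in the paper:

```python
import math

def capacity(f, T):
    # Equation 3.4: entropy of the geometric spike-count distribution, mu = f*T.
    mu = f * T
    return 0.0 if mu == 0 else -mu * math.log2(mu) + (1 + mu) * math.log2(1 + mu)

def c_over_e(f, T, r, f_max=400.0):
    # Per-neuron efficiency C_i / (1 + k f), with k = (r - 1)/f_max (equation 3.5).
    k = (r - 1.0) / f_max
    return capacity(f, T) / (1.0 + k * f)

def optimal_frequency(r, T=0.025, f_max=400.0):
    # Fine grid scan over 0.1 .. 400 Hz in place of bisection on d(C/E)/df.
    freqs = [0.1 * i for i in range(1, 4001)]
    return max(freqs, key=lambda f: c_over_e(f, T, r, f_max))

print(optimal_frequency(10))    # ~43 Hz
print(optimal_frequency(200))   # ~5 Hz under the N -> infinity approximation
                                # (the paper's discrete calculation gives 6 Hz)
```

As in the text, a 10-fold to 20-fold increase in r drives the optimal mean frequency down by roughly an order of magnitude.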
Figure 3: Energy efficiency curves for the analog case. The ratio of representational capacity to energy expenditure (C/E) is a concave function of the average firing frequency (f). Here we plot three such curves corresponding to three different values of r, the relative energy expended for an action potential compared to remaining at rest (with an assumed averaging time period of 25 msec and an assumed maximum firing frequency of 400 Hz). Note how the maxima across curves decrease as action potentials become more energetically expensive (i.e., as r increases). These curves are similar in shape to those of the binary case (see Fig. 1).

4 Summary

Given a population of n binary neurons with np neurons firing on average, the coding scheme with p = 0.5 has the largest representational capacity; that is, on average, half of the neurons are active for any given stimulus. We have shown, however, that this maximum capacity coding scheme is not the most energy efficient scheme. Binary codes that maximize representational capacity and minimize energy expenditure use 2-16% of the number of cells available to fire for any given stimulus, which is considerably less than half of the available neurons. In the case of analog neurons, the mean firing frequency that maximizes the ratio of the representational capacity to the energy expended is a function of the time interval over which the frequency is determined
and the maximum possible firing frequency. With an averaging time interval of 25 msec and a maximum firing frequency of 400 Hz, the optimal mean frequency ranges from 6 to 43 Hz, depending on the metabolic cost of an action potential. Our results tend to support the notion that as neuronal firing becomes energetically more expensive, lower firing rates become more beneficial. However, our results also show that very low firing rates can be detrimental to energy efficiency. Most importantly, our results support the notion of an energy efficient firing rate. Neurons firing above or below this rate transmit information less efficiently.

Figure 4: As the relative energy expended for an action potential compared to remaining at rest (r) increases, the optimal firing frequencies for maximizing the ratio of representational capacity to energy expended (C/E) decrease. This decrease is approximately linear on logarithmic scales. The three curves correspond to three different values of the averaging time interval, T. As T increases, the optimal firing frequencies decrease. The filled circles in this graph represent maxima of C/E for different values of r.

Having made these observations relating efficiency (C/E) to the relative metabolic cost of generating an action potential (r), and having computed firing rates that are efficient, we might compare our results to values actually found in the brain. To make this comparison, it is important to consider a natural situation. Cell firing in anesthetized animals and cell firing triggered by experimenter-controlled stimulus presentation seem inappropriate. We are interested in cell firing as an animal behaves more or less naturally in a setting similar to its native environment. Although it is still not possible to measure the activities of a statistically large number of cells simultaneously, open field spatial mapping studies of limbic system activity seem most relevant. In such studies animals are allowed to wander through the environment and cell firing occurs at each place in the open field. In the rat subiculum (a region of the limbic system) there are two types of firing frequencies noted for four types of neurons (Sharp and Green 1994). One type of neuron has an average firing frequency of 9 Hz and is presumably the primary coding element for head direction and where the animal is going. Another class of neurons, presumably interneurons, is characterized by relatively high firing rates (about 37 Hz). Of all cells measured, significantly fewer cells belonged to this second class. Assuming both cell types have similar metabolic costs, this may be interpreted as a confirmation that nature favors energy efficient cells. The observed firing rates of the primary cells fall at the lower end of our computed optimum range while those of the interneurons are near the higher end of our computed values. It would be remarkable if we were fortunate enough to pick exactly the two parameters that nature has chosen to optimize. Undoubtedly, there are other concerns that nature had in mind when it evolved the firing frequencies of various neuronal types.
Here, we provide a plausible explanation of why and how the evolution of the brain may have been constrained by energy efficiency as well as statistical dependencies and representational entropies. We have also shown how these constraints relate to observed neural firing frequencies. Besides enhancing our understanding of the tradeoffs between representational capacity and energy conservation that nature has resolved, this work may be of practical importance. It offers a rationale for designing networks that, instead of maximizing representational capacity, provide an optimal compromise between capacity and energy use. For example, networks with distributed codes that use a large percentage of the available neurons have large representational capacities but may require significant energy, while networks with grandmother codes (e.g., nearest-neighbor networks) generally use less energy but have lower representational capacities. Furthermore, although we have focused on the tradeoff between representational capacity and energy expenditure in biological systems, this optimization approach may also be useful in the design of massively parallel systems where energy consumption and heat dissipation are critical concerns.
Appendix In this appendix, we provide a proof that the geometric distribution given in equation 3.2 is the distribution that maximizes the entropy. The maximum entropy distribution has the form (Jaynes 1957)

p_j = e^{-\lambda_0 - \lambda_1 j} \quad (A.1)

with the restriction \sum_{j=0}^{\infty} p_j = 1. We will assume that N \to \infty. Let t = e^{-\lambda_1}. Then

\sum_{j=0}^{\infty} p_j = e^{-\lambda_0} \sum_{j=0}^{\infty} t^j = \frac{e^{-\lambda_0}}{1-t} = 1 \quad (A.2)

so that p_j = (1-t)t^j. Since we are given the mean of the distribution, E[j] = \mu, we know that

\mu = \sum_{j=0}^{\infty} j p_j = (1-t) \sum_{j=0}^{\infty} j t^j \quad (A.3)

By taking the derivative of 1/(1-t), we obtain the relation

\sum_{j=0}^{\infty} j t^{j-1} = \frac{1}{(1-t)^2} \quad (A.4)

which, upon substitution into the expression for \mu, yields

\mu = \frac{t}{1-t} \quad (A.5)

Solving for t in terms of \mu and substituting into the expression for p_j yields the geometric distribution.
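The maximum entropy property can also be checked numerically. The sketch below builds the geometric distribution for \mu = 1, then applies a perturbation that preserves both the normalization and the mean (adding mass at j = 0 and j = 2 while removing twice that mass at j = 1) and confirms that the entropy decreases:

```python
import math

def geometric(mu, n=200):
    # p_j = mu^j / (1+mu)^(j+1): the maximum entropy distribution for a given mean (eq 3.2).
    return [(mu ** j) / ((1 + mu) ** (j + 1)) for j in range(n)]

def entropy(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

def mean(p):
    return sum(j * q for j, q in enumerate(p))

mu = 1.0
p = geometric(mu)
h_geo = entropy(p)

# Perturbation preserving both constraints: delta*(+1, -2, +1) at j = 0, 1, 2
# changes the total mass by 0 and the mean by 0*1 - 1*2 + 2*1 = 0.
delta = 0.01
q = p[:]
q[0] += delta
q[1] -= 2 * delta
q[2] += delta

print(abs(mean(q) - mean(p)) < 1e-9)   # True: mean unchanged
print(entropy(q) < h_geo)              # True: entropy strictly decreases
```

Any other mean-preserving perturbation behaves the same way, since the geometric distribution is the unique maximizer under the mean constraint.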
Acknowledgments This research was supported in part by NIH MH00622, MH48161, and Electric Power Research Institute (EPRI) RP8030-08 to W.B.L., Baxter Research, and the Department of Neurological Surgery at the University of Virginia, Dr. John A. Jane, Chairman.
References

Adelsberger-Mangan, D. M., and Levy, W. B. 1992. Information maintenance and statistical dependence reduction in simple neural networks. Biol. Cybern. 67, 469-477.
Atick, J. J. 1992. Could information theory provide an ecological theory of sensory processing? Network 3, 213-251.
Barlow, H. B. 1959. In Symposium on the Mechanization of Thought Processes, No. 10, pp. 535-539. H. M. Stationery Office, London.
Barlow, H. B. 1969. Trigger features, adaptation and economy of impulses. In Information Processing in the Nervous System, K. N. Leibovic, ed., pp. 209-226. Springer-Verlag, New York.
Cover, T. M., and Thomas, J. A. 1991. Elements of Information Theory. John Wiley, New York.
Foldiak, P. 1990. Forming sparse representations by local anti-Hebbian learning. Biol. Cybern. 64, 165-170.
Jaynes, E. T. 1957. Information theory and statistical mechanics I, II. Phys. Rev. 106, 620-630; 108, 171-190.
Kadekaro, M., Crane, A. M., and Sokoloff, L. 1985. Differential effects of electrical stimulation of sciatic nerve on metabolic activity in spinal cord and dorsal root ganglion in the rat. Proc. Natl. Acad. Sci. U.S.A. 82, 6010-6013.
Levy, W. B. 1985. An information/computation theory of hippocampal function. Soc. Neurosci. Abstr. 11, 493.
Mathews, J., and Walker, R. L. 1970. Mathematical Methods of Physics, 2nd ed. Benjamin/Cummings, Menlo Park, CA.
Redlich, A. N. 1993. Supervised factorial learning. Neural Comp. 5, 750-766.
Ritchie, J. M., and Straub, R. W. 1980. Oxygen consumption and phosphate efflux in mammalian non-myelinated nerve fibers. J. Physiol., London 304, 109-121.
Sharp, P. E., and Green, C. 1994. Spatial correlates of firing patterns of single cells in the subiculum of the freely moving rat. J. Neurosci. 14(4), 2339-2356.
Softky, W. R., and Kammen, D. M. 1991. Correlations in high dimensional or asymmetric data sets: Hebbian neuronal processing. Neural Networks 4, 337-347.
Sokoloff, L. 1989. Circulation and energy metabolism of the brain. In Basic Neurochemistry: Molecular, Cellular, and Medical Aspects, 4th ed., G. J. Siegel, B. W. Agranoff, R. W. Albers, and P. B. Molinoff, eds., pp. 565-590. Raven Press, New York.
Yarowsky, P., Kadekaro, M., and Sokoloff, L. 1983. Frequency-dependent activation of glucose utilization in the superior cervical ganglion by electrical stimulation of cervical sympathetic trunk. Proc. Natl. Acad. Sci. U.S.A. 80, 4179-4183.
Yarowsky, P., Crane, A., and Sokoloff, L. 1985. Metabolic activation of specific postsynaptic elements in superior cervical ganglion by antidromic stimulation of external carotid nerve. Brain Res. 334, 330-334.
Received January 16, 1995; accepted May 30, 1995.
Communicated by Nancy Kopell
Coupling the Neural and Physical Dynamics in Rhythmic Movements Nicholas G. Hatsopoulos* Division of Biology, CNS Program, MS 139-74, Caltech, Pasadena, CA 91125 USA
A pair of coupled oscillators simulating a central pattern generator (CPG) interacting with a pendular limb was numerically integrated. The CPG was represented as a van der Pol oscillator and the pendular limb was modeled as a linearized, hybrid spring-pendulum system. The CPG oscillator drove the pendular limb while the pendular limb modulated the frequency of the CPG. Three results were observed. First, sensory feedback influenced the oscillation frequency of the coupled system. The oscillation frequency was lower in the absence of sensory feedback. Moreover, if the muscle gain was decreased, thereby decreasing the oscillation amplitude of the pendular limb and indirectly lowering the effect of sensory feedback, the oscillation frequency decreased monotonically. This is consistent with experimental data (Williamson and Roberts 1986). Second, the CPG output usually led the angular displacement of the pendular limb by a phase of 90° regardless of the length of the limb. Third, the frequency of the coupled system tuned itself to the resonant frequency of the pendular limb. Also, the frequency of the coupled system was highly resistant to changes in the endogenous frequency of the CPG. The results of these simulations support the view that motor behavior emerges from the interaction of the neural dynamics of the nervous system and the physical dynamics of the periphery.

1 Introduction
Traditional open-loop and closed-loop control schemes both assume the origins of rhythmic motor behavior reside in the nervous system in the form of central pattern generators (CPGs) (Delcomyn 1980). The spatial and temporal characteristics of the movement are dictated by central commands or reference signals, which are followed passively by the peripheral musculoskeletal system. In closed-loop schemes, afferent feedback acts to reduce any error between the reference signal and the actual position or velocity of the plant (Merton 1953; Stein 1982). Nevertheless,
*Correspondence should be sent to Nicholas G. Hatsopoulos, Department of Neuroscience, Box 1953, Brown University, Providence, RI 02912.
Neural Computation 8, 567-581 (1996) © 1996 Massachusetts Institute of Technology
the frequency and amplitude of the movement are centrally dictated by a series of set points. Many open-loop and closed-loop schemes, however, suffer from inflexibility. They are unable to adapt to changing internal and external conditions that affect the musculoskeletal system. For example, the skeletal frame of a human can grow by over a factor of three from infancy to adulthood. Likewise, the mass of the body will increase by over one order of magnitude. Moreover, the body is exposed to transient loads when carrying objects or sustaining perturbations. In addition, the torque-generating capabilities of muscle can vary with time. Muscle can become more effective with exercise or less effective with injury or lack of use. All of these changes will alter the relationship between the neural signals emanating from the CPG and the resulting torques and movements generated about the joints. The Russian physiologist Bernstein (1967) recognized this problem and proposed an approach to motor control that emphasized the exploitation of physical dynamics in movement generation (Schneider et al. 1989, 1990; Hoy and Zernicke 1985, 1986). This approach leads to the idea that stable movement patterns emerge from an interaction between the neural dynamics of the nervous system and the physical dynamics of the musculoskeletal system (Hatsopoulos 1992; Hatsopoulos et al. 1992). According to this approach, sensory afferents from the muscles, joints, and skin act to couple the physical dynamics to the neural dynamics just as the muscles couple the neural dynamics to the physical dynamics. There are two important properties of motor rhythms that support this idea. First, the frequency of the motor rhythm can be modulated by changes in the strength of sensory feedback. 
In many systems, the frequency of a motor rhythm is higher with sensory feedback under intact conditions than it is without feedback (Wilson 1961; Grillner and Zangger 1979; Wallen and Williams 1984; Williamson and Roberts 1986; Mos and Roberts 1994). Wilson (1961) found that the wing beat frequency of the locust decreased gradually as more sensory nerves from the wings were cut. Wing beat frequency decreased by as much as a factor of two after complete deafferentation. Williamson and Roberts (1986) showed that the frequency of the dogfish swimming rhythm decreased as the extent of sensory feedback was indirectly decreased. This was accomplished by applying curare to the muscles controlling the movement. The second and more important property is that the frequency of a motor rhythm depends on the inertial and gravitational properties of the periphery. In particular, the frequency of a motor rhythm will very often scale with or match the resonant frequency of the musculoskeletal system.1 The property of resonance tuning has been demonstrated in both observational field studies and in more formal experimental research (Kugler and Turvey 1987). Alexander and Jayes (1983) proposed an allometric model of walking that predicted a "pendular" scaling between the frequency of walking and body height: walking frequency under "normal" conditions should scale with the square root of the reciprocal of the body height. This prediction was borne out by observing the stepping frequency of a number of animals in the wild (Pennycuick 1975). More recently, Holt et al. (1990) asked subjects to walk with ankle weights of different masses and showed that they scale their preferred frequency with the resonant frequency of the leg. Kugler and Turvey (1987) noted that both the stiffness of the joint and the physical properties of the limb (and any objects carried by the limb) will affect the resonant frequency of the periphery. They proposed a hybrid spring-pendulum model characterized by a forced, linear, second-order differential equation with a sinusoidal forcing function whose amplitude and frequency are T_0 and \omega, respectively:

I\ddot{\theta} + c\dot{\theta} + (mgl_{cm} + k)\theta = T_0 \sin \omega t \quad (1.1)

where I is the moment of inertia of the limb, \theta is the angular displacement of the pendulum, c is the damping coefficient, m is the mass of the system, g is the acceleration of gravity, l_{cm} is the distance of the center of mass from the axis of rotation, and k is the stiffness of the joint.2 The angular resonant frequency3 of the system is

\omega_r = \sqrt{(mgl_{cm} + k)/I} = \sqrt{g/L_{eq} + k/I} \quad (1.2)

where L_{eq} is the length of an equivalent simple pendulum whose mass is concentrated at one point. Kugler and Turvey hypothesized that the system tunes itself to a frequency that matches the resonant frequency of this hybrid model. Hatsopoulos and Warren (1995) have recently provided experimental support for this hypothesis. Flight in insects and birds provides additional evidence for this hypothesis. Due to the size and orientation of the wings with respect to gravity, it is assumed that joint stiffness plays a larger role than does gravity, and, therefore, the second term in equation 1.2 will dominate. Sotavalta (1954) performed a number of experiments with moths and cockroaches in which the inertia of the wings was varied by either adding loads to the wings or clipping the wings. The wing beat frequency scaled with the inertia raised to a power ranging from -0.12 to -0.47, which is close to the -0.5 predicted by the hypothesis. In an observational study, Greenewalt (1975) showed that the wing beat frequency of birds was proportional to the wing length raised to the power of -0.91 and higher.

1. The resonant frequency of the musculoskeletal system can also depend on local, sensory feedback loops. Bassler (1983) showed that the closed-loop, femorotibial system of the stick insect possesses a resonant frequency between 1 and 3 Hz. He postulated that the stick insect takes advantage of the resonant frequency by tuning the frequency of its rocking behavior to that frequency.
2. The pendulum is treated as a simple pendulum in these simulations so that l_{cm} = L, where L is the length of the pendulum. Also, I = mL^2.
3. Actually, the resonant frequency is shifted down slightly by the viscosity of the system: \omega_r = \sqrt{(mgl_{cm} + k)/I - c^2/4I^2}.
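Equation 1.2 can be illustrated with a short sketch. Under the simple-pendulum assumptions of footnote 2 (l_cm = L, I = mL^2), the following code (illustrative parameter values only) computes the resonant frequency, shows the pendular 1/sqrt(L) scaling discussed above, and includes the small viscous shift of footnote 3:

```python
import math

def resonant_freq(m, L, k, g=9.81):
    """Angular resonant frequency of the hybrid spring-pendulum (equation 1.2),
    treating the limb as a simple pendulum: l_cm = L, I = m*L**2 (footnote 2)."""
    I = m * L ** 2
    return math.sqrt((m * g * L + k) / I)

def resonant_freq_damped(m, L, k, c, g=9.81):
    """Footnote 3: viscosity shifts the resonance down slightly."""
    I = m * L ** 2
    return math.sqrt((m * g * L + k) / I - c ** 2 / (4 * I ** 2))

# With k = 0 the model reduces to a pendulum, omega_r = sqrt(g/L):
# the "pendular" scaling of stepping frequency with limb length.
for L in (0.5, 1.0, 2.0):
    print(L, resonant_freq(1.0, L, 0.0) / (2 * math.pi))  # Hz, falls as 1/sqrt(L)
```

Halving the limb length raises the resonant frequency by a factor of sqrt(2), which is the scaling Alexander and Jayes's allometric model predicts for walking.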
If the length of the wing is much larger than the other two dimensions, the hypothesis would predict a scaling power of -1.0. I have shown previously how a spinal pattern generator model (Miller and Scott 1977) coupled to a pendular limb can exhibit the properties described above (Hatsopoulos et al. 1992). In this paper, I generalize these results by replacing the neural pattern generator with a more abstract, limit-cycle oscillator. I performed a set of simulations using a van der Pol oscillator coupled to a linearized hybrid spring-pendulum system. Since the nature of the pattern generator circuit for walking (in any system) is still poorly understood, it is important to demonstrate that these properties are a general consequence of the two-way coupling and not unique to one type of central pattern generator model.

2 Method

2.1 The Oscillators. The following equation describes the dynamics of the CPG modeled by a van der Pol oscillator, which yields a limit-cycle attractor in phase space:
y + F(y2 - 1)q + L2y = 0
(2.1)
where y is the output of the CPG oscillator and is unitless, c is a parameter that affects the shape of the limit-cycle, and d is the angular frequency of the oscillator. The nonlinear, dissipative term is responsible for the existence of a limit cycle by acting either as a damper when y is greater than 1 or as an excitor when y is less than 1. The dynamics of the limb segment were modeled by a linearized pendulum described by equation 1.1. The pendulum equation was linearized to make analysis simpler. If the oscillation amplitude is kept small (< 20"), the linearization of the pendulum oscillator is appropriate (Seto 1964). The numerical simulations were performed with both the linearized and nonlinear versions of the pendulum equation. The results of the nonlinear version were qualitatively similar and so are not reported. 2.2 The Coupling Equations. The CPG oscillator drove the pendulum via a spring-like muscle whose resting or equilibrium length is proportional to the output of the CPG oscillator (Fel'dman 1966).4The driving torque is, therefore, proportional to the difference between the position of the pendulum and the output of the oscillator:
T = -k [6'
-
( G / k ) y ] -kB 1
+ Gy
(2.2)
-
4Actually,equation 2.2 represents the torque generated by an agonist-antagonist pair of muscles because a single muscle can generate torque only in one direction. Fel'dman (1966) demonstrated that under static conditions a fixed level of muscle activity defines a unique torque-angle curve with a certain equilibrium position.
Coupling Neural and Physical Dynamics
571
where k is the stiffness of the muscle and G is the muscle gain parameter whose units are in Newton-meters. If k is zero, the neural oscillator acts like a pure torque d r i ~ e r . An ~ alternate muscle model postulated by Hogan (1984) was also used in one set of simulations: T
1
-ky [O - (G/k)] = -kyH
+ Gy
(2.3)
In this model, the stiffness as well as the equilibrium position of the muscle are modulated by the output of the CPG oscillator. The results of this set of simulations are qualitatively similar to those using the first muscle model in relation to resonance tuning, and so are not reported in the Results section. Positional feedback from the pendular limb acted on the CPG by modulating its frequency:

ω = ω₀ + Bθ    (2.4)
where ω₀ is the endogenous frequency of the CPG oscillator without proprioceptive feedback, B is the feedback gain, and θ is the angular displacement of the pendular limb. Thus, the CPG model was a nonlinear, parametrically modulated oscillator.⁶ Two variations of this feedback equation were also used. First, the angular position of the pendulum was replaced with the absolute value of the angular position. In many ways, this represents a more realistic feedback model since muscle spindles act as rectifying position sensors. In this case, positional feedback comes from the agonist during one half of the cycle and from the antagonist muscle during the other half cycle. Second, positional feedback was replaced with torque feedback by using the full-wave rectified angular acceleration of the pendulum. The results using these two variations of the feedback equation were similar to the original as far as resonance tuning was concerned. That is, the frequency of the entrained system increased linearly with the resonant frequency of the pendulum for resonant frequencies ranging from 0.5 to about 1.1 Hz. The results that are reported assume the feedback model represented in equation 2.4 unless otherwise noted. There is indirect and direct experimental evidence for the general formulation of the feedback equation in which proprioceptive afferents act to modulate the endogenous frequency of the CPG. Some indirect evidence was reported above in which the removal of sensory feedback decreased the frequency of a motor rhythm (Wilson 1961; Grillner and Zangger 1979; Wallén and Williams 1984; Williamson and Roberts 1986; Mos and Roberts 1994). In addition, it has been demonstrated that tonic electrical stimulation from certain supraspinal areas in cat will induce fictive locomotion whose frequency is modulated by the stimulus strength (Shik et
⁵k was set to zero in all the simulations to be reported in this paper.
⁶The CPG model is basically a hybrid van der Pol and Mathieu oscillator (Jordan and Smith 1989).
Nicholas G. Hatsopoulos

Table 1: Parameter Values Used in Simulations

m (kg): 10
L (m): 0.1-10.0
c (N-m-sec/rad): 0.5-1.0
f: 0.5
G (N-m): 0.25-1.6
B (sec⁻¹): 0-50
ω₀ (rad/sec): 0-5
al. 1966). This suggests that tonic synaptic input to the pattern generator network can affect its output frequency. Therefore, if proprioceptive afferents have synaptic influence on the pattern generator network, they could, in principle, affect the CPG frequency. More direct evidence includes the finding that tonic electrical stimulation of sensory nerves in the thoracic ganglia of locusts will increase the wing beat frequency (Wilson 1964). The simulations involved the numerical integration of the coupled set of differential equations with a fourth-order Runge-Kutta integrator whose time step was fixed at either 2.5 or 5 msec. Table 1 shows the range of parameter values used in the simulations.

3 Results
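The results below come from integrating equations 2.1-2.4 as just described. The following is a minimal Runge-Kutta sketch only: the standard van der Pol form assumed for equation 2.1 (which appears earlier in the paper and is not reproduced in this excerpt), the choice k = 0 (footnote 5), the damping term cd, and all default parameter values (drawn from the ranges in Table 1) are my assumptions, not the author's exact setup.

```python
import math

def simulate(L=0.4, G=0.5, B=5.0, w0=1.0, c=1.0,
             m=10.0, cd=0.5, dt=0.005, t_end=60.0):
    """RK4 integration of the coupled CPG-pendulum model (a sketch;
    the van der Pol form of eq 2.1 and k = 0 are assumptions)."""
    g = 9.81  # m/s^2

    def deriv(s):
        y, ydot, th, thdot = s
        w = w0 + B * th                                # eq 2.4: feedback modulates frequency
        torque = G * y                                 # eq 2.2 with k = 0 (footnote 5)
        yddot = c * (1.0 - y * y) * ydot - w * w * y   # assumed van der Pol CPG (eq 2.1)
        thddot = (-m * g * L * th - cd * thdot + torque) / (m * L * L)  # linearized pendulum
        return (ydot, yddot, thdot, thddot)

    def step(s):
        k1 = deriv(s)
        k2 = deriv(tuple(si + 0.5 * dt * ki for si, ki in zip(s, k1)))
        k3 = deriv(tuple(si + 0.5 * dt * ki for si, ki in zip(s, k2)))
        k4 = deriv(tuple(si + dt * ki for si, ki in zip(s, k3)))
        return tuple(si + (dt / 6.0) * (a + 2 * b + 2 * e + d)
                     for si, a, b, e, d in zip(s, k1, k2, k3, k4))

    s = (0.1, 0.0, 0.0, 0.0)  # (y, ydot, theta, thetadot)
    theta_trace = []
    for _ in range(int(t_end / dt)):
        s = step(s)
        theta_trace.append(s[2])
    return theta_trace
```

The entrained frequency of such a run can then be estimated, for example, from zero crossings of the returned angle trace.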
3.1 Frequency vs. Amplitude. I conducted a set of simulations to investigate the relationship between oscillation frequency and amplitude in the pendular limb. The amplitude of oscillations was modulated by changing the muscle gain, G, in equation 2.2. As expected, the oscillation amplitude decreased as the muscle gain was decreased. More surprisingly, the oscillation period increased as the amplitude decreased. The period and amplitude of the pendular limb are plotted as a function of cycle number in Figure 1A. These step-wise changes in muscle gain were performed to simulate the effects of intravenous injection of curare into the muscles of the dogfish mediating swimming (Williamson and Roberts 1986) (Fig. 1B). As the curare takes effect, the period of the oscillating body increases as the amplitude decreases. After about 80 sec from the time of injection, the period and amplitude return to their normal values.

3.2 Phase. A striking feature of the simulations is the relative phase between the torque generated by the CPG oscillator and the angular displacement of the pendular limb (Fig. 2). The endogenous frequency of the CPG oscillator without feedback, ω₀, is set at 1 rad/sec or 0.16 Hz. The length of the pendulum is set at one of four values: 0.1, 0.2, 0.4, and
[Figure 1 appears here: panel A, Simulation (abscissa: cycle number); panel B, Experiment (abscissa: time in seconds).]
Figure 1: (A) The amplitude (open circles) and period (solid circles) of the pendular limb versus cycle number as the muscle gain, G, is varied. Inset: muscle gain, G, as a function of time. (B) The amplitude (open triangles) and period (solid triangles) of the dogfish swimming rhythm versus time as the level of curare in the muscle varied (copied from Williamson and Roberts 1986). Intravenous injection of curare is applied at the arrow.
0.8 m, which correspond to resonant frequencies of about 1.58, 1.11, 0.79, and 0.56 Hz, respectively. With feedback, the torque and, therefore, the CPG output (the torque is in-phase with the CPG output; the two
[Figure 2 appears here: panel A, WITH FEEDBACK, and panel B, WITHOUT FEEDBACK, each with traces for L = 0.1, 0.2, 0.4, and 0.8 m; time scale bar 5 s.]
Figure 2: The torque generated by the CPG (solid lines) and angular displacement of the pendular limb (dotted lines) as a function of time for four different pendulum lengths. (A) With feedback. Notice that the torque generally leads the angular displacement of the pendulum by 90°. For L = 0.4 and L = 0.8, the phase lead varies from cycle to cycle between 90 and over 180°. Also, notice how the frequency and amplitude of the pendulum's oscillations decrease as the length of the pendulum increases. The feedback signal used in these simulations was the full-wave rectified angular position of the pendulum. (B) Without feedback. The torque is in-phase with the angular displacement of the pendulum. Only the amplitude of the pendulum's oscillations decreases as the length of the pendulum increases. The frequency is fixed and determined by the endogenous frequency of the CPG oscillator: 0.16 Hz.
differ only by the factor, G) generally lead the pendulum's displacement by a phase of approximately 90° (Fig. 2A). The significance of this phase lead is mentioned in the Discussion. For a pendulum length of 0.4 and 0.8 m, the relative phase between the two oscillators changes from cycle to cycle. The phase lead of the CPG output varies between 90 and over 180°. Also, the amplitude and frequency of the two oscillators vary from cycle to cycle. The average frequency and amplitude of the pendulum's oscillations decrease as the length of the pendulum increases. In the absence of feedback, the two oscillators are nearly in-phase for all pendulum lengths (Fig. 2B). Notice that the amplitude of the pendulum's oscillations decreases as the pendulum length increases. In contrast, the frequency remains fixed at 0.16 Hz because the pendulum is acting like a passively driven system.
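The four pendulum lengths quoted above map onto the stated resonant frequencies through the small-angle pendulum relation f_r = (1/2π)√(g/L). A quick check (g = 9.81 m/s² assumed):

```python
import math

g = 9.81  # m/s^2, assumed

def resonant_frequency_hz(L):
    """Resonant frequency of a simple (small-angle) pendulum of length L."""
    return math.sqrt(g / L) / (2.0 * math.pi)

for L in (0.1, 0.2, 0.4, 0.8):
    print(f"L = {L} m -> f_r = {resonant_frequency_hz(L):.2f} Hz")
# prints 1.58, 1.11, 0.79, and 0.56 Hz, matching the values in the text
```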
3.3 Resonance Tuning. I performed another set of simulations to investigate the ability of the coupled system of oscillators to tune its frequency to the resonant frequency of the pendular limb. The length of the pendulum was decreased so that its resonant frequency increased by a factor of 3. The average frequency at which the coupled system (i.e., both the CPG and the pendular limb) settled is plotted as a function of the resonant frequency of the pendular limb for three different values of the muscle gain, G (Fig. 3A, with feedback). Notice that the average frequency of the system scaled with or nearly matched the resonant frequency of the pendulum. Although resonance matching required a particular set of parameter values, resonance scaling was a robust phenomenon as long as the resonant frequency of the pendular limb was higher than the endogenous frequency, ω₀, of the CPG oscillator. If the resonant frequency of the limb approached (or fell below) the endogenous frequency of the CPG oscillator, two phenomena occurred. First, the system entrained to the endogenous frequency of the CPG oscillator. For example, if the endogenous frequency was set to 1.6 Hz (ω₀ = 10.0), the pendular limb settled at that frequency despite variations in the resonant frequency when the resonant frequency fell below the endogenous frequency (Fig. 3B, hollow circles). Second, the system often entered a regime of subharmonic entrainment in which the pendular limb swung at its resonant frequency while the CPG oscillated at an integer multiple of the pendular limb's frequency. Resonance tuning also broke down at resonant frequencies much larger than the endogenous frequency of the CPG (Fig. 3B, all three simulations). Resonance scaling was observed for muscle gains that varied by as much as one order of magnitude and for feedback gains ranging from under 5 up to 50.
[Figure 3 appears here; abscissa of panel C: endogenous frequency of CPG (Hz).]

Figure 3: (A) The frequency of the coupled system as a function of resonant frequency of the pendular limb without feedback (open circles) and with feedback for three different muscle gain values (solid circles, G = 0.5; solid squares, G = 0.8; solid triangles, G = 1.1). The endogenous frequency is held at 0.16 Hz. (B) The frequency of the pendular limb as a function of resonant frequency for three different endogenous frequencies. Notice how the pendular limb entrains to the endogenous frequency when the resonant frequency is comparable to or smaller than the endogenous frequencies [ω₀ = 5.0 (0.8 Hz) and ω₀ = 10.0 (1.6 Hz)]. (C) The frequency of the coupled system as a function of the CPG's endogenous frequency without feedback (open circles) and with feedback (solid circles). The resonant frequency of the pendular limb is held at 0.56 Hz. The error bars represent standard deviations of the frequency due to variability from cycle to cycle.

When feedback was removed (B = 0), the pendular limb acted like a passively driven system whose frequency was dictated by the CPG's frequency: the limb frequency remained constant as the resonant frequency increased (Fig. 3A, without feedback).⁷ It is important to note that resonance tuning required that sensory feedback modulate the endogenous frequency of the CPG oscillator. When the feedback equation was modified such that proprioception from the pendulum acted on the CPG oscillator as a driving torque as opposed to a frequency modulator, the frequency of the coupled system remained fixed and did not vary with the resonant frequency of the pendulum. That is, the two oscillators always entrained to the endogenous frequency of the CPG oscillator. I performed a complementary set of simulations in which the resonant frequency of the pendular limb was held fixed while the endogenous frequency, ω₀, of the CPG oscillator was increased (Fig. 3C). The frequency of the coupled system remained relatively constant near the resonant frequency as the endogenous frequency was increased. On the other hand, the frequency of the system tracked the endogenous frequency of the CPG oscillator when sensory feedback was removed.

3.4 Frequency Control. This model raises the question of how the nervous system can influence the oscillation frequency of the pendular limb if its frequency is so resistant to changes in the CPG's endogenous frequency. There are at least three methods by which the CPG oscillator might control the frequency of the pendular limb. First, I have shown how the muscle gain, G, can affect both the frequency and amplitude of the motor rhythm. Second, modulating the stiffness of the joint (k in equation 2.2) will change the frequency of the system. Joint stiffness can be changed either by changing the stiffnesses of individual muscles or by co-contracting agonist-antagonist pairs of muscles. In either case, an increase in joint stiffness will increase the resonant frequency of the hybrid, spring-pendulum system and, therefore, increase the frequency of the pendular limb. Simulations have demonstrated the feasibility of such a method.
Third, if sensory feedback can be gated by the nervous system, the pendular limb will become a passively driven system whose frequency will track the endogenous frequency of the CPG (see Fig. 3C, without feedback). 4 Discussion
I have presented the results of a number of simulations to support the idea that rhythmic movements emerge from the interaction of the neural dynamics of the nervous system and the physical dynamics of the periphery. In particular, I have shown that modeling feedback as a parametric coupling of the CPG's frequency generates the amplitude-frequency relationship observed experimentally (Williamson and Roberts 1986).
⁷On occasion, subharmonic entrainment was observed even without sensory feedback.
I have demonstrated that the relative phase between the CPG output and the pendulum's angular displacement with sensory feedback remains generally invariant at 90° despite variations in the system's frequency induced by changes in the pendulum length (see Fig. 2A). The phase transfer function of a second-order system such as a linearized pendulum varies from 0 to 180° as the driving frequency increases. In fact, the phase transfer function is particularly sensitive to changes in driving frequency near 90°. So how does the CPG output maintain a 90° phase lead relative to the pendulum's displacement? Proprioceptive feedback allows the central pattern generator to tune its frequency to the resonant frequency of the periphery (see Fig. 3A). The phase transfer function of a second-order system attains a value of 90° at its resonant frequency. On the other hand, the two oscillators are in-phase in the absence of sensory feedback because the endogenous frequency of the CPG oscillator without feedback is much lower than the resonant frequency of the pendulum (ω₀/2π = 0.16 Hz) in accord with the phase transfer function. Resonance tuning seems to be a general feature of a limit-cycle oscillator coupled to a pendulum with this form of parametric feedback. If the viscous term in the van der Pol oscillator is modified by replacing y² with y⁴, the system remains a limit-cycle oscillator, and resonance tuning is observed. I have also shown the existence of resonance tuning when the formal limit-cycle oscillator is replaced with a sixth-order, neural oscillator with limit-cycle stability (Hatsopoulos et al. 1992). A simplistic way to understand how resonance tuning arises is to consider the interaction of the frequency response of the pendulum with the feedback equation. The frequency response of a second-order system relates its oscillation amplitude to the driving frequency and has a peak at its resonant frequency.
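For a damped second-order system θ'' + 2ζω_n θ' + ω_n²θ = F(t), the steady-state phase lag behind a sinusoidal drive is φ(ω) = atan2(2ζω_n ω, ω_n² - ω²), which equals 90° exactly at ω = ω_n and changes most steeply there when ζ is small. A quick numerical check (the ζ and ω_n values are illustrative choices, not the paper's):

```python
import math

def phase_lag_deg(w, wn, zeta):
    """Phase by which a damped second-order system's steady-state response
    lags a sinusoidal drive at angular frequency w (0 deg at DC, 180 deg at high w)."""
    return math.degrees(math.atan2(2.0 * zeta * wn * w, wn**2 - w**2))

wn = 2.0 * math.pi * 0.79          # e.g., the L = 0.4 m pendulum
print(phase_lag_deg(wn, wn, 0.1))  # exactly 90.0 at resonance
print(phase_lag_deg(0.5 * wn, wn, 0.1))  # well below 90 below resonance
print(phase_lag_deg(2.0 * wn, wn, 0.1))  # approaches 180 above resonance
```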
The feedback equation determines indirectly how the frequency of the CPG (i.e., the driving frequency) will depend on the oscillation amplitude of the pendulum because the equation relates the endogenous frequency of the CPG oscillator to the instantaneous displacement of the pendulum. In the case of the modified feedback equation involving the rectified displacement, the larger the amplitude of the pendulum, the larger the temporal average of the pendulum's rectified displacement and, therefore, the CPG's endogenous frequency. Therefore, the frequency and amplitude at which the pendulum equilibrates are determined by the intersection of its frequency response with the curve relating the frequency of the CPG with the amplitude of the pendulum. As the resonant frequency of the pendulum increases, the frequency response curve shifts horizontally along the frequency axis. Therefore, the intersection point will also shift to a higher frequency. This is similar to the economic concept that the price of a product depends on the intersection of the supply and demand curves. Resonance tuning is an important property for at least two reasons. First, driving a physical system at its resonant frequency requires a minimum driving torque for a fixed movement amplitude. This implies that the metabolic costs of driving a limb at its resonant frequency are minimized. This can be viewed in another way. Since the torque leads the displacement of the pendular limb by 90°, the time-integral of the product of the external torque and angular velocity over a period of the rhythm (i.e., the work done by the muscles on the pendulum) is maximized. Second, resonance tuning stabilizes the movement by making the movement frequency resistant to fluctuations in the endogenous frequency of the CPG. This model of rhythmic motor control makes a number of predictions that can be tested experimentally. First, under most conditions⁸ the net torque should lead the displacement of the limb by 90°. Hatsopoulos and Warren (1995) provided evidence for this in human arm movements about the elbow joint. Second, feedback should act to stabilize the motor rhythm's frequency. Some evidence for this comes from work on motor rhythms in the eel (Mos and Roberts 1994). Third, it suggests that movement frequency should increase with joint stiffness (Latash 1992). Finally, movement frequencies away from resonance could also involve the gating or blocking of sensory feedback.
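The claim above that a 90° torque lead maximizes the work done per cycle can be checked directly: with torque T₀ sin(ωt + φ) and displacement A sin(ωt), the work integral over one period evaluates analytically to π T₀ A sin φ, which peaks at φ = 90° (torque in phase with velocity). A minimal numerical sketch (the function name and parameters are illustrative):

```python
import math

def work_per_cycle(phi, T0=1.0, A=1.0, w=1.0, n=10000):
    """Numerically integrate torque * d(theta)/dt over one period,
    for torque leading displacement by phase phi (radians)."""
    dt = 2.0 * math.pi / (w * n)
    total = 0.0
    for i in range(n):
        t = i * dt
        torque = T0 * math.sin(w * t + phi)
        thetadot = A * w * math.cos(w * t)
        total += torque * thetadot * dt
    return total  # analytically: pi * T0 * A * sin(phi)

print(work_per_cycle(math.pi / 2))  # maximal, close to pi
print(work_per_cycle(0.0))          # near zero: in-phase torque does no net work
```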
Acknowledgments

This research was partially funded by a grant from the Office of Naval Research to Gilles Laurent. I thank Micah Siegel and William H. Warren for their very helpful comments on this manuscript.
References

Alexander, R. M., and Jayes, A. S. 1983. A dynamic similarity hypothesis for the gaits of quadrupedal mammals. J. Zool. London 201, 135-152.
Bassler, U. 1983. Neural Basis of Elementary Behavior in Stick Insects. Springer-Verlag, Berlin.
Bernstein, N. A. 1967. The Coordination and Regulation of Movements. Pergamon Press, London.
Delcomyn, F. 1980. Neural basis of rhythmic behavior in animals. Science 210, 492-498.
Fel'dman, A. G. 1966. Functional tuning of the nervous system during control of movement or maintenance of a steady posture. II. Controllable parameters of the muscle. Biophysics 11, 565-578.
Greenewalt, C. H. 1975. The flight of birds. Transact. Am. Philos. Soc. 65, 21-23.
Grillner, S., and Zangger, P. 1979. On the central generation of locomotion in the low spinal cat. Exp. Brain Res. 34, 241-261.
⁸There may be particular circumstances in which sensory feedback is gated so that rhythmic movement can occur away from resonance.
Hatsopoulos, N. G. 1992. The coupling of neural and physical dynamics in motor control. Unpublished doctoral dissertation, Brown University, Providence.
Hatsopoulos, N. G., Warren, W. H., and Sanes, J. N. 1992. A neural pattern generator that tunes into the physical dynamics of the limb system. Int. Joint Conf. Neural Networks '92, I, 104-109. Baltimore, MD.
Hatsopoulos, N. G., and Warren, W. H. 1995. Resonance tuning in rhythmic arm movements. J. Motor Behav., in press.
Hogan, N. 1984. Adaptive control of mechanical impedance by coactivation of antagonist muscles. IEEE Transact. Automatic Control 29, 681-690.
Holt, K. G., Hamill, J., and Andres, R. O. 1990. The force-driven harmonic oscillator as a model for human locomotion. Human Movement Sci. 9, 55-68.
Hoy, M. G., and Zernicke, R. F. 1985. Modulation of limb dynamics in the swing phase of locomotion. J. Biomech. 18, 49-60.
Hoy, M. G., and Zernicke, R. F. 1986. The role of intersegmental dynamics during rapid limb oscillations. J. Biomech. 19, 867-877.
Jordan, D. W., and Smith, P. 1989. Nonlinear Ordinary Differential Equations. Clarendon Press, Oxford.
Kugler, P. N., and Turvey, M. T. 1987. Information, Natural Law, and the Self-Assembly of Rhythmic Movement. Lawrence Erlbaum, Hillsdale, NJ.
Latash, M. L. 1992. Virtual trajectories, joint stiffness, and changes in the limb natural frequency during single-joint oscillatory movements. Neuroscience 49, 209-220.
Merton, P. A. 1953. Speculations on the servo-control of movement. In CIBA Foundation Symposium, The Spinal Cord, G. E. W. Wolstenholme (ed.), pp. 247-255. Churchill, London.
Miller, S., and Scott, P. D. 1977. The spinal locomotor generator. Exp. Brain Res. 30, 387-403.
Mos, W., and Roberts, B. L. 1994. The entrainment of rhythmically discharging reticulospinal neurons of the eel by sensory nerve stimulation. J. Comp. Physiol. A 174, 391-397.
Pennycuick, C. J. 1975. On the running of the gnu (Connochaetes taurinus) and other animals. J. Exp. Biol. 63, 773-799.
Schneider, K., Zernicke, R. F., Schmidt, R. A., and Hart, T. J. 1989. Changes in limb dynamics during the practice of rapid arm movements. J. Biomech. 22, 805-817.
Schneider, K., Zernicke, R. F., Ulrich, B. D., Jensen, J., and Thelen, E. 1990. Understanding movement control in infants through the analysis of limb intersegmental dynamics. J. Motor Behav. 22, 493-520.
Seto, W. W. 1964. Theory and Problems of Mechanical Vibrations. Schaum, New York.
Shik, M. L., Severin, F. V., and Orlovsky, G. N. 1966. Control of walking and running by means of electrical stimulation of the mid-brain. Biophysics 11, 756-765.
Sotavalta, O. 1954. The effect of wing inertia on the wing stroke frequency of moths, dragonflies, and cockroach. Ann. Ent. Fenn. 20, 93-100.
Stein, R. B. 1982. What muscle variable(s) does the nervous system control in limb movements? Behav. Brain Sci. 5, 535-577.
Wallén, P., and Williams, T. L. 1984. Fictive locomotion in the lamprey spinal cord in vitro compared with swimming in the intact and spinal animal. J. Physiol. London 347, 225-239.
Williamson, R. M., and Roberts, B. L. 1986. Sensory and motor interactions during movement in the spinal dogfish. Proc. Royal Soc. London Ser. B 227, 103-119.
Wilson, D. M. 1961. The central nervous control of flight in a locust. J. Exp. Biol. 38, 471-490.
Wilson, D. M. 1964. The origin of the flight-motor command in grasshoppers. In Neuronal Theory and Modeling, R. F. Reiss (ed.), pp. 331-345. Stanford University Press, Stanford, CA.
Received July 22, 1994; accepted August 14, 1995.
Communicated by Ning Qian
A Nonlinear Hebbian Network that Learns to Detect Disparity in Random-Dot Stereograms

Christopher W. Lee
Washington University School of Medicine, St. Louis, MO 63110 USA

Bruno A. Olshausen
Washington University School of Medicine, St. Louis, MO 63110 USA, and California Institute of Technology, Pasadena, CA 91125 USA

An intrinsic limitation of linear, Hebbian networks is that they are capable of learning only from the linear pairwise correlations within an input stream. To explore what higher forms of structure could be learned with a nonlinear Hebbian network, we constructed a model network containing a simple form of nonlinearity and we applied it to the problem of learning to detect the disparities present in random-dot stereograms. The network consists of three layers, with nonlinear sigmoidal activation functions in the second-layer units. The nonlinearities allow the second layer to transform the pixel-based representation in the input layer into a new representation based on coupled pairs of left-right inputs. The third layer of the network then clusters patterns occurring on the second-layer outputs according to their disparity via a standard competitive learning rule. Analysis of the network dynamics shows that the second-layer units' nonlinearities interact with the Hebbian learning rule to expand the region over which pairs of left-right inputs are stable. The learning rule is neurobiologically inspired and plausible, and the model may shed light on how the nervous system learns to use coincidence detection in general.

1 Introduction
In recent years, linear Hebbian learning rules have been used to model the development of receptive field properties in the central nervous system (e.g., Linsker 1988; Miller et al. 1989; Sereno and Sereno 1991; Berns et al. 1993). These networks have many attractive features: they discover structure in input data, reduce redundancy, and perform principal component analysis (Hertz et al. 1991, pp. 197-215). Significantly, these models have for the most part ignored the nonlinearities inherent in real neurons (for an exception see Miller 1990). It is important to develop sound theories for the roles nonlinearities might play in such unsupervised neural networks. However, "good theories rarely develop outside the context of a background of well-understood real problems and special cases"
Neural Computation 8, 545-566 (1996)
@ 1996 Massachusetts Institute of Technology
546
Christopher W. Lee and Bruno A. Olshausen
(Minsky and Papert 1988, p. 3). Thus, we have chosen to study in detail the problem of disparity detection for the extraction of surface depth from stereoscopic images. This problem is particularly appropriate because (1) disparity has known behavioral and neurobiological relevance, (2) psychophysicists have shown that random-dot stereograms provide simple, mathematically well-defined stimuli that capture the essence of the disparity problem (Julesz 1971), and (3) disparity processing has been proven to require nonlinearity for its implementation (Minsky and Papert 1988, pp. 48-54). For these reasons we created a network containing simple, nonlinear units that can learn to detect disparity in random-dot stereograms under biologically inspired, Hebbian learning rules. In this paper, we present a description of this network followed by an in-depth characterization of its learning and performance.

2 Inspiration from Neurobiology
Several basic characteristics of neural signaling shape our approach to modeling a nonlinear network. A first, simple form of nonlinearity is inherent in the neurobiology of synaptic transmission: a real synapse is either excitatory or inhibitory, whereas a model linear synapse may change from one to the other.¹ Another nonlinear aspect of synaptic transmission is long-term potentiation (LTP). LTP is the increase in synaptic efficacy that occurs between active pre- and postsynaptic neurons. The phenomenon expresses three basic properties in relating presynaptic activity to changes in synaptic strength: input specificity, associativity, and cooperativity. LTP is input specific in that nonactive synapses are not potentiated during induction of LTP. Associativity refers to the cell's ability to potentiate a weak input if it is paired with a simultaneously active strong input. Finally, "cooperativity describes the existence of an intensity threshold for induction" (Bliss and Collingridge 1993); in other words, a critical number of afferents must be active for induction of LTP (McNaughton et al. 1978). By contrast, linear Hebbian rules are not cooperative in this sense. For a linear neuron, Hebb's rule states only that the change in synaptic strength is proportional to the postsynaptic activity times the presynaptic activity. This rule is associative and input specific, but has no threshold for induction: a single weak input repeated twice achieves the same level of synaptic enhancement that coactive inputs would achieve in one step (cf. Holmes and Levy 1990). Thus, LTP is Hebbian, but not linear.²
¹Here we consider only fast synaptic transmission.
²Previously, Miller (1990) addressed some of these sources of nonlinearity by studying a system with two populations of on-off cells, developing an analytical framework based upon linearizing the difference between these two input sources.
We have chosen a different approach by incorporating these properties of synaptic rectification and cooperativity directly into a neural unit.
Nonlinear Hebbian Network
547
Figure 1: Form of the nonlinear function σ. Shown are graphs of σ(x) for typical values used in our simulation, β = 5, 10, and 20. Dotted lines indicate the activation levels given by one or two synapses with maximal synaptic weights.

3 Mathematical Formalism
We incorporate the above properties into a neural unit based upon one of the standard model units commonly used in neural network models: the summing unit with sigmoid-shaped activation function. We define this sigmoid unit by its input-output relation and learning rule. Let y_i denote the unit output and let x_j denote the set of inputs to the unit. Then, if each connection strength is represented by a weight, w_ij, the output of the unit is given by

y_i = σ(Σ_j w_ij x_j)    (3.1)

where σ(x) is a sigmoid-shaped function with a bias of 1/2 so that for zero input the output is zero. Specifically,

σ(x) = 1 / (1 + exp[-β(x - 1/2)]) for x > 0, and σ(x) = 0 for x ≤ 0    (3.2)

where β determines the steepness of the slope at x = 1/2 (see Fig. 1). We also assume that the inputs are normalized to take on values from zero to one.
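As a concrete check on these definitions, here is a minimal sketch of the unit; the hard clamp to zero for nonpositive input is one reading of the scan-damaged equation 3.2, chosen so that the output is zero for zero input, as the text requires:

```python
import math

W_MAX = 1.0 / 3.0  # upper limit on any single weight (see Section 3)

def sigma(x, beta=20.0):
    """Sigmoid activation biased at 1/2, clamped so that sigma(x) = 0 for x <= 0."""
    if x <= 0.0:
        return 0.0
    return 1.0 / (1.0 + math.exp(-beta * (x - 0.5)))

def unit_output(weights, inputs, beta=20.0):
    """Output of the summing unit (eq 3.1): sigma of the weighted input sum."""
    return sigma(sum(w * x for w, x in zip(weights, inputs)), beta)

# One maximal synapse barely drives the unit; two together drive it strongly.
print(sigma(W_MAX))      # << 1
print(sigma(2 * W_MAX))  # close to 1
```

With β = 20, a single maximal synapse yields an activation of about 0.03, while two maximal synapses yield about 0.97, realizing the one-versus-two-input threshold described in the text.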
Christopher W. Lee and Bruno A. Olshausen
548
For our learning rule we use a standard form of Hebbian learning (Linsker 1988) with synaptic changes, Δw_ij, given by

Δw_ij ∝ (output) × (input)   (3.3)

which, with a subtractive constraint included, becomes

Δw_ij = α y_i (x_j − φ)   (3.4)

where φ is a parameter describing the amount of heterosynaptic competition among synapses (0 < φ < 1). To ensure consistency with the one-sided nature of (excitatory) synaptic transmission, the weights are allowed to take on values only greater than or equal to zero. Cooperativity in the learning rule comes from the sigmoidal nonlinearity. To maintain a "threshold" for weight increases, a single active connection should not be able to affect the unit strongly. Therefore, we must set an upper limit, w_max, on the strength of any one connection. We choose w_max so that two strong inputs can have a strong effect on the unit [σ(2w_max) ≈ 1], but a single input can only weakly affect the unit [σ(w_max) << 1]. Specifically, we set w_max = 1/3. (The responses resulting from one or two inputs are superimposed on the graph of Figure 1.) In effect, we set our threshold to discriminate between states with one active input and states with two or more active inputs. To get a feeling for how the weights will evolve, one can qualitatively describe the behavior of a single unit as follows: As a series of inputs is presented to the unit, the synapses will "compete" among themselves to maximize their strengths due to the heterosynaptic depression term, φ. Over time, some synapses will begin to win out and others will be suppressed. However, unlike the linear case, a single synapse will be much less likely to dominate all the others because a single input, acting alone, is not able to induce a substantial change in the unit's response, and, hence, it is also unable to make a change in the unit's synapses. Strong synaptic modulation requires at least two inputs, and two inputs that are active at the same time will, on average, strengthen their weights more than the competition between the two inputs weakens them. In other words, it "pays" to cooperate, and so synaptic competition becomes a competition between pairs of inputs.
On average, the pairs whose inputs are statistically correlated will have an advantage in that competition. How much does it pay to cooperate? First, let us define the ratio R = σ(2w_max)/σ(w_max). Now, assuming that a pair of inputs with synaptic strengths (w_max, w_max) has already evolved, we can ask what it would take to destabilize the pair. On average, for each simultaneous, paired firing event, one synapse would have to fire without the other approximately R times to destabilize the pair. Note that the linear Hebb rule has R = 2, and thus shows a relatively small preference for pairings.
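A quick numerical check of the cooperation ratio R = σ(2w_max)/σ(w_max) for w_max = 1/3, using the piecewise-sigmoid form assumed above, shows how the nonlinearity amplifies the preference for paired firing relative to the linear rule's R = 2:

```python
import math

def sigma(x, beta):
    # Assumed piecewise sigmoid (eq. 3.2): zero for x <= 0.
    return 0.0 if x <= 0 else 1.0 / (1.0 + math.exp(-beta * (x - 0.5)))

w_max = 1.0 / 3.0
for beta in (5, 10, 20):
    single = sigma(w_max, beta)        # response to one active input
    paired = sigma(2 * w_max, beta)    # response to two active inputs
    R = paired / single                # cooperation ratio
    print(f"beta={beta:2d}: sigma(w_max)={single:.3f}, "
          f"sigma(2*w_max)={paired:.3f}, R={R:.1f}")
```

For β in the range used in the simulations, R substantially exceeds the linear value of 2, so an established pair is much harder to destabilize than under a linear Hebb rule.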
4 Simulation: "Learning Disparity"
We define our version of the problem of "learning disparity" as follows: Given a set of one-dimensional, random-dot stereograms, create a set of neural units that learns to become tuned selectively to the disparities present in the input. Random-dot stereograms have often been used as tests for disparity algorithms (e.g., Marr and Poggio 1975; Becker and Hinton 1992). This is, in part, because the lack of higher-level visual cues forces the algorithm to deal with the problem of false matches between left and right image elements. However, this lack of structure (specifically, the lack of correlations between pixel elements within each eye field) can simplify the learning process because it leaves only disparity-based structure in the input. We make use of the nonlinear units defined above in a three-stage network that learns to solve this problem (illustrated in Fig. 2). At the first layer, the inputs are assumed to be one-dimensional, binary images from the left and right eyes, which we denote x^L and x^R, respectively. The second stage consists of the sigmoid units, with outputs y_j and connection weights w^e_ij to the inputs x^e_i, where e takes on the values L and R. Following equation 3.1, the outputs, y_j, are given by

y_j = σ(Σ_e Σ_i w^e_ij x^e_i)   (4.1)
Each unit has connections to an equal number of inputs from the left and right eyes corresponding to the same region in the visual field. For ease of analysis, the inputs to each unit are chosen so they do not overlap.³ In the third layer, we use a variant of a standard clustering network whose properties have been well characterized (Rumelhart and Zipser 1985; Hertz et al. 1991, pp. 217-219). Each unit, z_k, receives inputs from all the y_j weighted by synaptic efficacies V_kj; then a winner-take-all competition takes place among the third-layer units to determine their final output. Only the winning unit changes its synaptic weights via a Hebb rule while the other units' outputs are set to zero. Thus, the third layer follows the equations:
z_k = 1 if Σ_j V_kj y_j > Σ_j V_mj y_j for all m ≠ k, and z_k = 0 otherwise   (4.3)

ΔV_kj ∝ z_k (y_j − ψ)   (4.4)
After the winner adjusts its weight vector in the direction of the current input vector, the weights are renormalized so that Σ_j V_kj = 1. The sub
³We have also shown that overlapping starting receptive fields can be used in conjunction with lateral inhibition to achieve a similar result (unpublished data).
Figure 2: Model network. (a) A sigmoid unit y_j receives a small number of inputs from the left and right eye first-layer units (x^L and x^R, respectively). The input units are arranged with the left eye units stacked on top of the right eye units so that the two images are in register. (b) The architecture of the model network. The sigmoid units in the second layer have nonoverlapping input fields. A third layer of units, z_k, is connected to the sigmoid units, y_j, through weights V_kj. The third-layer units compete in a winner-take-all manner. The weights from the input to second layer and from the second to third layer evolve according to Hebbian equations 3.4 and 4.4.
tractive term, ψ, is added to help sharpen the competition within the weight vector (see also Goodhill and Barrow 1994). We train the network on a sequence of random-dot stereograms at three different disparities. On each trial, the bits in the left eye image are
set randomly with bit probability p. The right eye image is then copied from the left eye image, shifted by an amount d:

x^R_i = x^L_{i−d}
where the disparity, d, is a randomly chosen integer from the set {−1, 0, 1}. We simulated the model using a second layer composed of 18 sigmoid units, with each unit connecting to five inputs from each eye. The inputs to each of these units corresponded to a separate field within the input array, and a one-pixel border was used to separate each input field from the next. The third layer of the network consisted of three units, each having connections to all of the second-layer units. The weights to the second and third layers were set to random values, and both sets of connections were allowed to evolve simultaneously according to equations 3.4 and 4.4.
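A sketch of the stereogram ensemble described above; the function name and the use of a wraparound shift at the borders are our own assumptions (the simulation instead used one-pixel borders between fields):

```python
import random

def make_stereogram(n=40, p=0.5, d=None, rng=random):
    """One-dimensional random-dot stereogram: the left image is random
    bits with density p; the right image is the left image shifted by a
    disparity d drawn from {-1, 0, +1} (wraparound at the borders)."""
    if d is None:
        d = rng.choice([-1, 0, 1])
    left = [1 if rng.random() < p else 0 for _ in range(n)]
    right = [left[(i - d) % n] for i in range(n)]  # x^R_i = x^L_{i-d}
    return left, right, d
```

Presenting a long sequence of such images is all the "supervision" the network receives; disparity is the only statistical structure linking the two eye fields.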
5 Results
After training, the output-layer units had each learned to respond selectively to stereograms with different disparities. Results were similar for parameter values in the ranges 5 ≤ β ≤ 20, 0.65 ≤ φ ≤ 0.85, and 0.0 ≤ ψ ≤ 0.25, and the results appeared insensitive to the type of initial conditions. Figure 3a shows a snapshot of the initial state of the network before learning. Figure 4a and b shows snapshots of the network at progressive stages of learning. In Figure 4a, the sigmoid units are beginning to become tuned to specific coincidences between left and right image pixels. In Figure 4b, this process has completed and the third-layer neurons have become tuned to specific disparities over the entire input space. These examples represent typical results for these parameters. Because each second-layer unit has an approximately equal chance of becoming tuned to a disparity of either +1, 0, or −1, the number of units in the second layer tuned to a specific disparity varies from one simulation run to the next. To illustrate the overall performance of the converged state of the network, we arranged the y_j so that their 18 first-layer receptive fields for each eye were stacked in two adjacent columns of nine. Thus, for each eye, the network's kernel has a rectangular domain of size 10 × 9 [(5 × 2) × 9] pixels that represents the 18 component receptive fields of 5 pixels apiece. We then convolved the network, thus arranged, with a random-dot stereogram. The result is shown in Figure 5. The output of the third layer is represented by plotting a different color pixel depending on which third-layer unit won. The network segregates a 0 disparity square (green) from a −1 disparity background (blue).
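To illustrate the converged behavior, here is a hand-wired sketch (not the trained network itself): each "unit" is reduced to its final disparity pair, the detectors for each disparity are slid across the image, and the disparity whose detectors fire most often wins. All names and the simple coincidence-count readout are our own assumptions.

```python
import random

def classify_disparity(left, right, disparities=(-1, 0, 1)):
    """Score each candidate disparity by counting pixelwise coincidences
    left_i AND right_{i+d}, mimicking an array of converged units that
    each keep one disparity pair; return the argmax disparity."""
    n = len(left)
    scores = {}
    for d in disparities:
        scores[d] = sum(left[i] & right[(i + d) % n] for i in range(n))
    return max(scores, key=scores.get), scores

rng = random.Random(0)
n, true_d = 60, 1
left = [rng.randint(0, 1) for _ in range(n)]
right = [left[(i - true_d) % n] for i in range(n)]  # shift by true_d
guess, scores = classify_disparity(left, right)
```

The detectors matched to the true disparity fire wherever the left pixel is on, while mismatched detectors fire only on chance coincidences, so the correct disparity collects roughly twice the score at p = 1/2.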
6 Analysis
Here we examine how each part of the network contributes toward the ability to extract disparities. The network relies upon the second-layer units' ability to develop weights corresponding to disparity pairs. Each sigmoid unit acts as a coincidence detector; its output essentially computes a logical AND, or pseudo-multiplication, of two inputs (x_i, x_j) when β is large. (In the analysis that follows, we will always assume that β is large enough to be in this regime.) The second layer therefore consists of an array of primitive, location-specific disparity detectors. The third
Figure 3: Initial state of the simulation. (a) A snapshot of the initial state of the network is shown with random connection strengths, w_ij ∈ [0, 0.1] and V_kj ∈ [0.01, 1]. The architecture is the same as in Figure 2. Connection strengths V_kj between the second- and third-layer units are indicated by line thickness. Connection strengths w_ij are indicated by the size of the filled rectangle in the small boxes beneath each second-layer unit, with a completely filled box indicating a connection of maximum strength. Activities of the units are indicated by shades of gray. The gradient bar shows the scale from zero (white) to one (black) of the unit activities. The input layer, labeled by x^L and x^R, shows a +1 disparity image. (b) A detailed illustration of how the connection strengths w_ij are represented, showing the correspondence between the depiction of weights in this and the previous figure.
Figure 4: Simulation evolution. (a) The state of the network after a few hundred iterations. The nonlinear units in the second layer begin dropping their inputs. Some have already settled on two inputs, while others are still converging. (b) The final state of the network after several thousand iterations. All the sigmoid units in the second layer have eliminated all but two inputs (one from each image), making the unit crudely selective for disparity. The units in the third layer have successfully learned to group together the second-layer units signaling the same disparity. Weight values, w_ij, and activities are indicated by filled boxes and grayscale as in Figure 3: white = 0.0, black = 1.0. The strengths of the weights, V_kj, are denoted by the line thickness. The parameters used were β = 10, φ = 0.70, and ψ = 0.25.

layer integrates across this array, effectively performing a cluster analysis on the variables y_j ≈ x_{j_a} x_{j_b}, where the y_j are the result of a selective sampling of the space of all multiplicative pairs of inputs. The network as a whole performs a generalized form of clustering, analogous to techniques used in statistics in which a set of variables is transformed before
applying a standard procedure such as principal component or cluster analysis.

Figure 5: Performance of the network on a random-dot stereogram. The network's input fields were arranged as described in the text, then convolved with a random-dot stereogram (top) containing a square of 0 disparity upon a −1 disparity background. (below) The output of the network at each position is shown coded by color. Legend: blue, green, and red indicate −1, 0, and +1 disparities, respectively.
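The coincidence-detection claim (a converged unit computes an approximate AND of its two surviving inputs) can be illustrated with the piecewise-sigmoid form assumed earlier and w_max = 1/3:

```python
import math

def sigma(x, beta=20.0):
    # Assumed piecewise sigmoid (eq. 3.2): zero for x <= 0.
    return 0.0 if x <= 0 else 1.0 / (1.0 + math.exp(-beta * (x - 0.5)))

w_max = 1.0 / 3.0
# Converged unit: exactly two surviving weights, both at w_max.
for x1 in (0, 1):
    for x2 in (0, 1):
        y = sigma(w_max * x1 + w_max * x2)
        print(f"x1={x1} x2={x2} -> y={y:.3f}")
```

For β = 20 the output is near 1 only when both inputs are active, approximating x1 AND x2; smaller β gives a softer conjunction.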
6.1 Evolution of the Second-Layer Units. Coincidence detection in the second-layer units depends upon the evolution of a weight vector with two nonzero components that are matched to one of the disparities
Figure 6: The effect of β on the stable fixed points of the system. The number of nonzero components, r, in the stable points of (a) a three-dimensional linear system with y = βw·x, and (b) a three-dimensional nonlinear system for β = 10. Note that the nonlinearity has enlarged the r = 2 region. These values are based upon computer simulations with random initial conditions and presentations as described previously, with the exception of the value of γ^DC in the three-dimensional linear system, which was solved for directly (see Appendix).

in the input, i.e., a vector with w^L_i = w^R_{i+d} = w_max for d = −1, 0, or 1, with all other weights 0. (We will refer to this weight configuration as a "disparity pair.") Strictly speaking, a nonlinear activation function is not required for development of this weight configuration. A modified linear system can develop disparity pairs when equipped with synaptic positivity (0 ≤ w_i ≤ w_max) and subtractive constraints (the φ term in our nonlinear system; see Miller and MacKay 1994 for an analysis of this form of constraint). For example, if instead of equation 3.1 we substitute for the unit output y = βw·x + b, then under the same Hebb rule as equation 3.4, the system will evolve disparity pairs for b = 0, β = 5, and φ = 0.7. Other combinations of β, b, and φ will also suffice. We observed that a nonlinear activation function offers one advantage over a linear function in the process of developing disparity pairs: the nonlinear system converges to paired weights for a larger range of the parameters β and φ than the linear system. This occurs because the linear system must balance these parameters carefully: φ must be large enough to eliminate synapses while not so large that it eliminates one member of a disparity pair. By contrast, the cooperativity inherent in the sigmoid function selectively stabilizes pairs. Therefore φ may be larger than for a similar linear system while still converging to a pair.
An illustration of this enlargement is shown in Figure 6. In the limit as β → ∞, single
synaptic inputs produce no activation whatsoever, so that any weight vector that begins with more than one component can never have fewer than two components. It is possible to estimate the critical value φ_c above which φ must be set to develop pairs in our system. For a system with probability p of an input pixel being active, this is given by

φ_c ≈ (2 + 3p) / 5   (6.1)
(see Appendix for a derivation). Note that the potential for enlargement of the φ parameter regime as compared to the linear system increases as the inputs become sparser. An upper bound for φ is harder to define, for reasons that bring us to the last issue concerning weight evolution: why the sigmoid units develop disparity pairs preferentially over nondisparity pairs. With the proper parameters, the sigmoid units pick out disparity pairs exclusively; that is, they become sensitive to shifts of +1, 0, and −1, but not +2 or −3, for example, since these disparities do not appear in the input. Members of a disparity pair fire together more often than the members of nondisparity pairs, but only slightly: for p = 1/2, the probability of a disparity pair of inputs firing together is 1/3 while the probability for a nondisparity pair is 1/4.⁴ From our simulations we see that this is enough to allow the sigmoid units to select disparity pairs exclusively for values of φ near φ_c; but as φ is increased the units begin to pick out nondisparity pairs occasionally. For example, in simulations with φ = 0.7 the sigmoid units always picked out disparity pairs, while for φ = 0.85, 9% of the units (8 out of 90) converged to nondisparity pairs. (Other parameters were kept equal to those used in Fig. 3.) This may occur because, with the greater rate of weight reduction for higher φ, the system does not have as much time to sample the input ensemble, allowing fluctuations in the input or in the initial weight configuration to have a greater effect in picking out an otherwise unfavored pair.

6.2 Clustering. Following the analysis of Rumelhart and Zipser (1985) for the ψ = 0 case, the third layer acts as a clustering mechanism that partitions the outputs from the second layer into compact regions. This relationship can be expressed graphically by projecting the input to the clustering layer onto the surface of a sphere, as shown in Figure 7.
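The winner-take-all clustering step of equations 4.3 and 4.4 can be sketched as follows; the learning rate, the nonnegativity clamp, and the renormalization form are our own assumptions:

```python
def wta_step(V, y, psi=0.25, eta=0.1):
    """One winner-take-all clustering update: the unit with the largest
    summed input wins (eq. 4.3); only its weight row moves toward the
    input, offset by psi (eq. 4.4), and is then renormalized so that
    sum_j V_kj = 1."""
    k = max(range(len(V)),
            key=lambda m: sum(V[m][j] * y[j] for j in range(len(y))))
    V[k] = [V[k][j] + eta * (y[j] - psi) for j in range(len(y))]
    V[k] = [max(v, 0.0) for v in V[k]]   # keep weights nonnegative
    s = sum(V[k]) or 1.0
    V[k] = [v / s for v in V[k]]         # renormalize the winner's row
    return k
```

Because only the winner moves, repeated presentations pull each V_k toward the mean of one cluster of second-layer response vectors.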
Clusters are formed based upon the distance between points on the sphere, and the weight vectors V_k describe the cluster midpoints. The axes for this three-dimensional subspace are defined so that each

⁴In general, the ratio of these quantities, P(disparity pair)/P(nondisparity pair), equals (1 + 2p)/3p, indicating that learning is facilitated by sparse inputs (cf. equation 6.1; see also Field 1994).

⁵The constraint Σ_j V_kj = 1 actually defines a plane, but we project to a sphere for consistency with the geometric analogy used in Rumelhart and Zipser (1985). Either projection leads to the same conclusions in what follows.
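The pair-firing probabilities quoted above (1/3 for a disparity pair versus 1/4 for a nondisparity pair at p = 1/2, ratio (1 + 2p)/3p in general) can be checked with a quick Monte Carlo over the stereogram ensemble; the variable names and the small wraparound images are our own:

```python
import random

def pair_firing_probs(p=0.5, trials=100_000, seed=1):
    """Estimate P(both members fire) for a disparity pair (a left pixel
    and the right pixel it maps to under the image's disparity) and for
    a nondisparity pair (offset by an extra pixel)."""
    rng = random.Random(seed)
    disp = nondisp = 0
    for _ in range(trials):
        d = rng.choice([-1, 0, 1])          # disparity of this image
        left = [1 if rng.random() < p else 0 for _ in range(8)]
        right = [left[(i - d) % 8] for i in range(8)]
        disp += left[3] & right[3]          # pair tuned to d = 0
        nondisp += left[3] & right[5]       # pair tuned to d = 2 (absent)
    return disp / trials, nondisp / trials
```

At p = 1/2 the estimates come out near 1/3 and 1/4, a ratio of about 4/3, matching the formula in footnote 4; the small edge is what lets disparity pairs win the synaptic competition.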
Figure 7: Distribution of second-layer outputs projected onto the sphere. Each symbol on the sphere represents the output of the second layer in response to one of the 200 stereograms presented to the network. The symbol shape corresponds to the actual disparity of the stereogram used to generate that point: open triangles, filled triangles, and gray circles represent disparities of +1, 0, and −1, respectively. The vector y was projected onto a sphere by reordering the basis used for the y_j into three groups. The first group corresponded to y_j's tuned for +1 disparity, the second group corresponded to those tuned for 0 disparity, while the last group corresponded to −1 disparity. A three-dimensional subspace was defined by axis directions (1,1,...,1; 0,...,0; 0,...,0), (0,...,0; 1,1,...,1; 0,...,0), and (0,...,0; 0,...,0; 1,1,...,1), where semicolons indicate the division between groups. The sphere corresponds to the unit sphere embedded in this subspace. Small ×s mark the predicted center of mass for each symbol.
corresponds to a different disparity, and the unit sphere is embedded in this subspace as explained in the caption to Figure 7. If we project the output of the second layer of our network over 200 presentations onto the surface of this sphere, the points distribute to form a triangle (Fig. 7). The natural clusters for data distributed evenly over a triangle are the three corner regions. (To see this, create the Voronoi diagram that divides up the points equally for the triangular region using three polyhedra.) Thus, for ψ = 0 the V_kj will stably align with the three corner regions of the triangle (in the vicinity of the ×s in Fig. 7) because these directions minimize cluster size. The z_k become disparity selective because the corners of this triangle also correspond to the different disparities. This structure can be seen by looking once again at Figure 7, where each point is labeled according to the disparity of the stereogram that generated it with either a filled triangle, open triangle, or gray circle. The center of mass of each symbol is strongly biased toward one of the corners of the output distribution, reflecting the different expectations in the y_j given the disparity of the input. That is, for p = 1/2:

⟨y_j⟩ = 1/2 given D = d_j;  ⟨y_j⟩ = 1/4 given D ≠ d_j   (6.2)
where d_j is the disparity for which y_j is best tuned. In this way, position-independent disparity tuning results from the geometry of the second-layer responses to stereograms.⁶ Allowing ψ > 0 has little effect on the clustering actually performed by the network, though it makes it easier to see and analyze the clusters by "extremizing" the weight vectors, i.e., the weight vectors are moved away from the interior of the triangle and toward the corner vertices. These spherical plots also illustrate why a linear second layer is inadequate for producing disparity selectivity. Figure 8 shows the output of a second layer using the same 200 input patterns and identical weights as in Figure 7, but with a linear activation function. Note that there is no structured shape to the distribution and that the different disparities are completely intermixed. Correspondingly, for the linear case, the average activities of the y_j do not differentiate between disparities; that is, for y_j = βw·x,

⟨y_j⟩ = βw_max given D = d_j;  ⟨y_j⟩ = βw_max given D ≠ d_j   (6.3)
Thus, in this case, the V_kj can never reach a stable equilibrium. Instead, as might be predicted from Figure 8, the V_kj cycle continuously.

⁶Note, this geometry is contingent upon having spatially separate receptive fields. The clusters that confer position-independent disparity tuning would be disrupted by the strong correlation in firing between two units with overlapping fields. Thus, for this form of model to be effective, some mechanism, such as lateral inhibition or, as in the model described in this paper, direct subdivision of the input array, must exist to keep the units spatially decorrelated.
Figure 8: Distribution of linear second-layer outputs. Spherical plot generated in a manner identical to Figure 7, using the same 200 stereograms, except that the output of the second layer was based upon a linear activation function rather than a sigmoidal one.
6.3 Network Performance. We can calculate the accuracy of the network in classifying stereograms. Consider the y_j as binary random variables, with, for example, p = 1/2, P(y_j = 1 | D = d_{y_j}) = 1/2, and P(y_j = 1 | D ≠ d_{y_j}) = 1/4. Let Z_k denote the set of indices of the second-layer units to which z_k connects strongly, and let N_k equal the number of elements in this set. At equilibrium with ψ > 0, we observe that z_k = Σ_j V_kj y_j = (1/N_k) Σ_{j∈Z_k} y_j. For convenience, define the integer random variable S_k = N_k z_k. These S_k are conditionally independent, binomially distributed random variables, and their values on a given trial
determine the winner among the third-layer units. When an image with disparity D is presented, the performance of z_k in signaling the correct disparity is given by

P(z_k = winner | D = d_{z_k})   (6.4)

For symmetrically distributed y_j (i.e., N_i = N_j for all i, j) this simplifies to

P(z_k = winner | D = d_{z_k}) = Σ_l P(S_k = l | D = d_{z_k}) Π_{m≠k} P(S_m < l | D = d_{z_k})   (6.5)

P(S_m = l | D = d_{z_k}) = C(N_m, l) q^l (1 − q)^{N_m − l}   (6.6)

where d_{z_k} is the disparity that z_k is tuned to, C(N_m, l) denotes the binomial coefficient, and q = P(y_j = 1 | D = d_{z_k}) for j ∈ Z_m. This formula indicates that this type of network can classify stereograms with arbitrarily good accuracy by adding enough units. For p = 1/2 and a network with 18 second-layer units divided symmetrically among the z_k, P(correct) = P(z_k = winner | D = d_{z_k}) = 0.61 (versus 0.33 for chance). Accuracy improves gradually as the number of second-layer units increases, with the number of nodes scaling exponentially with increasing P(correct) for accuracies from 50 to 95%. (Ninety units are needed for a 95% accuracy level in our example network.) Likewise, to maintain fixed accuracy with an increasing number of disparities, |D|, the number of units must scale (approximately) as O(|D| log |D|).

7 Discussion
Our model demonstrates that a simple Hebbian network can learn to detect disparity. This is in contrast to previous unsupervised models, which have employed more complex and less biologically plausible learning schemes. In particular, it is instructive to compare our work with that of Sanger (1989) and Becker and Hinton (1992). Sanger's network learns under a nonlinear extension of a "Generalized Hebbian Algorithm," which maximizes an observer's ability to reconstruct the network's input from its output. When presented with images derived from random-dot stereograms containing two disparities, the network learns to discriminate between these two disparities. Like our model, Sanger's network has a three-layer structure with simple (rectifying) nonlinearities in the second layer to generate the initial disparity sensitivity. However, learning occurs quite differently from our model in that, once the second layer has converged, the disparity-sensitive units are identified and the
remaining units are discarded. The third layer is then trained solely on the outputs of these hand-picked units. In addition, Sanger’s network cannot be implemented to simultaneously allow for a simple feedforward network architecture and a local learning rule. Finally, though learning can be made local, Generalized Hebbian Learning requires that synapses on different neurons be constrained to maintain the same synaptic strengths (Hertz et al. 1991, pp. 206-209). A biological mechanism capable of enforcing this constraint has yet to be demonstrated. Becker and Hinton (1992) address a more complex version of the disparity problem in that their network learns to discriminate a continuous range of disparities and is also capable of representing spatial variations in disparity (i.e., curved surfaces). Their network is correspondingly more elaborate than ours. Learning is based upon the principle of maximizing the mutual information between groups of units viewing adjacent regions of visual space. In contrast with the simple Hebbian mechanisms used in our model, the weight update rule involves nonlocal backpropagation of the information signal. Both Sanger’s and Becker and Hinton’s approaches involve powerful learning algorithms that can be more easily generalized to other problem domains than our own because they are derived from well-defined optimality principles. Our network embodies a complementary approach, emphasizing the ease of analysis and simplicity of the learning rule in the context of a specific task. 7.1 Computational Issues and Neurobiological Relevance. A criticism of much work involving “toy problem” networks is that by failing to characterize their scaling properties, they tend to leave the impression that a problem has been ”solved” once and for all, leading investigators to neglect other approaches. 
In light of this, we point out that our network is computationally expensive to scale up to detect more disparities because network size scales as O(|D| log |D|). For example, a network that detects 20 disparity levels at 95% accuracy would require approximately 1000 second-layer units, a large number to require for even the most massively parallel devices. One of the reasons for this scaling inefficiency is that the steep sigmoid activation function on the subunits in the intermediate layer and the winner-take-all at the output layer effectively "binarize" the units' responses, throwing out the information to be gained from a unit's continuous output. In addition, the input layer's pixel-based representation is far from optimal for the task. In the primate visual system, for example, the first stage of binocular image processing uses a monocular, spatially distributed input representation akin to a difference-of-gaussians (DOG), or possibly even orientation-tuned, set of filters. Disparity algorithms based upon this sort of representation both perform more robustly and scale more gracefully for increasing disparity ranges than algorithms that use pixel representations alone (Qian 1994; Fleet et al. 1991).
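The accuracy and scaling figures can be reproduced from the calculation in Section 6.3; here is a direct numerical sketch (function names are ours, and ties among the S_k are counted as losses, matching the strict inequality in equation 6.5):

```python
from math import comb

def binom_pmf(n, k, q):
    return comb(n, k) * q**k * (1 - q)**(n - k)

def p_correct(n_per_unit, q_tuned=0.5, q_other=0.25, n_outputs=3):
    """P(z_k = winner | D = d_{z_k}) for symmetric groups of n_per_unit
    second-layer units feeding each of n_outputs z units (eqs. 6.5-6.6):
    sum over l of P(S_k = l) * P(every other S_m < l)."""
    total = 0.0
    for l in range(n_per_unit + 1):
        cdf_below = sum(binom_pmf(n_per_unit, j, q_other) for j in range(l))
        total += binom_pmf(n_per_unit, l, q_tuned) * cdf_below ** (n_outputs - 1)
    return total

print(p_correct(6))    # 18 units split 3 ways: about 0.61, vs. 1/3 chance
print(p_correct(30))   # 90 units: about 0.95
```

The same function, swept over n_per_unit and over larger disparity sets, reproduces the gradual improvement with unit count and the unfavorable growth in units needed per added disparity.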
While our emphasis in this paper has been to keep the network simple to highlight the essential aspects of nonlinear Hebbian learning, we are currently working to extend the model defined here to use continuous-valued units and a spatially distributed representation. Our preliminary results indicate that it is difficult for a single layer of a Hebbian network to develop a DOG-like or Gabor-like representation while simultaneously developing binocular disparity sensitivity (see also Erwin et al. 1995). One way around this difficulty appears to be the use of multiple network layers, with each layer learning one stage of the transformation. The brain uses such a multistage architecture to produce its representation, even though one layer would theoretically be adequate to implement the transformation. These observations, coupled with the scaling results set forth above, suggest that a multistage architecture may be a way of dealing with one of the general problems in designing a neural computer: that, for a given task, there is a tradeoff between network fan-in, depth, and number of nodes that can affect both the network's ability to represent a given function and its ability to learn that function (cf. Minsky and Papert 1988). From this point of view, the brain's hierarchy of visual processing (Felleman and Van Essen 1991) embodies a workable compromise that balances the costs of adding more connections, more neurons, or more stages, so as to effectively mix parallel and serial modes of computation. Besides the lack of preprocessing mentioned previously, our network neglects the complexities of neural processing in the visual system in a number of respects. Real neurons in visual cortex receive input from thousands of synapses, have extensive and overlapping receptive fields,⁷ and are regulated by sophisticated adaptive gain control mechanisms (Carandini and Heeger 1994) when presented with realistic images.
In addition, while our learning rule captures the early aspects of synaptic plasticity (Levy et al. 1990), more sophisticated learning rules (e.g., Bienenstock et al. 1982) are required to better approximate the neurobiology. Nevertheless, we believe that our model may capture some of the essential, nonlinear aspects of synaptic learning, and it serves to illustrate a general strategy available to the nervous system: how a layer of units may be used to nonlinearly transform an input so that a downstream layer can learn to discriminate higher-order features in the environment. Evidence for this computational strategy can be seen across species and across sensory modalities. In the monkey visual system (Poggio et al. 1988), the bat echolocation system (Olsen and Suga 1991), and the sound localization system of the barn owl (Carr and Konishi 1990; Konishi 1992), nonlinear units act to transform sensory information into a format that explicitly represents coincidences within the input stream. Our results

⁷For an interesting model that deals with some of these issues and that is more closely related to the developmental neurobiology, see Berns et al. (1993). Though their model does not possess the requisite nonlinearities to perform disparity detection, they generate the primitives of disparity tuning via a Hebbian-type mechanism that relies upon the use of critical periods for development.
indicate that in addition to their function in processing signals, such coincidence detectors may also play a role in learning.
Appendix: Calculation of φ_c

To estimate φ_c, we rely upon a heuristic argument based upon a combination of observation, assumption, and approximation. The argument, while not completely rigorous, has given us useful rules of thumb for understanding the system, and it appears to be supported by our simulations. We begin by stating a few features of the system of equations defining the weights for a second-layer unit.
Characteristics of the System. Perhaps the most important aspect we observe of the dynamical system, S, as defined by equation 3.4, is that, like related linear systems (Linsker 1988; MacKay and Miller 1990; Miller and MacKay 1994), it has no equilibrium points within the interior of the hypercube domain of w_i, as defined by 0 ≤ w_i ≤ w_max. All the stable points lie at the corners of the hypercube, which means that ultimately a w_i becomes either 0 or w_max. A corollary to this fact is that the dimensionality of S effectively decreases as various w_i become zero. For the values of φ that we are interested in, we can view the evolution of S as a passage through a family of dynamic systems, S_r, where the subscript denotes the effective dimension of the system. Over time, the dimension is steadily reduced until a final stable point is reached, that is, S_initial → S_initial−1 → ... → S_final. In this framework, our problem is reduced to finding that value of φ which makes S_final = S_2.
Stability Analysis. Let r denote the effective dimension of a given system, i.e., for S_i, r = i. Let w^(r)(t) denote the vector of weights w_i that are not zero, and let n^(r) denote the r-dimensional vector with n_i^(r) = 1. An S_r can be the final system in the chain only when the point w^(r)(t) = w_max n^(r) is a stable point of the system S_r. To analyze the stability of this point, we resort to a piecewise linear approximation of σ:

    σ(x) ≈ 0 for x ≤ c_1,   m(x − c_1) for c_1 < x < c_2,   1 for x ≥ c_2    (A.1)

where c_1 = (m − 1)/2m and c_2 = c_1 + m^(−1), and m is the slope of the line chosen to match the sigmoid. We can motivate this approximation by observing that the sigmoid has an approximately linear region around input levels of one half, and by noting that, in simulations using this piecewise
Christopher W. Lee and Bruno A. Olshausen
function instead of σ, we saw behavior similar to that of simulations using a smooth sigmoid.⁸ If we substitute this approximation for the linear regime into equation 3.4 and set the constant of proportionality to one, we get

    dw_i/dt = m( Σ_j w_j x_j − c_1 )( x_i − φ )    (A.2)
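As a numerical sanity check (our illustration, not code from the paper; the exact typeset form of equation A.1 is partly lost in the scan, so the piecewise definition below is reconstructed from the constants c_1 = (m − 1)/2m and c_2 = c_1 + 1/m given in the text):

```python
def piecewise_sigmoid(x, m):
    """Piecewise linear approximation of the sigmoid (our reading of
    equation A.1): 0 below c1, a line of slope m between c1 and
    c2 = c1 + 1/m, and 1 above c2.  With c1 = (m - 1) / (2m) the linear
    piece passes through (1/2, 1/2), matching the sigmoid's approximately
    linear region around input levels of one half."""
    c1 = (m - 1.0) / (2.0 * m)
    c2 = c1 + 1.0 / m
    if x <= c1:
        return 0.0
    if x >= c2:
        return 1.0
    return m * (x - c1)

# The midpoint property holds for any slope m > 1:
print(piecewise_sigmoid(0.5, 4.0))  # 0.5
```

The midpoint property is what makes the approximation a reasonable stand-in for the smooth sigmoid in the linear regime used by the stability analysis.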
Following MacKay and Miller (1990), we write the equations for the linear region of the activation function, average over the input, and write the result in matrix form [while suppressing the superscript (r)]:

    ⟨ẇ⟩ = (mQ + k_2 J)w + k_1 n    (A.3)
    k_1 = m c_1 (φ − μ)    (A.4)
    k_2 = m μ (μ − φ)    (A.5)

where μ = ⟨x⟩ denotes the mean input.
Q is the covariance matrix ⟨(x_i − ⟨x⟩)(x_j − ⟨x⟩)⟩, J is the matrix with J_ij = 1, and n is defined by n_i = 1 in the synaptic basis. Numerical solution for the fixed points of these systems, w_FP, shows that the fixed point generally lies outside the hypercube [0, w_max]^r, close to the axis defined by n. Given this, the stability of the hypercube vertex w_max n is largely determined by the eigenvector (and its associated eigenvalue) of mQ + k_2 J that has the largest component along the direction n. Call this eigenvector v_DC and denote its eigenvalue γ_DC. When γ_DC is close to zero, the point w_max n is unstable because the other orthogonal eigenvectors with larger eigenvalues dominate the trajectory, carrying w(t) away from that vertex.⁹ We can solve directly for the value of φ that makes γ_DC = 0 for r = 3. For inputs consisting of a disparity pair and one non-correlated input, Q is given by
    Q = [ μ − μ²   ν − μ²   0
          ν − μ²   μ − μ²   0
          0        0        μ − μ² ]    (A.6)

where ν denotes the pairwise correlation of the disparity pair. Let φ_c^(3) denote the value of φ where γ_DC = 0. Then φ_c^(3) is given by
⁸An exception to this remark occurs for those units whose activation cannot rise above zero because of the absolute threshold of our approximation. These units become "dead units." In the nonlinear case, the smooth lower leg of the sigmoid allows such units' weights to increase.
⁹Note, these other, non-v_DC, eigenvectors are orthogonal because the matrix is symmetric. MacKay and Miller (1990) show these eigenvectors and eigenvalues are less affected by changes in φ. The stability of vertices comes from the shape of the hypercube: at the boundary, trajectories tend to be projected toward the corners.
We use this value as our estimate for φ_c. One might worry that while this value for φ_c might destabilize S_3, it might not destabilize other, higher-dimensioned systems. Direct calculation shows that φ_c^(r) > φ_c^(r+1) for r = 2, 3, 4, .... In addition, for r > 4 or so, S_r is usually in the saturated region of the activation function, for which any φ > μ implies instability.
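The vertex-stability test sketched above can be probed numerically. The helper below is our illustration, not the authors' code: instead of solving the full eigenproblem, it uses the Rayleigh quotient of mQ + k_2 J along n as a rough proxy for γ_DC, under the assumption that v_DC remains well aligned with n.

```python
def rayleigh_along_n(m, Q, k2):
    """Rough proxy for gamma_DC: the Rayleigh quotient of A = m*Q + k2*J
    along n = (1, ..., 1), assuming the eigenvector v_DC stays well
    aligned with n.  When this quantity approaches zero, the vertex
    w_max * n loses stability."""
    r = len(Q)
    # (A n)_i = m * sum_j Q_ij + k2 * r, since (J n)_i = r
    An = [m * sum(Q[i]) + k2 * r for i in range(r)]
    return sum(An) / r  # n.(A n) / n.n, with n.n = r

# Example with an identity covariance: a negative k2 (i.e. phi > mu)
# pulls the quotient down toward instability.
Q = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(rayleigh_along_n(1.0, Q, 0.0))   # 1.0
print(rayleigh_along_n(1.0, Q, -1.0))  # -2.0
```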
References

Becker, S., and Hinton, G. E. 1992. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature (London) 355, 161-163.
Berns, G. S., Dayan, P., and Sejnowski, T. J. 1993. A correlational model for the development of disparity selectivity in visual cortex that depends on prenatal and postnatal phases. Proc. Natl. Acad. Sci. U.S.A. 90(17), 8277-8281.
Bienenstock, E. L., Cooper, L. N., and Munro, P. W. 1982. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2, 32-48.
Bliss, T. V. P., and Collingridge, G. L. 1993. A synaptic model of memory: Long-term potentiation in the hippocampus. Nature (London) 361, 31-39.
Carandini, M. C., and Heeger, D. J. 1994. Summation and division by neurons in primate visual cortex. Science 264, 1333-1336.
Carr, C. E., and Konishi, M. 1990. A circuit for detection of interaural time differences in the brain stem of the barn owl. J. Neurosci. 10(10), 3227-3246.
Erwin, E., Obermayer, K., and Schulten, K. 1995. Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neural Comp. 7, 425-468.
Felleman, D. J., and Van Essen, D. C. 1991. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex 1(1), 1-47.
Field, D. J. 1994. What is the goal of sensory coding? Neural Comp. 6, 559-601.
Fleet, D. J., Jepson, A. D., and Jenkin, M. R. M. 1991. Phase-based disparity measurement. CVGIP 53, 198-210.
Goodhill, G. J., and Barrow, H. G. 1994. The role of weight normalization in competitive learning. Neural Comp. 6, 255-269.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Holmes, W. R., and Levy, W. B. 1990. Insights into associative long-term potentiation from computational models of NMDA receptor-mediated calcium influx and intracellular calcium concentration changes. J. Neurophysiol. 63, 1148-1168.
Julesz, B. 1971. Foundations of Cyclopean Perception. University of Chicago Press, Chicago (as cited in Marr 1982, pp. 111-159).
Konishi, M. 1992. The neural algorithm for sound localization in the owl. Harvey Lect. 86, 47-64.
Levy, W. B., Colbert, C. M., and Desmond, N. L. 1990. Elemental adaptive processes of neurons and synapses: A statistical/computational perspective. In Neuroscience and Connectionist Theory, M. A. Gluck and D. E. Rumelhart, eds., pp. 187-235. Lawrence Erlbaum, Hillsdale, NJ.
Linsker, R. 1988. Self-organization in a perceptual network. Computer, March, 105-117.
MacKay, D. J. C., and Miller, K. D. 1990. Analysis of Linsker's application of Hebbian rules to linear networks. Network 1, 257-298.
Marr, D. 1982. Vision. W. H. Freeman, New York.
Marr, D., and Poggio, T. 1975. Cooperative computation of stereo disparity. Science 194, 283-287.
McNaughton, B. L., Douglas, R. M., and Goddard, G. V. 1978. Synaptic enhancement in fascia dentata: Cooperativity among co-active afferents. Brain Research 157, 277-293.
Miller, K. D. 1990. Derivation of linear Hebbian equations from a nonlinear Hebbian model of synaptic plasticity. Neural Comp. 2, 319-331.
Miller, K. D., and MacKay, D. J. C. 1994. The role of constraints in Hebbian learning. Neural Comp. 6, 100-126.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1989. Ocular dominance column development: Analysis and simulation. Science 245, 605-615.
Minsky, M. L., and Papert, S. A. 1988. Perceptrons: An Introduction to Computational Geometry (expanded edition). MIT Press, Cambridge, MA.
Olshausen, B. A., and Field, D. J. 1995. Sparse coding of natural images produces localized, oriented, bandpass receptive fields. Tech. Rep. CNN-10095, Dept. of Psychology, Cornell University. (Submitted for publication.)
Olsen, J. F., and Suga, N. 1991. Combination-sensitive neurons in the medial geniculate body of the mustached bat: Encoding of target range information. J. Neurophysiol. 65(6), 1275-1296.
Poggio, G. F., Gonzalez, F., and Krause, F. 1988. Stereoscopic mechanisms in monkey visual cortex: Binocular correlation and disparity selectivity. J. Neurosci. 8(12), 4531-4550.
Qian, N. 1994. Computing stereo disparity and motion with known binocular cell properties. Neural Comp. 6, 390-404.
Rumelhart, D. E., and Zipser, D. 1985. Feature discovery by competitive learning. Cog. Sci. 9, 75-112. Reprinted in Rumelhart et al. (1986, vol. 1, chap. 5).
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press, Cambridge, MA.
Sanger, T. D. 1989. An optimality principle for unsupervised learning. In Advances in Neural Information Processing Systems 1, D. Touretzky, ed., pp. 11-19. Morgan Kaufmann, San Mateo, CA.
Sereno, M. I., and Sereno, M. E. 1991. Learning to see rotation and dilation with a Hebb rule. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. Moody, and D. S. Touretzky, eds. Morgan Kaufmann, San Mateo, CA.
Received January 3, 1994; accepted July 18, 1995
Communicated by Richard Zemel
Predictive Minimum Description Length Criterion for Time Series Modeling with Neural Networks

Mikko Lehtokangas
Jukka Saarinen
Tampere University of Technology, Microelectronics Laboratory, P.O. Box 692, FIN-33101 Tampere, Finland
Pentti Huuhtanen
University of Tampere, Department of Mathematical Sciences, P.O. Box 607, FIN-33101 Tampere, Finland
Kimmo Kaski
Tampere University of Technology, Microelectronics Laboratory, P.O. Box 692, FIN-33102 Tampere, Finland
Nonlinear time series modeling with a multilayer perceptron network is presented. An important aspect of this modeling is the model selection, i.e., the problem of determining the size as well as the complexity of the model. To overcome this problem we apply the predictive minimum description length (PMDL) principle as a minimization criterion. In the neural network scheme it means minimizing the number of input and hidden units. Three time series modeling experiments are used to examine the usefulness of the PMDL model selection scheme. A comparison with the widely used cross-validation technique is also presented. In our experiments the PMDL scheme and the cross-validation scheme yield similar results in terms of model complexity. However, the PMDL method was found to be two times faster to compute. This is a significant improvement since model selection in general is very time consuming.

1 Introduction
During the past 70 years time series analysis has become a highly developed subject. The first attempts date back to the 1920s when Yule (1927) applied a linear autoregressive model to the study of sunspot numbers. In the 1950s the basic theory of stationary time series was covered by Doob (1953). Now there are well-established methods for fitting a wide range of models to time series data. The most well known is the set of linear probabilistic ARIMA models (Box and Jenkins 1970). There are also several nonlinear models, e.g., Volterra series (Volterra 1959), bilinear models (Priestley 1988), threshold AR models (TAR; Tong 1983), exponential AR models (EXPAR; Ozaki 1978), and AR models with conditional heteroscedasticity (ARCH; Engle 1982). Recently several neural network techniques such as the multilayer perceptron (MLP) network (Rumelhart et al. 1986) and the radial basis function (RBF) network (Powell 1987; Moody and Darken 1988) have also been applied to time series modeling. The common factor in neural network techniques is that the models are constructed from simple processing units that perform nonlinear input-output mapping. The nonlinear nature of these units makes neural network techniques well suited for nonlinear modeling. An important aspect of the modeling is the problem of model selection, i.e., the problem of determining the size and complexity of the model (Weigend et al. 1990). There is a trade-off because an undersized model does not have the power to model the given data. On the other hand, an oversized model has a tendency to perform poorly on unseen data. To overcome this problem we apply the predictive minimum description length (PMDL) principle (Rissanen 1984). It provides a criterion for minimizing the complexity of the model. Using the systematic PMDL procedure, MLP networks are applied to time series modeling and prediction. The method reduces the risk of under- or overfitting the model.

Neural Computation 8, 583-593 (1996)
© 1996 Massachusetts Institute of Technology

2 Multilayer Perceptron Neural Network
In this study we used the MLP architecture shown in Figure 1 for time series modeling. The number of input units is p and the number of hidden units is q. The notation MLP(p,q) will be used to refer to a specific network structure. The weights in the connections between the input and hidden layers are denoted by w_ij, and the weights between the hidden and output layers are denoted by v_j. In addition, the hidden and output neurons have the bias terms w_0j and v_0, respectively. The activation function in the hidden units was chosen to be the hyperbolic tangent (tanh) function. The output neuron was set to be linear.
The mathematical formula for the network can be written as

    x̂_t = v_0 + Σ_{j=1}^{q} v_j tanh( w_{0j} + Σ_{i=1}^{p} w_{ij} x_{t−i} )    (2.1)

The training of the network was done in two phases. First, the initial values for the weights were calculated with the orthogonal least squares algorithm (Lehtokangas et al. 1995). In the second phase the standard backpropagation algorithm was used for weight adjustment (Rumelhart et al. 1986). Note that the first initialization phase was used merely to speed up the backpropagation training. This does not affect the model selection results, which are shown in the experiments section.
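The forward pass of such a network is straightforward. The following sketch is ours, not the authors' implementation, and the argument names are our own; it simply evaluates equation 2.1 for one forecast step.

```python
import math

def mlp_forecast(x_lags, w, w0, v, v0):
    """One-step forecast of an MLP(p, q) as in equation 2.1: p lagged
    inputs, q tanh hidden units, linear output.

    x_lags : the p most recent observations
    w      : q-by-p input-to-hidden weights w_ij
    w0     : q hidden-unit biases w_0j
    v      : q hidden-to-output weights v_j
    v0     : output bias v_0
    """
    hidden = [math.tanh(w0[j] + sum(wji * x for wji, x in zip(w[j], x_lags)))
              for j in range(len(v))]
    return v0 + sum(vj * hj for vj, hj in zip(v, hidden))

# MLP(2,1) reduces to v0 + v1 * tanh(w01 + w11*x1 + w12*x2):
print(mlp_forecast([0.5, -0.2], [[1.0, 0.0]], [0.0], [2.0], 0.1))
```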
Figure 1: Three-layer perceptron network with a single output.

3 Stochastic Complexity and the Predictive MDL Principle
Even though time series modeling has been studied extensively, there has been no definite methodology for solving the model selection problem. Also, despite the existing model selection techniques there is still a tendency to use models that have an excessive number of parameters. For instance, in neural network modeling too many units in the hidden layer are commonly used. Usually an excessive number of parameters deteriorates the generalization properties of a model. Therefore, it is important to find a model that has the simplest possible structure for the given problem. Inspired by the algorithmic notion of complexity (Solomonoff 1964; Kolmogorov 1965; Chaitin 1966) as well as Akaike's work (Akaike 1977), Rissanen proposed the shortest code length for the observed data as a criterion for model selection (Rissanen 1978). In subsequent papers (Rissanen 1983, 1984, 1986, 1987) this gradually evolved into stochastic complexity, which is briefly described in the following. For applications, the most important coding system is obtained from a class of parametric probability models:

    M = { f(x | θ), π(θ) : θ ∈ Ω_k, k = 1, 2, ... }    (3.1)
in which Ω_k is a subset of the k-dimensional Euclidean space with nonempty interior. Hence, there are k "free" parameters. The stochastic complexity of x, relative to the model class M, is now according to Rissanen (1987)

    I(x | M) = −log f(x | M), with f(x | M) = ∫_{Ω_k} f(x | θ) dπ(θ)    (3.2)

Although the model class M includes the so-called "prior" distribution π, its role in expressing prior knowledge is here not different from that
of f(x | θ). In fact, the former need not be selected at all, for it is possible to construct it from the model class as a generalization of Jeffreys' prior (Clarke and Barron 1993; Rissanen 1993). Also, particularly important pairs of distributions f(x | θ) and π(θ) are the so-called conjugate distributions, because for them the integral 3.2 can be evaluated in closed form. The stochastic complexity now represents the shortest code length attainable by the given model class. Yet there is at least one problem to solve, namely the integral in 3.2. Various ways to approximate the integral are discussed in Rissanen (1987). In the following, one approximate version, the so-called predictive MDL principle, is presented. Frequently, for example in curve fitting and related problems, the models are not primarily expressed in terms of a distribution. Rather we are given a parametric predictor x̂_{t+1} = F(x | θ), as in the case of neural networks, for which x = [x_t, . . .] is the input and θ denotes the array of all the weights as parameters. In addition, there is a distance function δ(ε_t) for measuring the prediction error ε_t = x_t − x̂_t. Such a prediction model can immediately be reduced to a probabilistic model. In this case a conditional gaussian distribution can be defined for the prediction errors as follows:
    f(x_{t+1} | x^t, θ, σ²) = (2πσ²)^(−1/2) exp( −ε²_{t+1} / (2σ²) )    (3.3)

in which x^t = x_1, ..., x_t. Taking the negative logarithm of the density 3.3 and extending it to sequences by multiplication gives
    −ln f(x^n | θ, σ²) = −Σ_{t=0}^{n−1} ln f(x_{t+1} | x^t, θ, σ²)    (3.4)

The above code length can also be written in the form

    −ln f(x^n) = −Σ_{t=0}^{n−1} ln f(x_{t+1} | x^t, θ̂_{t+1}, σ̂²_{t+1})
                 + Σ_{t=0}^{n−1} ln [ f(x_{t+1} | x^t, θ̂_{t+1}, σ̂²_{t+1}) / f(x_{t+1} | x^t, θ̂_t, σ̂²_t) ]    (3.5)
The first term in 3.5 is the predictive code length for the data, or Shannon's information. The additional code length represented by the sum in 3.5 is the model cost, i.e., the code length needed to encode the model. It has been proven by Rissanen (1994) that the sum term in 3.5 is asymptotically k log(n)/2. After having fixed the model class, we have the problem of estimating the shortest code length attainable with this class of models. Let θ̂(x^t) and σ̂²(x^t) be written briefly as θ̂_t and σ̂²_t. They are the maximum likelihood estimates, i.e., the parameter values that minimize the code length −ln f(x^t | θ, σ²) for the past data. In particular,

    σ̂²_t = (1/t) Σ_{τ=1}^{t} ε̂²_τ    (3.6)
The predictive code length for the data and the model is now given by

    −ln f(x^n | k) = (1/2) Σ_{t=0}^{n−1} [ ε²_{t+1}/σ̂²_t + 2 ln σ̂_t ] + (n/2) ln(2π)    (3.7)
in which a suitable initial value for σ̂²_0 is picked. In this form the model cost appears only implicitly. However, as equation 3.5 showed, the model cost is indeed included in this criterion. Therefore, in the predictive MDL algorithm, the network parameters need not be encoded, and they can be calculated from the past string by an algorithm. Hence, the model cost gets added to the prediction errors, and overfitting and underfitting characteristics are penalized automatically. More details of the MDL principles have been described in Rissanen (1994) and Lehtokangas et al. (1993a,b). Here we have not used the general form 3.7 of the code length, but an approximative version of it, obtained by assuming the variance σ² to be constant. Thus we need not estimate it. Due to this assumption, the total code length will be shorter, and in cases in which the variance is not even close to constant, the results may be distorted. However, in general, this assumption does not critically affect PMDL model selection (Rissanen 1994). The model structure with the minimum predictive code length represents the PMDL optimal model for the given problem. The predictive MDL algorithm can be represented by the following steps:

Step 1. Generate a data string x^n of length n.

Step 2. Divide the data string x^n into k_max = ⌈n/d⌉ consecutive segments of length d.

Step 3. Select a model structure to be tested and initialize the value of the squared prediction error R_total to zero.
  For k = 1 to k_max − 1:
    Train the selected model with data segments 1, ..., k.
    Compute the squared prediction error R_k for data segment k + 1 using the trained model, and add it to R_total.
  Next k.
  Divide R_total by n − d and set this value as R_model.
  If all model candidates have been tested, go to Step 4; else go to Step 3.

Step 4. Find the minimum R_model value. The model structure that has the minimum R_model value is the PMDL optimum model.
4 Time Series Modeling Experiments
In this section the usefulness of the PMDL model selection scheme is examined with three time series modeling experiments. For comparison purposes, the cross-validation (CV) technique (Wahba and Wold 1975; Utans and Moody 1991) was also used for model selection. The version of cross-validation that was used includes a segmentation of the data similar to the PMDL scheme. In turn, each segment is held out as the validation segment and the rest of the segments are used for training. The first benchmark time series is artificial and was generated by the formula

    x_{t+1} = cos(x_t) + α x_{t−1} |x_{t−2}|^β + ε_{t+1}    (4.1)

in which α = −1.75 and β = 1.5 were used. All the initial conditions were zeros. The additive noise was gaussian such that the signal-to-noise ratio was σ_x/σ_ε = 10. The second modeling problem deals with time series data measured from a physics laboratory experiment. The time series represents the fluctuations in a far-infrared laser and it was recently used in the Santa Fe Time Series Prediction and Analysis Competition (Weigend and Gershenfeld 1994).¹ The third time series represents the load in an electrical network. This series was obtained from industry and, therefore, we cannot reveal any further details about it. We wanted to include these data in this study because they are a good example of real world data. All three series consist of 2000 points and they are depicted in Figure 2. The first 1000 points of each series were used for model selection with the PMDL and cross-validation methods. Since the initial values of the weights affect the results, model selection was repeated five times. The final selection was based on the repetitions such that the minimum criterion values for each structure were used. After model selection the first 1000 points were used as a training set and the remaining 1000 points were used as a test set for generalization. The testing of the selected model structures was repeated a hundred times with different initial weight values on each trial. The normalized mean square error (NMSE) was used as the error metric. It is defined as

    NMSE = (1/(σ² m)) Σ_{t=1}^{m} (x_t − x̂_t)²    (4.2)

in which σ² is the variance of the time series and m is the number of observations. Results for the three experiments are shown in Table 1. The given NMSE error values are averages of the hundred repetitions. The standard
in which u2 is the variance of the time series and m is the number of observations. Results for the three experiments are shown in Table 1. The given NMSE error values are averages of the hundred repetitions. The standard ’The data are available by anonymous ftp at ftp.cs.colorado.edu in /pub/TimeSeries/SantaFe in files A.dat (the first 1000 points) and A.cont (contains continuation for file A.dat; the first 1000 points of the continuation were used as the test set).
Figure 2: Benchmark time series scaled to the interval -1 to 1. (a) Artificial time series, (b) time series measured from a far-infrared laser, and (c) time series representing the load in an electrical network.
Table 1: Results for the Benchmark Problems

Time series                Method  Structure   NMSE     σ_NMSE   NMSE    σ_NMSE
                                               (train)  (train)  (test)  (test)
Artificial time series     PMDL    MLP(3,5)    0.0374   0.0062   0.0386  0.0061
                           CV      MLP(3,5)    0.0374   0.0062   0.0386  0.0061
Fluctuations in a laser    PMDL    MLP(2,6)    0.0295   0.0033   0.0337  0.0046
                           CV      MLP(2,7)    0.0283   0.0042   0.0320  0.0057
Load in an electrical net  PMDL    MLP(4,10)   0.0623   0.0051   0.1085  0.0291
                           CV      MLP(4,9)    0.0629   0.0052   0.1102  0.0330
deviations σ_NMSE for the errors are also given. As can be seen, the PMDL and CV model selection schemes yield very similar results in terms of model complexity. With the artificial series the result is exactly the same, and with the other time series there is only a one-hidden-node difference. This is not really surprising, since the presented version of the PMDL method can be regarded as a type of cross-validation. However, one should keep in mind that the PMDL method has a strong theoretical background, and it may be possible to create an even sharper version of it in which assumptions such as constant error variance are not needed (Weigend and Nix 1994). However, the PMDL scheme presented does have at least one advantage over the cross-validation method: it was found to be two times faster to compute. This is a significant improvement, since model selection, in general, is very time consuming. The speed-up is a direct result of the data segmentation. Assume that the data set used for model selection is divided into s (s > 1) segments of equal size and that the computational cost of training one segment into a model is constant. The computational costs of the methods can then be compared directly by counting how many times each segment is trained into the model. With CV this number is s(s − 1) and with PMDL it is s(s − 1)/2. Hence, with the above assumptions the PMDL method is twice as fast compared to the CV method. Of course, in practical simulations the training times may have small variations, and the above comparison gives at best a rough estimate of the real situation. It is also noted that Rissanen (1994) has proposed a practical modification that can significantly reduce the computational cost of the PMDL procedure.
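The training-cost comparison above is simple counting, and can be checked in a few lines (our illustration, not from the paper):

```python
def cv_trainings(s):
    """Cross-validation: each of the s validation folds trains on the
    other s - 1 segments, giving s * (s - 1) segment-trainings."""
    return s * (s - 1)

def pmdl_trainings(s):
    """Predictive MDL: step k trains on segments 1..k for k = 1..s-1,
    giving 1 + 2 + ... + (s - 1) = s * (s - 1) / 2 segment-trainings."""
    return s * (s - 1) // 2

for s in (5, 10, 20):
    print(s, cv_trainings(s), pmdl_trainings(s))  # PMDL is half the cost
```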
5 Conclusions

Time series modeling with a multilayer perceptron network was presented in this study. The problem of selecting an optimally sized network architecture for a given problem was studied by using the predictive minimum description length principle. The approach provides a systematic procedure for searching and constructing an optimal model based on input-output observations. A comparison with the cross-validation technique showed that the PMDL method is a useful alternative for model selection in time series applications. Both methods gave similar results in terms of model complexity, but the PMDL method was found to be two times faster to compute. This difference in computing speed is significant, since model selection is, in general, very time consuming. Also, the PMDL optimum structures were found to generalize adequately.

Acknowledgments

The authors would like to express special thanks to Dr. Jorma Rissanen for his valuable advice on the PMDL method. Also the authors wish to thank the reviewers for their valuable comments on the manuscript.

References

Akaike, H. 1977. On entropy maximization principle. In Applications of Statistics, P. R. Krishnaiah, ed., pp. 27-41. North-Holland, Amsterdam.
Box, G., and Jenkins, G. 1970. Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco.
Chaitin, G. 1966. On the length of programs for computing finite binary sequences. J. Assoc. Comp. Mach. 13, 547-569.
Clarke, B., and Barron, A. 1993. Jeffreys' prior is asymptotically least favorable under entropy risk. J. Stat. Planning Inference, in press.
Doob, J. 1953. Stochastic Processes. Wiley, New York.
Engle, R. 1982. Autoregressive conditional heteroscedasticity with estimates of the variance of U.K. inflation. Econometrica 50, 987-1008.
Kolmogorov, A. 1965. Three approaches to the quantitative definition of information. Problems Inform. Transmiss. 1, 4-7.
Lehtokangas, M., Saarinen, J., Huuhtanen, P., and Kaski, K. 1993a. Neural network prediction of non-linear time series using predictive MDL principle. In Proceedings of the IEEE Winter Workshop on Nonlinear Digital Signal Processing, Tampere, Finland, January 17-20, 7.2-2.1-7.2-2.6.
Lehtokangas, M., Saarinen, J., Huuhtanen, P., and Kaski, K. 1993b. Neural network modeling and prediction of multivariate time series using predictive MDL principle. In Proceedings of the International Conference on Artificial Neural Networks, ICANN-93, Amsterdam, The Netherlands, September 13-16, pp. 826-829.
Lehtokangas, M., Saarinen, J., Huuhtanen, P., and Kaski, K. 1995. Initializing weights of a multilayer perceptron network by using the orthogonal least squares algorithm. Neural Comp. 7(5), 982-999.
Moody, J., and Darken, C. 1988. Learning with localized receptive fields. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds., pp. 133-143.
Ozaki, 1978. Non-linear models for non-linear random vibrations. Tech. Rep. 92, Department of Mathematics, University of Manchester Institute of Science and Technology, UK.
Powell, M. 1987. Radial basis functions for multivariate interpolation. In IMA Conference on Algorithms for Approximation of Functions and Data, J. Mason and M. Cox, eds., pp. 143-167. Oxford University Press, Oxford.
Priestley, M. 1988. Nonlinear and Non-stationary Time Series Analysis. Academic Press, London.
Rissanen, J. 1978. Modelling by shortest data description. Automatica 14, 465-471.
Rissanen, J. 1983. A universal prior for integers and estimation by minimum description length. Ann. Statist. 11(2), 416-431.
Rissanen, J. 1984. Universal coding, information, prediction, and estimation. IEEE Transact. Inform. Theory IT-30(4), 629-636.
Rissanen, J. 1986. Stochastic complexity and modeling. Ann. Statist. 14(3), 1080-1100.
Rissanen, J. 1987. Stochastic complexity. J. Royal Statist. Soc. Ser. B 49(3), 223-239 and 252-265.
Rissanen, J. 1993. Fisher information and stochastic complexity. IEEE Transact. Inform. Theory, submitted.
Rissanen, J. 1994. Information theory and neural nets. In Mathematical Perspectives on Neural Networks, P. Smolensky, M. Mozer, and D. Rumelhart, eds. Lawrence Erlbaum, Hillsdale, NJ.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. Rumelhart, J. L. McClelland, and the PDP Research Group, eds., chap. 8. MIT Press, Cambridge, MA.
Solomonoff, R. 1964. A formal theory of inductive inference. Inform. Control, Part I, 7, 1-22; Part II, 7, 224-254.
Tong, H. 1983. Threshold Models in Non-linear Time Series Analysis. Springer-Verlag, New York.
Utans, J., and Moody, J. 1991. Selecting neural network architectures via the prediction risk: Application to corporate bond rating prediction. In Proceedings of the First International Conference on Artificial Intelligence Applications on Wall Street.
Volterra, V. 1959. Theory of Functionals and Integro-differential Equations. Dover, New York.
Wahba, G., and Wold, S. 1975. A completely automatic french curve: Fitting spline functions by cross-validation. Commun. Statist. 4(1), 1-17.
Weigend, A., and Gershenfeld, N. (eds.). 1994. Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, Reading, MA.
Weigend, A., and Nix, D. 1994. Predictions with confidence intervals (local error bars). In Proceedings of the International Conference on Neural Information Processing, ICONIP-94, Seoul, Korea, pp. 847-852.
Weigend, A., Huberman, B., and Rumelhart, D. 1990. Predicting the future: A connectionist approach. Int. J. Neural Syst. 1(3), 193-209.
Yule, G. 1927. On a method of investigating periodicities in disturbed series with special reference to Wolfer's sunspot numbers. Philos. Trans. R. Soc. A 226, 267-298.
Received April 12, 1994; accepted August 15, 1995.
Communicated by Richard Lippmann
Minimum Description Length, Regularization, and Multimodal Data

Richard Rohwer
John C. van der Rest
Department of Computer Science and Applied Mathematics, Aston University, Birmingham B4 7ET, UK
Relationships between clustering, description length, and regularization are pointed out, motivating the introduction of a cost function with a description length interpretation and the unusual and useful property of having its minimum approximated by the densest mode of a distribution. A simple inverse kinematics example is used to demonstrate that this property can be used to select and learn one branch of a multivalued mapping. This property is also used to develop a method for setting regularization parameters according to the scale on which structure is exhibited in the training data. The regularization technique is demonstrated on two real data sets, a classification problem and a regression problem.

1 Introduction
It is a common practice to train neural network models using a sum-of-squares cost function, corresponding to a gaussian noise model. One frequently useful property of this and many other cost functions is that they are minimized by the mean of the distribution of outputs for any given input. However, there are situations in which this property is undesirable. Here we introduce a cost function we call the naive description length, E(NDL), which is approximately minimized by the densest mode of this distribution. It can be regarded as an approximation to the amount of information in the data that the model neglects, or as an approximation to the negative of the number of data points fit well by the model. Two application areas for the mode-seeking property of this cost function are explored. The most obvious is the problem of finding a single branch of a multibranch function. This capability is demonstrated with a simple inverse kinematics problem. The second application area is the determination of appropriate regularization parameters for good generalization. Here the problem often amounts to determining the input scale on which the model should be allowed to vary. It is argued that with E(NDL), the L-curve method accomplishes this by clearly delineating the boundaries between regularization

Neural Computation 8, 595-609 (1996)
@ 1996 Massachusetts Institute of Technology
parameter regions in which the model fits clusters of different scales. This is demonstrated numerically using an artificial example to highlight the important properties of the method, and two examples with real-world data sets, a classification problem and a regression problem.

2 Description Length, Clusters, and Regularization

Any scheme for producing a short description of a data set without losing much information, i.e., data compression, must make use of structure present in the data. Typically the available structure takes the form of clustering on one or more scales. Therefore minimization of a description length can be expected to yield information about clusters. If the description takes the form of a neural network model supplying a single output y(x) for each input x, together with a noise model of the standard form (Bishop 1995)

P(Y | y) = (1/Z) e^{−E(y,Y)},

with a cost function E(y, Y) monotonic in |y − Y|, then P(Y | y) is unimodal. A minimal description in terms of such models can do no better than to locate the largest mode of a distribution, which is what E(NDL) approximately accomplishes. More complex neural network models can be devised to avoid this restriction (Bishop 1994). For optimal data compression, the objective is to minimize the total description length, accounting for both the errors and the model. However, an interesting perspective on regularization can be obtained by considering just the error description length as the model complexity (or a regularization parameter) is varied. Consider the common case of a data distribution for which an output cluster location varies gradually with the input. A loosely regulated model will be flexible on the scale of the noise, and therefore fit the noise tightly, whereas a strongly regulated model will not. The trick, of course, is to regulate just enough to lose the noise but not the gradual variation of the cluster.
The noise can be seen to vary on an arbitrarily small scale if an arbitrarily large amount of data is available, whereas the systematic trend has a data-independent scale. Therefore, given enough data for the clusters to be discernible, these scales will differ significantly. Consider the error description length as regularization is increased from an overfitting regime to an underfitting regime. As soon as the model becomes inflexible on the scale of the noise (as sampled), there is little structure for it to fit on any scale below that of the systematic variation. Therefore, at this point the error description length will increase rapidly from its low value for fitting the noise to a high value for neglecting the noise. This identifies the desired regularization parameter: just that at which the model abandons the noise. This is essentially the "L-curve" method of Hansen and O'Leary (1993), using an improved cost function. These authors plot data cost E_D(α) against regularization cost αE_W(α), where E_D(α) = E_D[w*(α)] and E_W(α) = E_W[w*(α)], with

w*(α) = argmin_w [E_D(w) + αE_W(w)].

For the problem of determining the input of a known noisy linear filter from its output, they show under reasonable assumptions that this plot is approximately "L"-shaped: there is an approximate threshold below which E_D can be reduced rapidly at little expense in αE_W, and above which E_W increases rapidly with little improvement in E_D. The above argument leads one to expect a similar cliff in a plot of E_D(α) vs. α. Its location is indicated by a positive peak in the second derivative E_D''(α). As a practical matter, a large amount of computation would be needed if training were restarted from random weights for every value of α examined, but this is neither necessary nor desirable. It is much faster to initialize training for one α value from the solution for a neighboring value (starting from large α, where solutions are simple). This procedure also ensures that the α-dependence of just one solution is determined. Another method for setting regularization parameters (without recourse to test data) is the Bayesian method, starting from a broad hyperprior over α (MacKay 1992). This is a computationally intensive and, in some approximations, theoretically controversial technique (Wolpert 1993; Zhu and Rohwer 1995), but it gives good results at least in some circumstances (Thodberg 1994). It may be possible to express the L-curve method in this language, but one important difference at present is the form of the prior knowledge assumed. As practiced, the Bayesian approach uses a prior independent of the data (and excuses are proffered if and when it is expedient to use the data to influence the prior). The L-curve method assumes something about the data: that there is enough of it to admit interpretations in terms of structure on different scales.

3 The "Naive Description Length" (NDL)
The intuitions developed in the preceding sections about mode-seeking cost functions will be developed and confirmed numerically using a specific cost function of this type. We call this the "naive description length" (NDL) because it can be interpreted as the length of a simple but suboptimal representation of the residual errors needed to reconstruct the training data targets from the network outputs. Suppose each target needs to be reconstructed only to within accuracy ε in each component, the ith component of the pth example being Y_ip. This target component can be expressed in terms of network output y_ip as y_ip − e_ip, where e_ip = y_ip − Y_ip. Like Y_ip, the "error" e_ip need only be recorded to within accuracy ε in each component. That is, with ⌈x⌉ referring to the nearest integer greater than or equal to x, we have ê_ip = ⌈|e_ip|/ε⌉ε, so e_ip can be represented by the integer ⌈|e_ip|/ε⌉ at a cost E_ip of about log₂(|e_ip|/ε) bits, plus a sign bit. To represent a set of numbers of different lengths, further information is needed to separate the different numbers. For example, each number
could be preceded by a field of log₂{log₂[max_ip(e_ip)]} bits encoding the length of the number to follow.¹ Ignoring the fixed per-number costs of sign bits and separator fields, and the fixed costs of encoding ε and generating y_ip, the cost of encoding the data is

E = Σ_{i,p} log₂⌈|e_ip|/ε⌉.    (3.1)

It is easy to see that this expression will be minimized to 0 if the data lie in clusters of size ε, and for each example p, the network output y_ip lies within ε of its target Y_ip. To the extent that errors greater than ε are inevitable, they will make roughly comparable contributions because the logarithm deemphasizes their differences. Therefore this expression is minimized primarily by placing the interpolant within ε of as many points as possible, in as many output coordinates as possible, regardless of how well the remaining points are approximated. If a close match in all output coordinates is required, then the variant

E = Σ_p log₂⌈|e_p|/ε⌉    (3.2)

can be used, where |e_p| is a norm appropriate to the clustering problem. This departs somewhat from the description length interpretation, in that it ignores the information needed to record the direction of the error vectors. Aside from a constant factor introduced by using natural logarithms instead of base 2 logarithms, (3.2) can be represented as the n → ∞ limit of the differentiable error measure

E = Σ_p (1/n) ln[1 + (|e_p|/ε)^n].    (3.3)

For finite n this replaces the max function with a differentiable one. Using the p-norm with parameter β for |e_p| gives
which will be taken as a definition of the naive description length. The n = 2 and ε = 0.05 case will be used in what follows unless stated otherwise. The key property of E(NDL) is illustrated in Figure 1. On a data set of 2 points with the same input and different outputs, one at 0.0 and the

¹A more sophisticated method such as Huffman coding could be used to eliminate the need for separators and provide a more optimal assignment of error values to integers, but this would defeat the objective of obtaining a simple method leading to a simple cost function. In particular, this method has the advantage that it does not require detailed knowledge of the probability distribution of the errors.
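The two-target illustration of Figure 1 is easy to reproduce numerically. A minimal sketch (assuming the differentiable form (3.3) with n = 2 and ε = 0.05, and a grid search over candidate outputs standing in for gradient training of a network):

```python
import math

targets = [0.0, 1.0]  # two patterns with the same input and different outputs

def ndl_cost(y, eps=0.05, n=2):
    # Differentiable NDL-style cost: (1/n) * ln(1 + (|e|/eps)^n) per pattern
    return sum(math.log(1.0 + (abs(y - t) / eps) ** n) / n for t in targets)

def mse_cost(y):
    return sum((y - t) ** 2 for t in targets)

grid = [i / 1000.0 for i in range(1001)]  # candidate network outputs in [0, 1]
best_mse = min(grid, key=mse_cost)
best_ndl = min(grid, key=ndl_cost)

print(best_mse)  # 0.5: the sum-of-squares minimum sits at the mean
print(best_ndl)  # close to one of the targets: the NDL minimum picks a mode
```

Raising ε toward 1.0, the scale on which the two outputs form one cluster, makes the two NDL minima merge, as described in the text.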
(Figure 1 panels: naive description length error with ε = 0.01, showing the total cost and per-pattern contributions for outputs between 0 and 1.)
Figure 1: The sum-of-squares cost (a) and the NDL cost (b) on a data set of 2 patterns with the same input and different outputs, one at 0.0 and the other at 1.0. The dotted lines show the contribution of each pattern.
other at 1.0, a sum-of-squares cost function has a minimum at the mean, 0.5, whereas E(NDL) has two equal minima, one near 0.0 and the other near 1.0. These minima gradually merge as ε is increased toward 1.0, the scale on which the two outputs would be judged to be part of one cluster. Crudely, E(NDL) contains roughly the same nonzero contribution for every pattern the model does not fit, and no contribution for patterns that do fit. This makes it roughly proportional to the number of ill-fit patterns. By subtracting this from the total number of patterns P and rescaling, an estimate of the number of well-fit patterns

E(ε) = P − E(NDL)/ln(1/ε)    (3.5)

is obtained, valid for ε → 0. This expression can also be understood as a sum containing a logarithmic divergence for each well-fit point (e_p ≈ 0), and smaller numbers for other points. The interpretation as the number of well-fit points provides another way to understand the properties of the L-curve with E(NDL). When a model loses its flexibility on the scale of the noise, it will suddenly fail to fit most points precisely.

3.1 The Noise Model for E(NDL). The mode-seeking properties of E(NDL)
can also be understood in terms of its noise model,

P(NDL)(Y | y) = (1/Z(NDL)) e^{−E(NDL)(y,Y)},
Figure 2: (a) P(NDL) for ε = 0.1, n = 2, and β = 10, together with P(MSE) for σ = 0.05. The central peaks are similar, but the tails decay faster in the gaussian. (b) Cost functions for the noise models in (a) (simply their negated logarithms). Note the different horizontal scale.
where |y − Y|^β = Σ_i |y_i − Y_i|^β. Figure 2 compares the NDL noise model and cost function with a similar gaussian noise model and corresponding sum-of-squares cost function E(MSE). Although not obvious to the eye, the slow increase of E(NDL) with pattern mismatch is reflected in the slowly decaying tail of P(NDL). This in turn makes P(NDL) assign similar probabilities to a relatively broad range of outliers, which accounts for its relative insensitivity to their magnitude. The normalization factor Z(NDL) does not exist if β ≤ 1 and the domain of Y extends to infinity. It can be expressed in closed form for n = 1, n → ∞, and n = β, and otherwise can be obtained numerically. The numerical integration is not straightforward when β ≈ 1. These are the most interesting cases, when the tails of P(NDL) are the most substantial and its distinguishing properties therefore the most pronounced. Details on numerical approximations to Z(NDL), its relationship to the generalized Beta function, and its approximation by a hypergeometric series can be found in Rohwer and van der Rest (1994). Some special cases of P(NDL) can be identified. For n = 2 and β = 1, P(NDL) is (aside from the divergent normalizing factor) the multiquadric, which has been used in basis function interpolation methods (Hardy 1990). For β = n = 2 it is the Cauchy distribution, which is favored in some variations of the simulated annealing algorithm because its tails decay relatively slowly (Zhu 1986), the very property that is useful here for other reasons.
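The heavy tails are simple to verify. A sketch (using the β = n = 2 case noted above, for which the unnormalized density is Cauchy-like; the width values are illustrative and not those of Figure 2):

```python
import math

def p_ndl(e, eps=0.1):
    # Unnormalized Cauchy-like NDL noise density (beta = n = 2 case)
    return 1.0 / (1.0 + (e / eps) ** 2)

def p_gauss(e, sigma=0.1):
    # Unnormalized gaussian density of comparable width
    return math.exp(-0.5 * (e / sigma) ** 2)

# Probability assigned to a large outlier relative to a moderate one
ratio_ndl = p_ndl(1.0) / p_ndl(0.5)
ratio_gauss = p_gauss(1.0) / p_gauss(0.5)

print(ratio_ndl)    # about 0.26: outliers of different sizes look similar
print(ratio_gauss)  # vanishingly small: the gaussian is acutely size-sensitive
```

The flat ratio under the Cauchy-like density is exactly the insensitivity to outlier magnitude described in the text.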
Figure 3: A robot arm in the elbow up, elbow down, and mean positions.

4 Numerical Demonstrations
Four examples are given to demonstrate that training neural network models with E(NDL) works as anticipated. The first demonstrates its mode-seeking property and the final 3 illustrate its use in selecting regularization parameters.

4.1 Inverse Kinematics. A classic situation in which a mode is preferable to an average is the inverse kinematics of robot arms. Figure 3 illustrates a simple example used by Bishop (1994) of a 2-joint robot arm that can reach a point of the plane in two different ways, "elbow up" and "elbow down." The forward kinematics from joint angles θ1 and θ2 to planar position (x1, x2) is given by
x1 = L1 cos(θ1) − L2 cos(θ1 + θ2)
x2 = L1 sin(θ1) − L2 sin(θ1 + θ2)    (4.1)
in terms of the link lengths L1 and L2, taken to be 0.8 and 0.2, respectively. Training and test sets of 1000 points each were generated by randomly selecting angles in the range (π/2, 3π/2) for θ1 and (0.3, 1.2) for θ2. A multilayer perceptron with one layer of 100 hidden nodes was trained to estimate the angles as a function of the end effector position, using 6000 iterations of an adaptive step size algorithm (Silva and Almeida 1990). The test set results after training with E(MSE) and E(NDL) are shown in Figure 4. Each line segment connects a network output, fed through the forward kinematics, with the corresponding correct position. The region where two solutions exist shows up clearly in the sum-of-squares case as a region of large errors. The network trained with E(NDL) converges
Figure 4: (a) The positioning errors of a network trained on E(NDL). Each line segment extends from a desired position to the position produced by the network. (b) The positioning errors of a network trained on E(MSE).
to just one branch of the solution, except in a small region where it switches branches. It is forced to switch branches somewhere because regions exist that can be reached only on one branch or the other. Good solutions can be produced throughout one branch by removing the data from the region accessible only by the other branch (but keeping the data where multiple solutions are possible).

4.2 Choosing a Regularization Parameter: An Idealized Example. An artificial example clearly demonstrates that inspection of the curve E_D^(NDL)(α) highlights the plausible interpolations of a data set D, while pointing to the appropriate α values for each possibility. Consider a data set consisting of 6 evenly spaced points at y = 0.2, 3 more interspersed at y = 0.3, and 2 at y = 0.9, as indicated by the 11 rows of dots orthogonal to the xy plane in Figure 5a and b. "Smooth interpolation" could be given 3 different reasonable meanings for these data, depending on whether the data were thought to be noise-free, to have 2 outliers at y = 0.9, or to have these and a further 3 outliers at y = 0.3. An oversized MLP² was trained on these data using a sum-of-squares regularization term E_W and a wide range of regularization parameter values α in the total training error E_D + αE_W. Separate experiments were done using E(NDL) and E(MSE) for E_D. Figure 5 shows the variation of the interpolant curve with α, aligned with plots of³ E_D''(α) and E_D(α). With E(NDL), the interpolants split cleanly into the three types just described, and the well-defined positive peaks of E_D''(α) designate the α values at the upper end of the range for each region. The corresponding plot using E(MSE) shows none of this structure. Although the plots are not shown, these experiments (and those reported in the next section) were repeated using αE_W(α) in place of α. This dulled the desired effect in every instance.

4.3 Choosing a Regularization Parameter: Examples Using Real Data.
Figures 6 and 7 use natural data to demonstrate (with some provisos) that with E(NDL), the L-curve provides good estimates of the regularization parameter values that give the best generalization. Results are given for a classification problem and a regression problem. Two sets of vertically aligned plots are shown for each data set, one in the left column based on training with E(NDL), and one on the right based on a more conventional cost function, cross-entropy E(Xent) for the classification problem, and mean-squared error E(MSE) for the regression problem. In each case, E_D(α) and E_D''(α) are plotted, together with a measure of the test set performance as a function of α, classification accuracy for the
²Five gaussian coarse coded inputs, 110 hidden units, 1 sigmoidal output unit, strictly layered.
³In these examples it is convenient to plot α on a logarithmic scale. To make the second derivatives highlight the regions of highest curvature as seen in these plots, E_D''(α) is defined as d²E_D(e^γ)/dγ² evaluated at γ = ln α, instead of d²E_D(α)/dα².
Figure 5: (a) Interpolations learned using E(NDL), as α is varied. The 11 training data points are indicated by the 11 rows of dots orthogonal to the xy plane. The interpolation curves are illustrated between 0 and 1 on the x-axis. E_D''(α) is plotted in the xy plane at x = 1.1 and E_D(α) is plotted in this plane at x = 1.2. Shading emphasizes the alignment of the peaks of E_D''(α) with the upper end of the range of α values that give each of the 3 styles of interpolant.
classification problem, and mean-squared test set error for the regression problem. A spherical radial basis function network with one center at each training data point was used in these experiments, with a sum-of-squares regularizer. Sigmoidal nonlinearities were placed on the outputs for the classification problem, but not for the regression problem. Figure 6 gives results for the classification problem, the "vowel" data from Peterson and Barney (1952), also used by Huang and Lippmann (1988). Note that the positive peak of E_D^(NDL)''(α) indicates the optimal α value for test set classification accuracy, and accuracy drops abruptly beyond this point. This optimal value (81.3%) is comparable to the best results in the literature. The peak of E_D^(Xent)''(α) gives a fair but inferior result. Figure 7 shows the results for the regression problem, the "Auto Price"
Figure 5: (b) As (a), using E(MSE) instead of E(NDL).

data set used by Kibler et al. (1988) and Rasmussen (1995). Training with E(NDL) yields the E_D(α) and E_D''(α) curves shown. The first two peaks of E_D^(NDL)''(α) indicate good test optima, which compare well with the best results in the literature. The third peak can be viewed suspiciously because it corresponds to a high E_D^(NDL)(α) value. The existence of multiple peaks suggests multiscale structure in this data set that might profitably be explored with a more complex model. These results were confirmed by repeated experiments with different initial random weights, and further supportive results have been obtained on other data sets. Further work is needed to determine the limitations of the method. For instance, problems may arise if P(NDL) is a poor noise model for some problems, or if a data set does not have structure on any particular length scale.
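The α-sweep behind these plots is cheap to prototype. A sketch (substituting a hypothetical one-parameter ridge regressor with a closed-form w*(α) for the networks used in these experiments, so no warm-started training is needed; the data values are invented): sweep α on a log grid, record E_D(α) = E_D[w*(α)], and find the positive peak of its second difference with respect to log α.

```python
import math

# Invented 1-D data: y roughly proportional to x
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [0.05, 0.45, 1.05, 1.45, 2.05]

def w_star(alpha):
    # Closed-form minimizer of sum_p (y_p - w x_p)^2 + alpha w^2
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

def e_data(alpha):
    w = w_star(alpha)
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

# Sweep alpha on a logarithmic grid, as in the experiments above
log10_alphas = [-6.0 + 0.1 * i for i in range(121)]
ed = [e_data(10.0 ** g) for g in log10_alphas]

# Second differences of E_D with respect to log(alpha); the positive peak
# marks the cliff where the model abandons structure it was fitting
second = [ed[i - 1] - 2.0 * ed[i] + ed[i + 1] for i in range(1, len(ed) - 1)]
peak = max(range(len(second)), key=lambda i: second[i])
print(log10_alphas[peak + 1])  # candidate log10(alpha) at the curvature peak
```

With a trained network in place of the closed-form solution, warm-starting each α value from its neighbor (beginning at large α) keeps the sweep tractable, as described in Section 2.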
5 Conclusions
A qualitative argument relating cluster detection to minimizing description length motivates the introduction of a class of cost functions with two useful and interesting properties. The first is that the minimum approximates the largest mode of the data rather than the mean. A simple multivalued inverse kinematics problem demonstrates one way to make use of this property. The second property is that it enhances the effectiveness of an L-curve method for exploring the structure of a data set and determining suitable regularization parameters. This was demonstrated in a classification problem and a regression problem. Therefore this is a new approach showing considerable promise. Further work will be needed to optimize the method, fully develop its theory, and determine the limits of its applicability.

Figure 6: Test set error and L-curve features for the Vowel Classification problem. The top left plot is the L-curve, E_D^(NDL)(α) vs. log α. The bottom left plot shows its numerically determined second derivative E_D^(NDL)''(α), the positive peak of which helps to locate the bottom of the "L" cliff in E_D^(NDL)(α). This point locates the largest α value of the region giving the best test set classification performance, shown in the lower two figures by the dotted vertical line. The same procedure is illustrated in the three plots on the right using E(Xent). The position of the bottom of the cliff is less well-defined, and does not correlate well with the test set performance in any case.

Figure 7: As Figure 6 for the Auto Price regression problem, using E(MSE) for the plots in the right column. (Panels: network error, test set MSE error, and second derivative, each plotted against log α.)
Acknowledgments
We thank Chris Williams for bringing the Hansen and O'Leary (1993) reference to our attention, and for assistance in selecting and interpreting some of the data sets.

References

Bishop, C. M. 1994. Mixture density networks. Tech. Rep. NCRG/4288, Department of Computer Science and Applied Mathematics, Aston University, Birmingham, UK.
Bishop, C. M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Hansen, P. C., and O'Leary, D. P. 1993. The use of the L-curve in the regularization of discrete ill-posed problems. SIAM J. Sci. Comput. 14, 1487-1503.
Hardy, R. L. 1990. Theory and applications of the multiquadric-biharmonic method. Comput. Math. Applic. 19, 243-263.
Huang, W. Y., and Lippmann, R. P. 1988. Neural net and traditional classifiers. In Advances in Neural Information Processing Systems, D. Z. Anderson, ed., pp. 387-396. American Institute of Physics, New York.
Kibler, D., Aha, D. W., and Albert, M. K. 1988. Instance-based prediction of real-valued attributes. Tech. Rep. 88-07, ICS, University of California, Irvine.
MacKay, D. 1992. Bayesian interpolation. Neural Comp. 4, 415-447.
Peterson, G. E., and Barney, H. L. 1952. Control methods used in a study of the vowels. J. Acoust. Soc. Am. 24, 175-184.
Rasmussen, C. E. 1995. A practical Monte Carlo implementation of Bayesian learning. Tech. Rep., Department of Computer Science, University of Toronto.
Rohwer, R., and van der Rest, J. 1994. Minimum description len…
Zhu, H., and Rohwer, R. 1995. Information geometric measurement of generalisation. Tech. Rep. NCRG/4350, Department of Computer Science, Aston University, Birmingham, UK.
Received June 29, 1994; accepted September 6, 1995.
Communicated by Bartlett Mel and Laurence Abbott
VC Dimension of an Integrate-and-Fire Neuron Model
Anthony M. Zador
Salk Institute, 10010 N. Torrey Pines Rd., La Jolla, CA 92037 USA
Barak A. Pearlmutter Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540 USA
We compute the VC dimension of a leaky integrate-and-fire neuron model. The VC dimension quantifies the ability of a function class to partition an input pattern space, and can be considered a measure of computational capacity. In this case, the function class is the class of integrate-and-fire models generated by varying the integration time constant τ and the threshold θ, the input space they partition is the space of continuous-time signals, and the binary partition is specified by whether or not the model reaches threshold at some specified time. We show that the VC dimension diverges only logarithmically with the input signal bandwidth N. We also extend this approach to arbitrary passive dendritic trees. The main contributions of this work are (1) it offers a novel treatment of the computational capacity of this class of dynamic system; and (2) it provides a framework for analyzing the computational capabilities of the dynamic systems defined by networks of spiking neurons.

1 Introduction
A central concern in computational neuroscience is understanding the functional significance of single neuron complexity. On the one hand, the success of artificial neural network models, which begin with the notion that brain-like computation can be well described by large interconnected networks of very simple elements, argues that the computational capabilities of the individual elements can be neglected. On the other hand, a vast body of research (see, e.g., McKenna et al. 1992) supports the notion that single neurons are complex dynamical systems, able to perform a wide range of interesting computations. Brown et al. (1992) argued for a synthesis of these positions: if individual neurons have computational significance, then perhaps each should be considered a micronet in its own right. To assess the computational significance of single neurons, it would be useful to have a quantitative measure of computational capacity. The Vapnik-Chervonenkis dimension (1971) can be considered such a measure

Neural Computation 8, 611-624 (1996) @ 1996 Massachusetts Institute of Technology
for static neural networks (or, more generally, for any boolean function class). It is a measure of the richness of the mappings possible within a class of functions, and typically increases as the size of the network (i.e., number of free parameters) increases. Such measures have not been applied to models of real neurons, in part because real neurons are dynamic systems. There is as yet no satisfactory general theory of computation in dynamic systems. As a step in that direction, we have extended the notion of the VC dimension to dynamic systems. We consider the class of noiseless leaky integrate-and-fire threshold models with time constant τ and threshold θ driven by continuous-time inputs; we then extend our analysis to noisy inputs. These models have been developed as simplified descriptions of the more complex dynamics of real neurons. We define the VC dimension in terms of the ability of this class to assign an arbitrary boolean "label" to each input signal; the largest number of signals to which every possible labeling can be assigned is its VC dimension. We show that the VC dimension diverges logarithmically with the input signal bandwidth N.
2 Review of VC Dimension
The VC dimension (Vapnik and Chervonenkis 1971) is a measure of the richness of a class of boolean functions. It gives an upper bound on the number of exemplars required to guarantee that a set of parameters fit to data will provide a good fit for new data (Blumer et al. 1989). It has been applied in the neural network literature to give a measure of the number of patterns needed to train a network of a given size. Here we present a brief overview of the VC dimension in the context of neural networks (see Abu-Mostafa 1989 for an introduction). Let F be a class of functions from ℝ^N to {0,1}, and let f_w ∈ F be some member of that class. For example, F could be the class of all 3-layer feedforward threshold networks with N inputs, 12 hidden units, and one output, parameterized by c = 12N + 25 weights, and f_w would be some particular choice of the c-dimensional weight vector w. For every set I = (I^(1), ..., I^(M)) of inputs, any choice of w = w' specifies an M-digit binary string Y_{w',I} = f_{w'}(I) = f_{w'}(I^(1)), ..., f_{w'}(I^(M)) in which the mth digit corresponds to the output of f_{w'} on the mth input I^(m); varying w' will in general produce a different binary string. Y_{w,I}, which is actually a function of the inputs, Y_w(I^(1), ..., I^(M)), can be thought of as the truth table for a particular choice of w = w' on the inputs. Now in principle Y_w can take on 2^M possible values; but for large M there may not be choices of w that instantiate every possible binary number. When there exist 2^M values of w such that Y_w takes on all the 2^M possible values, the function class F is said to shatter the set of inputs I. This leads to the VC dimension d_VC of F: the VC dimension is
the largest number M for which there exists a set of inputs (I^(1), ..., I^(M)) which is shattered by F. In the context of learning theory the VC dimension is useful because of a relation between the number of labeled exemplars in a training set and the probability of generating the correct output on a new exemplar (Vapnik and Chervonenkis 1971). If the number of exemplars is greater than the VC dimension, then the probability of producing an incorrect response decreases exponentially with the number of exemplars. Much work has gone into computing the VC dimension of certain classes of neural networks (Baum and Haussler 1989; Maass 1995; Anthony 1994).
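The shattering definition can be checked by brute force for very small classes. A sketch (using a hypothetical class of one-dimensional threshold functions f_w(x) = 1 iff x ≥ w, not one of the network classes discussed here): a single point is shattered, but no pair of points can receive the labeling (1, 0), so this class has VC dimension 1.

```python
def shatters(points, weights):
    # The class realizes every labeling of `points` iff the set of distinct
    # truth tables over the candidate weights has size 2^M
    realized = {tuple(int(x >= w) for x in points) for w in weights}
    return len(realized) == 2 ** len(points)

weights = [i / 10.0 for i in range(-20, 21)]  # candidate thresholds in [-2, 2]

print(shatters([0.5], weights))       # True: one point is shattered
print(shatters([0.3, 0.7], weights))  # False: (1, 0) needs w <= 0.3 and w > 0.7
```

The same exhaustive test applies in principle to any parameterized binary classifier, although for the dynamic models below the input space is a space of signals rather than points.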
3 An Integrate-and-Fire Classifier
The classifier we consider is a leaky integrate-and-fire unit with two free parameters: a single time constant τ and a threshold θ, as shown in Figure 1. The inputs are continuous time signals, and the output is a binary variable determined by whether the voltage exceeds the threshold at any time t. The voltage V(t) of the unit at time t is given by the convolution of the input I(t) with a single exponential kernel w(t) = e^{−t/τ},

V(t) = ∫₀ᵗ I(s) w(t − s) ds.
The convolution kernel has only a single exponential; this corresponds to the output of a single RC integrator. We now define V as the voltage at the end of the interval [0, t_f],

V = V(t_f).

The output Y of the unit over this interval is a binary variable, obtained by applying a threshold θ to V,

Y = 1 if V ≥ θ, and Y = 0 otherwise.
Notice that the voltage V(t), t < t_f, does not involve thresholding; the threshold is imposed only at t = t_f, so the present model is an integrate-and-fire model without reset. Only when V(t) remains subthreshold over the interval does it give the same output as standard integrate-and-fire models, which reset after each threshold crossing. If we would like our results to carry over to resetting models, we must be careful to consider inputs that do not cause V(t) to exceed threshold prematurely.
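The discretized model of equations 3.1 and 3.2 can be sketched in a few lines. The input, time step, and parameter values below are illustrative choices, not taken from the paper:

```python
import math

def lif_output(inp, dt, tau, theta):
    """Leaky integrate-and-fire without reset: convolve the sampled input with an
    exponential kernel and threshold the voltage at the end of the interval."""
    t_f = (len(inp) - 1) * dt
    V = sum(inp[j] * math.exp(-(t_f - j * dt) / tau) * dt for j in range(len(inp)))
    return int(V >= theta)

# A brief pulse followed by silence: a short time constant has forgotten the
# pulse by t_f, while a long one still retains it.
inp = [1.0, 1.0] + [0.0] * 8
print(lif_output(inp, dt=0.1, tau=0.1, theta=0.05))   # leaky: no spike (0)
print(lif_output(inp, dt=0.1, tau=10.0, theta=0.05))  # integrates: spike (1)
```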
Anthony M. Zador and Barak A. Pearlmutter
Figure 1: The model. An input signal I(t) is convolved with a kernel and passed through a threshold to produce a binary output. The output of three distinct kernels, differing only by the time constant τ, to the same input is shown. The input has been constructed so that for low (τ = 1/3) and high (τ = 1) values, V(t) remains below the threshold θ. For an intermediate value (τ = 1/2), V(t) exceeds θ at the asterisk and emits a spike. Note that the fluctuations around the threshold are very small, indicating a high sensitivity of the system to noise.
4 Convolution as a Product
We now consider the temporal discretization into evenly spaced intervals t_0, ..., t_{N−1},

V = Σ_{i=0}^{N−1} I_i w_i    (4.1)

where I_i = I(t_{N−1} − t_i) and N is the signal bandwidth, with

w_i = e^(−t_i/τ)    (4.2)
This is simply the discrete convolution of the input I_i with a kernel w_i. Note that this equation can be interpreted as a one-output perceptron with an N-dimensional input vector I and a weight vector w. We observe that due to the physical constraints of positive time constants τ > 0, and
t_i ≥ 0, we find a constraint on w_i,

0 < w_i ≤ 1    (4.3)

In equation 4.1 we used the conventional representation of the discrete convolution as a sum. In assessing the VC dimension it will be convenient to work with an equivalent representation as a product. We observe that the convolution equation 4.1 is polynomial in w_1, since

w_i = e^(−t_i/τ) = (e^(−Δt/τ))^i = (w_1)^i    (4.4)

where Δt = t_{i+1} − t_i. We can therefore write

V = Σ_{i=0}^{N−1} I_i (w_1)^i = I_{N−1} Π_i (w_1 − r_i)    (4.5)
where the r_i are the roots of the polynomial and N = t_f/Δt + 1. Equation 4.5 expresses the output V of the integrate-and-fire unit as a polynomial in the weight kernel w_i, specified by the parameter w_1. The output is a function of τ, since the weight kernel is related to τ by equation 4.4. The coefficients I_i of the sum, and therefore the locations r_i of the roots, are determined by the inputs. Different integrate-and-fire units may assign different outputs to a given input as the parameter w_1 is varied. The advantage of the product representation is that it allows us to see explicitly the critical values of w_1 at which the output in response to a given input changes. Specifically, the critical values are the roots r_i of the polynomial. Since the roots are determined by the input signal itself, the critical values of w_1 depend on the input itself, and will in general be different for different inputs.

4.1 Constructing a Shatterable Set of Inputs. The key construction of this section (illustrated in Fig. 2) is a procedure for "inverting" the integrate-and-fire neuron, by constructing an input signal I(t) given a list of w_1 values and corresponding responses (spike vs. no spike).¹ For now we consider only the zero-threshold case.

¹If a finer temporal discretization is desired, adding extraneous roots gives the input waveform more sample points without introducing extra sign changes. If explicitly continuous-time inputs are to be constructed, one can consider the Laplace transform of the input, and note that the desired outputs correspond to simple sign constraints in the Laplace domain. A function in the Laplace domain that meets the constraints can be concocted, and an inverse Laplace transform gives the corresponding time-domain input. There is a great deal of freedom in this concocting, but one natural class of inputs resulting from the inverse Laplace transform is a series of modulated delta pulses. It is interesting to note that the inputs neurons typically receive consist of a series of action potentials.

Figure 2: Diagrammatic representation of the construction of an input that results in output spikes at particular desired neuronal time constants τ_i. Given a set of time constants τ_i and associated binary desired outputs, a single temporal input is constructed that has the property that, when the neuron's time constant is set to τ_i, the associated desired output is produced. The construction proceeds in stages: the time constants are passed through a function, the transitions in the desired outputs are marked and arbitrary points in the corresponding intervals are chosen, a polynomial with these points as roots is constructed, and the coefficients of this polynomial form the desired temporal input. To construct a set of M shatterable inputs, this construction is used 2^M times.

Before proceeding, let us specify the elements of the construction. We will form a set of input vectors I^(1), ..., I^(M). Each N-dimensional input vector is obtained by sampling a continuous waveform I(t) at N uniformly spaced points. For any given value of w_1, equations 3.2 and 4.1 determine the binary value of the output Y^(m) in response to input I^(m). Thus each value of w_1 specifies an M-digit binary number, in which the mth digit is the output Y^(m) in response to input I^(m). We call Y^(m) the label associated to the input I^(m) by a given value of w_1, and Y is the M-digit label associated by a given value of w_1 to the set of M inputs. There are 2^M possible such labels associated with any set of M inputs. Recall that if a set of 2^M values of w_1 can be specified, such that this set associates all possible labels Y to the input set, then this set is said to shatter the inputs. The VC dimension is the largest value of M for which a shattering set of w_1's can be found. Our task is therefore to construct a set of M inputs and specify a corresponding set of 2^M values of w_1 such that the input set is shattered. We begin by considering how the labeling of a given input varies with w_1. That is, what is the mth digit, considered as a function of w_1, of the label Y associated with the input I^(m)? Using the product representation of the convolution from the previous section, we observe that the label changes whenever w_1 passes through a root. Within the interval between two roots, r_i < w_1 < r_{i+1}, the label remains unchanged. We can therefore conveniently manipulate the labeling associated with a given input by judicious placement of the roots. In fact, once the roots are specified, the
input is obtained simply by multiplying out the product in equation 4.5 to obtain the coefficients I_i. Now we turn to the M-digit label Y associated with a specified value of w_1. For this we hold w_1 fixed, and consider the label associated with each input I^(1), ..., I^(M) in turn; these are the digits of Y. But since we have already shown how to obtain the desired label for any particular input, by placing the roots appropriately, obtaining the desired label Y for a given w_1 merely requires choosing the roots associated with each input in turn. Thus we have a procedure for constructing an input set that associates a specified label Y with the input set for a particular value of w_1.

4.2 VC Dimension Depends on Signal Bandwidth. So far we have shown how to construct an input set labeled by a specified Y for a given value of w_1. The final step requires constructing an input set that is shatterable, a set for which Y assumes all 2^M possible values, at 2^M values of w_1, 0 < w_1 < 1. That is, we must partition the w_1-axis into 2^M regions. The boundaries between the regions are determined by the roots: the presence of a root at some w_1 for the mth input means that the mth digit of Y changes at that value of w_1. The number of digit changes is NM: N roots/input times M inputs. Since Y is an M-digit binary string that we require to assume all 2^M possible values, we can regard this as counting in binary. Now counting from 0 to 2^M − 1 in standard binary involves roughly 2^(M+1) digit changes. For example, the transition from 0111_2 = 7 to 1000_2 = 8 involves 4 digit changes. To make best use of the NM roots, we therefore adopt a different counting scheme, a Gray's code,² so that only 2^M − 1 digit changes are required. Figure 3 shows how to construct a shatterable set of M = 2 inputs using this scheme. Here the requisite bandwidth is N = 2. The roots of the first input I^(1) are placed at r^(1) = (1/8, 3/8, 6/8). Expanding as in equation 4.5 gives the actual sampled values of I^(1).
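The "inversion" described above can be sketched numerically: choose roots where the label should flip, multiply out the polynomial to obtain the sampled input values, and check the resulting labels. The root locations and the test values of w_1 below are hypothetical choices for illustration:

```python
import numpy as np

# Place roots in (0, 1) where the desired label should flip as w1 increases,
# then expand prod_i (w1 - r_i) to read off the sampled input values I_i.
roots = [0.125, 0.375, 0.75]          # flip points (hypothetical)
coeffs = np.poly(roots)                # polynomial coefficients, highest degree first
inp = coeffs[::-1]                     # I_i = coefficient of w1**i

def label(w1):
    """Binary output as a function of the kernel parameter w1."""
    return int(np.polyval(coeffs, w1) > 0)

# One test value of w1 inside each of the four regions cut out by the roots.
print([label(w) for w in (0.05, 0.25, 0.5, 0.9)])   # -> [0, 1, 0, 1]

# With all roots positive, the sampled input values alternate in sign,
# the observation used in Section 5.1.
print(all(inp[i] * inp[i + 1] < 0 for i in range(len(inp) - 1)))   # -> True
```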
The number of roots of each polynomial is determined by the temporal discretization N. For a set of M bandlimited signals, there are at most NM distinct roots, which can be used to divide the w_1-axis into NM + 1 regions,

number of labels = NM + 1

Thus the VC dimension is determined by the sampling rate. To achieve d_VC = M, we choose a sufficiently large N given by

N = ⌈2^M/M⌉    (4.6)
²A Gray's code is an ordering of the binary numbers 0, ..., 2^M − 1 such that adjacent numbers differ in only one digit. For our purposes, we choose a Gray's code in which all digits change state the same number of times, namely 2^M/M times.
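A Gray code with the adjacency property is easy to generate; a minimal sketch using the standard reflected binary code (which has the one-digit-change property, though not the balanced per-digit change counts the footnote calls for):

```python
def gray(m):
    """Reflected binary Gray code on m digits: 2**m labels in an order
    where adjacent labels differ in exactly one digit."""
    return [i ^ (i >> 1) for i in range(2 ** m)]

codes = gray(3)
print([format(c, "03b") for c in codes])
# every adjacent pair differs in exactly one bit
print(all(bin(a ^ b).count("1") == 1 for a, b in zip(codes, codes[1:])))  # -> True
```

Since each of the 2^M − 1 transitions costs a single digit change, NM roots suffice once NM ≥ 2^M − 1, which is the source of the bandwidth requirement in equation 4.6.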
Figure 3: A set of shattered inputs. M = 2 input signals are constructed such that there exist neuronal time constants τ_1, τ_2, τ_3, τ_4 that induce all 2^M = 4 possible labelings.
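A concrete version of the Figure 3 construction can be written down directly (the polynomials and test values of w_1 below are hypothetical numbers, not those of the figure): two inputs, represented as polynomials in w_1, whose sign patterns run through the Gray order 00 → 01 → 11 → 10, so that four values of w_1 realize all four labelings.

```python
import numpy as np

# Input 1's digit flips once, at w1 = 0.5; input 2's digit flips at 0.25 and
# 0.75 (positive in between). Together the sign patterns follow a Gray code.
p1 = np.array([1.0, -0.5])            # w1 - 0.5
p2 = -np.poly([0.25, 0.75])           # -(w1 - 0.25)(w1 - 0.75)

def labels(w1):
    return (int(np.polyval(p1, w1) > 0), int(np.polyval(p2, w1) > 0))

ws = [0.1, 0.4, 0.6, 0.9]             # one "time constant" per region
print([labels(w) for w in ws])        # -> [(0, 0), (0, 1), (1, 1), (1, 0)]
```

All 2^2 = 4 labelings appear, so this pair of inputs is shattered by the family of kernels indexed by w_1.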
where ⌈·⌉ indicates rounding up to the next integer. This shows that with a sufficiently high sampling rate an arbitrarily high VC dimension can be achieved. Since N is determined by the sampling rate of a continuous signal, the VC dimension of a signal of infinite bandwidth is unbounded. It is important to note, however, that the dependence of the VC dimension on the signal bandwidth is only logarithmic, and therefore the divergence is weak.

4.3 Threshold: Preventing Premature Discharge. The model we have been considering (equations 3.1 and 3.2) has no reset; V does not depend on whether V(t) exceeds threshold at any time within the interval 0 ≤ t ≤ t_f. This of course is not the expected behavior from an integrate-and-fire model. Typically, integrate-and-fire models reset V(t) → 0 after discharging (sometimes imposing a refractory period as well). The inputs we have constructed will not typically be shattered by an integrate-and-fire model with reset. However, by using a nonzero threshold, we can construct a new set of inputs that is shattered. First we set the threshold to exceed the maximum over the interval, θ > max_{t<t_f} V(t). We now add the threshold to the final term of each input signal (corresponding to the constant term of the associated polynomial) to create a new set of inputs I'. These new inputs differ only at I_0,

I'_0 = I_0 + θ
(Note that because of the definition of I in equation 4.1, I_0 corresponds to I(t_f), i.e., it is the last point of the sampled waveform.) Since w_0 = 1, this shift guarantees that digit changes, which previously occurred when V crossed zero for different values of w_1, now occur when V crosses θ.

5 Special Cases and Extensions

5.1 VC Dimension for Purely Positive Inputs. The VC dimension of a system with purely positive inputs is 1. This is of interest when considering inputs generated by purely excitatory synaptic inputs. To show this, we note first that by construction, the shattering inputs oscillate around 0. That is, for each input, successive points I_i and I_{i+1} have opposite signs. This follows from equation 4.5: the nth order coefficient is generated by the sum of products of N − n negative terms (since r_i > 0), which is positive if N − n is even and negative otherwise. Conversely, if I is purely positive, then the roots are all negative or complex. They are therefore physically unrealizable under our assumptions (equation 4.3). Thus the VC dimension is 1. Adding a threshold creates at most one new root.

5.2 Passive Dendritic Trees. In the integrate-and-fire model we have been considering, the integrating kernel consists of a single exponential time constant, corresponding to a single RC circuit. One generalization of this model is to passive dendritic trees. We use the classic result that the convolution kernel (i.e., the Green's function) can be approximated as the sum of z exponentials,
W(t) = c_1 e^(−t/τ_1) + ... + c_z e^(−t/τ_z)

Then the voltage can be represented by the convolution of the input with this kernel, as in equation 3.1. Discretizing as before, we have

V = Σ_{i=0}^{N−1} I_i w_i

where

w_i = c_1 (w_1)^i + ... + c_z (w_z)^i

with w_k = e^(−Δt/τ_k). The effect of the dendritic tree is therefore to increase the number of roots for a given bandwidth from NM to NzM, since for each of the M inputs there are now zN rather than N roots. The requisite bandwidth N
to shatter M inputs is now

N = ⌈2^M/(zM)⌉    (5.1)

where as before ⌈·⌉ indicates rounding up to the next integer. This is less by a factor of z than in the case of a single exponential.

5.3 The Effect of Input Noise. Finding that a concept class has unbounded VC dimension should be taken as a sign that issues of noise, precision, and physical realizability are the only bounds on PAC generalization. For instance, convex polygons in the plane have unbounded VC dimension. This is in contrast to a finite VC dimension, which means that even with unlimited precision and zero noise, there is a PAC generalization bound. Here we consider the effect of noise added to the inputs. In general, noise in a system with signal power constraints determines a maximum resolvable frequency, which in the present context determines N, the signal bandwidth. The VC dimension depends only logarithmically on N, so although equation 4.6 is formally a divergence of the VC dimension, this divergence is only logarithmic, and therefore weak. In practice, for any physically realizable system, the VC dimension given by equation 4.6 will be quite small. Another way to think about this is to suppose that
I' = I + n    (5.2)

where n is gaussian white noise. From equation 4.5 we have

V' = Σ_{i=0}^{N−1} (I_i + n_i)(w_1)^i    (5.3)

= Σ_{i=0}^{N−1} n_i (w_1)^i + Σ_{i=0}^{N−1} I_i (w_1)^i    (5.4)

= z + V    (5.5)

where n_i and I_i are the ith components of the noise and input, respectively. The first term (the random variable z) is the weighted sum of N iid gaussian variables, so it is also gaussian, with variance σ_z^2; the second term is just V in the noise-free case. So how does this affect the VC dimension? Noise in this sense does not fall into the classical VC framework (but see Bartlett et al. 1994). Nevertheless, the effect is clear: there is some probability P_e of misclassifying each input. This probability depends on V and on z: it is the probability that sgn(V + z) ≠ sgn(V). If both z and V are 0-mean, then this is just Prob(|V| − |z| < 0)/2 (we divide by 2 because half the time z has the same sign as V and so does not change its sign).
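For a fixed noise-free voltage V, the flip probability P[sgn(V + z) ≠ sgn(V)] has the closed form 1 − Φ(|V|/σ), which a quick Monte Carlo sketch confirms (the values of V and σ below are arbitrary illustrations):

```python
import math, random

random.seed(0)

def p_flip(V, sigma, trials=200_000):
    """Monte Carlo estimate of P[sgn(V + z) != sgn(V)] for z ~ N(0, sigma^2)."""
    flips = sum((V + random.gauss(0.0, sigma)) * V < 0 for _ in range(trials))
    return flips / trials

V, sigma = 1.0, 1.0
exact = 0.5 * math.erfc(abs(V) / (sigma * math.sqrt(2)))  # 1 - Phi(|V|/sigma)
print(round(p_flip(V, sigma), 3), round(exact, 3))
```

Even at a signal-to-noise ratio of 1 the flip probability is near 16%, illustrating why large signal-to-noise ratios are needed to keep the misclassification error small.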
What is the misclassification error associated with z? This depends on the ratio V/z, which looks like a kind of signal-to-noise ratio. However, the natural measure of the signal strength is I_i, and it is this quantity that should participate in the signal-to-noise ratio. Because of the manner in which they are constructed, for typical signals I_i is largest around i = N/2. If the roots are uniformly distributed between 0 and 1, we can actually estimate the typical signal strength as a function of N just by multiplying out the polynomial. Numerical simulations suggest that very large signal-to-noise ratios are required to keep the error reasonable for even moderate values of N, even larger than those called for by the bandwidth requirements, which after all constitute only an upper bound.

6 Discussion
This is to our knowledge the first application of the VC dimension to a dynamic system. We consider the thresholded output of an integrate-and-fire model to impose a binary partition on a set of continuous-time input signals. We have shown that the VC dimension of this model diverges logarithmically with the input signal bandwidth N. Because our analysis is stronger than the usual VC dimension calculation, the consequences for generalization are slightly more robust to prior knowledge than the generic PAC bound.³

6.1 Implications for Single Neuron Computation. There is an extensive literature demonstrating the computational potential of single neurons and networks. Koch et al. (1982) showed how an AND-NOT of two inputs could be performed in the passive dendritic tree of a retinal ganglion cell, and suggested that this might play a role in the computation of directional selectivity. Shepherd and Brayton (1987) implemented a complete set of logic operations at single spines using Hodgkin-Huxley channels and inhibitory inputs for NOT. Zador et al. (1992) showed how calcium- and voltage-dependent channels could implement a kind of temporal XOR in the dendritic tree, without additional inhibitory inputs. Maass (1996) has shown that networks of simple spiking neurons possess rich computational properties in the sense of complexity theory.

³To show that the VC dimension of a concept class is at least M, one must show that there exists some set of 2^M concepts that shatters some set of M inputs. Here we have shown something more general, since our construction proceeds for any set of 2^M concepts from our concept class. Given any set of 2^M concepts, we can find a set of M inputs that these concepts shatter. This has consequences in the application to PAC learning (Valiant 1984), where it corresponds to generalizing one of the two worst-case assumptions of the PAC criterion. So the PAC lower bound on generalization here requires the usual worst-case assumption over distributions of inputs, but remains true for any reasonable distribution over concepts.
None of these demonstrations attempted a quantification of the overall computational capacity of a single neuron. To our knowledge, the only attempt to quantify the ability of a single neuron to partition an input space is Mel (1992). He implemented a model of a cortical neuron with nonlinear NMDA conductances in the dendritic tree, and with a biologically motivated Hebb rule trained it to partition 100 high-dimensional patterns into two classes. The error rate on this set using various measures was about 10%. Note that the model class we consider, purely passive dendritic trees with integrate-and-fire nonlinearities, is more restricted than the NMDA-based nonlinearities considered by Mel (1992). We have described a more formal approach to the analysis of single neuron computation. This approach takes into account the temporal structure of inputs. It puts a bound on the ability of a simple model to partition an input space. Because of the exponential dependence of the signal bandwidth required to achieve a given VC dimension (equation 4.6), under reasonable physical assumptions the VC dimension must be quite small. The exquisite sensitivity to noise in the inputs further limits the number of inputs that could be shattered by any physically realizable system. This number can be considered to be less than 10. The model we consider is of course a caricature of a real neuron; the dynamics of real neurons are much more complex (see, e.g., McKenna et al. 1992). The leaky integrate-and-fire model with reset is nevertheless a standard starting point for considering dynamic aspects of neuron behavior. A recent careful examination of its validity (Koch et al. 1995) supports the notion that for rapidly varying input signals of the kind considered here it offers a good first approximation.

6.2 The VC Dimension and Dynamic Systems. A useful rule of thumb is that the VC dimension often turns out to be roughly proportional to the number of free parameters. This is true, for example, in feedforward linear threshold networks, where the VC dimension is equal to the number of free weights, up to a logarithmic factor (Baum and Haussler 1989). In our case, we expected the VC dimension to be about two, since there were two free parameters (θ and τ). Furthermore, a small VC dimension for the integrate-and-fire model conforms to our intuitive notion of the simplicity of this model. In fact, equation 4.1 shows that the integrate-and-fire model can be considered as a kind of perceptron, and thus can impose linear partitions only on the input space. We were therefore surprised to find that for noiseless inputs, the VC dimension was unbounded. However, the apparent power of the integrate-and-fire unit arises not from a nonlinear partitioning, but rather from a linear partitioning in a space of unbounded dimension.⁴ A similar dilemma arises when the discrete formulation of Shannon entropy is

⁴We speculate that local dendritic processing gives rise to a real increase in computational power, one that arises from a nonlinear partitioning of a space of fixed dimension.
applied to continuous variables: the information content of a noiseless random variable is infinite (since, for example, any message can be encoded in its decimal expansion). Any finite noise, of course, renders finite the discrete entropy of the continuous variable. Just as the discrete entropy of a continuous variable becomes finite in the presence of noise, so the unbounded VC dimension collapses when any notion of noise is included. We considered two ways noise could limit the VC dimension. First, the bandwidth of the signal is implicitly due to noise, and the VC dimension diverges logarithmically with the input signal bandwidth. Second, we considered the effect of noise added explicitly to the signal, and found that the probability of misclassification was a very steep function of the VC dimension. In both cases, the apparent VC dimension in the presence of noise conformed much more closely with our intuitive notion that it should be rather small. It will be interesting to see whether related notions of computational capacity, such as those derived from work on average generalization (Haussler et al. 1994), can be extended to dynamic systems in a similar way.
Acknowledgments

We are grateful to the reviewers for their constructive and thought-provoking comments, and Paul Zador for many useful discussions. We also thank Christof Koch for support in this work, which was initiated while the authors were in Tom Brown's laboratory at Yale University. This research was funded by grants from the ONR and the NIMH Center for Neuroscience to Christof Koch.
References

Abu-Mostafa, Y. S. 1989. The Vapnik-Chervonenkis dimension: Information versus complexity in learning. Neural Comp. 1(3), 312-317.
Anthony, M. 1994. Probabilistic analysis of learning in artificial neural networks: The PAC model and its variants. Mathematics preprint series LSE-MPS-67, Department of Mathematics, London School of Economics and Political Science, London, UK. Also available as NeuroCOLT Tech. Rep. NC-TR-94-3.
Bartlett, P. L., Long, P. M., and Williamson, R. C. 1994. Fat-shattering and the learnability of real-valued functions. In Seventh Annual ACM Workshop on Computational Learning Theory, New Brunswick, NJ, 299-310.
Baum, E., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1(1), 151-160.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. 1989. Learnability and the Vapnik-Chervonenkis dimension. J. ACM 36, 929-965.
Brown, T. H., Zador, A. M., Mainen, Z. F., and Claiborne, B. J. 1992. Hebbian computations in hippocampal dendrites and spines. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds., pp. 81-116. Academic Press, San Diego.
Haussler, D., Kearns, M., Seung, H. S., and Tishby, N. 1994. Rigorous learning curve bounds from statistical mechanics. In Seventh Annual ACM Workshop on Computational Learning Theory, New Brunswick, NJ, 67-75.
Koch, C., Poggio, T., and Torre, V. 1982. Retinal ganglion cells: A functional interpretation of dendritic morphology. Proc. Royal Soc. London B 298, 227-264.
Koch, C., Bernander, O., and Douglas, R. J. 1995. Do neurons have a voltage or a current threshold for action potential initiation? J. Comp. Neurosci. 2, 63-82.
Maass, W. 1995. Vapnik-Chervonenkis dimension of neural networks. In Handbook of Brain Theory and Neural Networks, M. A. Arbib, ed., pp. 1000-1002. MIT Press, Cambridge, MA.
Maass, W. 1996. Lower bounds for the computational power of networks of spiking neurons. Neural Comp. 8(1), 1-40.
McKenna, T., Davis, J., and Zornetzer, S. F. (eds.). 1992. Single Neuron Computation. Academic Press, San Diego.
Mel, B. W. 1992. NMDA-based pattern discrimination in a modeled cortical neuron. Neural Comp. 4(4), 502-516.
Shepherd, G., and Brayton, R. 1987. Logic operations are properties of computer-simulated interactions between excitable dendritic spines. Neuroscience 21, 151-166.
Valiant, L. G. 1984. A theory of the learnable. Commun. ACM 27(11), 1134-1142.
Vapnik, V., and Chervonenkis, A. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 264-280.
Zador, A. M., Claiborne, B. J., and Brown, T. H. 1992. Nonlinear pattern separation in single hippocampal neurons with active dendritic membrane. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds., pp. 51-58. Morgan Kaufmann, San Mateo, CA.
Received October 14, 1994; accepted August 31, 1995.
Communicated by Wolfgang Maass
The VC Dimension and Pseudodimension of Two-Layer Neural Networks with Discrete Inputs

Peter L. Bartlett
Robert C. Williamson
Department of Systems Engineering, Research School of Information Sciences and Engineering, Australian National University, Canberra 0200, Australia

We give upper bounds on the Vapnik-Chervonenkis dimension and pseudodimension of two-layer neural networks that use the standard sigmoid function or radial basis function and have inputs from {-D, ..., D}^n. In Valiant's probably approximately correct (pac) learning framework for pattern classification, and in Haussler's generalization of this framework to nonlinear regression, the results imply that the number of training examples necessary for satisfactory learning performance grows no more rapidly than W log(WD), where W is the number of weights. The previous best bound for these networks was O(W^4).

In using neural networks for pattern classification and regression tasks, it is important to be able to predict how much training data will be sufficient for satisfactory performance. Valiant's probably approximately correct (pac) framework (Valiant 1984) provides a formal definition of "satisfactory learning performance" for {0,1}-valued functions. Blumer et al. (1989) present upper and lower bounds on the number of examples necessary and sufficient for learning under this definition. These bounds depend linearly on the Vapnik-Chervonenkis dimension of the function class used for learning. Haussler (1992) presents a generalization of the pac framework that applies to the problem of learning real-valued functions. In this case, the pseudodimension of the function class gives an upper bound on the number of examples necessary for learning. Baum and Haussler (1989), Maass (1992), Sakurai (1993), Bartlett (1993), and Goldberg and Jerrum (1993) present upper and lower bounds on the VC-dimension of threshold networks and networks with piecewise polynomial output functions.
Most of these results can easily be extended to give bounds on the pseudodimension. However, these networks are not commonly used in applications because the most popular learning algorithm, the backpropagation algorithm, relies on differentiability of the units' output functions. In practice, sigmoid networks and radial basis function (RBF) networks are most widely used. Recently, Macintyre and Sontag (1993) showed that the VC-dimension and pseudodimension of these networks are finite, and Karpinski and Macintyre

Neural Computation 8, 625-628 (1996) © 1996 Massachusetts Institute of Technology
(1995) proved a bound of O(W^2 N^2), where W is the number of weights in the network and N is the number of processing units. In this note, we show that the VC-dimension and pseudodimension of two-layer sigmoid networks and radial basis function networks with discrete inputs is O(W log(WD)), where the input domain is {-D, ..., D}^n. In many pattern classification and nonlinear regression tasks for which neural networks have been used, the set of inputs is a small finite set of this form. For the special case of sigmoid networks with binary inputs, our bound is within a log factor of the best known lower bounds (Bartlett 1993). The result follows from the observation that a network with discrete inputs can be represented as a polynomially parameterized function class; hence VC-dimension bounds for such classes can be applied.

Suppose X is a set, and F is a class of real-valued functions defined on X. For x = (x_1, ..., x_m) ∈ X^m and r = (r_1, ..., r_m) ∈ R^m, we say that F shatters ((x_1, r_1), ..., (x_m, r_m)) if for all sign sequences s = (s_1, ..., s_m) ∈ {-1, 1}^m, there is a function f in F such that s_i[f(x_i) − r_i] > 0 for i = 1, ..., m. The pseudodimension of F is the length m of the largest shattered sequence. For pattern classification problems, we typically consider a class of {0,1}-valued functions obtained by thresholding a class of real-valued functions. Define the threshold function H : R → {0,1} as H(α) = 1 if and only if α ≥ 0. If F is a class of real-valued functions, let H(F) denote the set {H(f) : f ∈ F}. The Vapnik-Chervonenkis dimension of a class F of real-valued functions defined on X is the size of the largest sequence of points that can be classified arbitrarily by H(F),

VCdim(F) = max{m : ∃x ∈ X^m, H(F) shatters ((x_1, 1/2), ..., (x_m, 1/2))}

Clearly, VCdim(F) ≤ dim_P(F). The function classes considered in this note can be indexed using a real vector θ of parameters. Let Θ and X be the parameter and input spaces, respectively, and let f : Θ × X → R. The function f defines a parameterized class of real-valued functions defined on X, {f(θ, ·) : θ ∈ Θ}. We also denote this function class by f.
Definition 1. A two-layer sigmoid network with n inputs, W weights, and a single real-valued output is described by the function f_S : R^W × X → R, where X ⊆ R^n,

f_S(θ, x) = a_0 + Σ_{i=1}^k a_i / (1 + e^(−(b_i·x + b_{i0})))

with a_i ∈ R, b_i = (b_{i1}, ..., b_{in}) ∈ R^n, and θ = (a_0, ..., a_k, b_{10}, ..., b_{kn}) ∈ R^W. (For x, y ∈ R^n, x·y = Σ_{i=1}^n x_i y_i.) In this case, W = kn + 2k + 1. A radial basis function (RBF) network is described by the function

f_RBF(θ, x) = a_0 + Σ_{i=1}^k a_i e^(−‖x − c_i‖^2)

with a_i ∈ R, c_i = (c_{i1}, ..., c_{in}) ∈ R^n, and θ = (a_0, ..., a_k, c_{11}, ..., c_{kn}) ∈ R^W. (If x ∈ R^n, ‖x‖^2 = x·x.) Here, W = kn + k + 1.
Theorem 2. Let X = {-D, ..., D}^n for some positive integer D. For the sigmoid and RBF networks f_S, f_RBF : R^W × X → R, we have

VCdim(f_S) ≤ dim_P(f_S) < 2W log_2(24eWD)
VCdim(f_RBF) ≤ dim_P(f_RBF) < 4W log_2(24eWD)

The proof of Theorem 2 follows from the simple observation that the function classes f_S and f_RBF can be expressed as a polynomial in some transformed set of parameters when the inputs are integers. We can then use an upper bound from Goldberg and Jerrum (1993) on the VC-dimension of such a function class.

Proof. Consider the function f_S defined in Definition 1. For any θ ∈ Θ, x ∈ X, and r ∈ R, let
f_S'[θ, (x, r)] = [f_S(θ, x) - r] ∏_{i=1}^{k} [(1 + e^{-(b_i · x + b_{i0})}) ∏_{j=1}^{n} (e^{-b_{ij}})^D]
Clearly, fs,[O,( x . r ) ] always has the same sign as f s ( 0 , x ) - Y, since the denominators in fs(8,x) are always positive, so dimpus) 5 VCdimus)). But fs, [8, (x,r)] is polynomial in 8' = (ao, . . . ak. ecblo? . . . .ecbk,l),with degree no more than 2Dnk + k + Dn + 1 < 3DW. Theorem 2.2 in Goldberg and Jerrum (1993) implies that VCdimCfst) < 2Wlog2(24eWD). Similarly, ~ R B F ( H ,x) - r has the same sign as fRBF/[Q,(x,r)], where fmF![e,
(x,I)Y
k
=
n
x) - rI r]:~ e Z c ~ l D
[~RBF(~,
1=11=1
n:n: k
= (ao- r )
n
e2ciJD
1x1 /=1
Again, f_RBF'[θ, (x, r)] is polynomial in θ' = (a_0, ..., a_k, e^{2c_{11}}, ..., e^{2c_{kn}}, e^{-c_{11}^2}, ..., e^{-c_{kn}^2}),
Peter L. Bartlett and Robert C. Williamson
with degree no more than nD(k + 2) + 2 < 3DW. As above, dim_P(f_RBF) < 4W log_2(24eWD). □
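The key step, written out for one sigmoid factor (this expansion is ours, using the notation above), shows why integer inputs make the transformed class polynomial:

```latex
\Bigl(1 + e^{-(b_i \cdot x + b_{i0})}\Bigr)\prod_{j=1}^{n}\bigl(e^{-b_{ij}}\bigr)^{D}
  \;=\; \prod_{j=1}^{n}\bigl(e^{-b_{ij}}\bigr)^{D}
  \;+\; e^{-b_{i0}}\prod_{j=1}^{n}\bigl(e^{-b_{ij}}\bigr)^{x_j + D}.
```

Since x_j + D ∈ {0, 1, ..., 2D}, each factor is a polynomial of degree at most 2Dn + 1 in the transformed parameters e^{-b_{i0}}, e^{-b_{i1}}, ..., e^{-b_{in}}, which is what makes the Goldberg-Jerrum bound applicable.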
The same argument can be used to show that two-layer sigmoid or RBF networks with W weights, {-D, ..., D}-valued inputs, and arbitrary connectivity have pseudodimension (and hence VC-dimension) O(W log(WD)). From the results in Karpinski and Macintyre (1995), it is clear that the dependence on D, the size of the input set, is not necessary in these upper bounds.
Acknowledgments
This research was supported by the Australian Telecommunications and Electronics Research Board and by the Australian Research Council. Thanks to Sridevan Parameswaran for help in obtaining references.
References

Bartlett, P. L. 1993. Vapnik-Chervonenkis dimension bounds for two- and three-layer networks. Neural Comp. 5(3), 353-355.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. 1989. Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Mach. 36(4), 929-965.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Goldberg, P., and Jerrum, M. 1993. Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Proc. Sixth ACM Workshop Comput. Learning Theory, 361-369.
Haussler, D. 1992. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inform. Comput. 100, 78-150.
Karpinski, M., and Macintyre, A. 1995. Polynomial bounds for VC dimension of sigmoidal neural networks. Proc. 27th Annu. Symp. Theory Comput., pp. 200-208.
Maass, W. 1992. Bounds for the computational power and learning complexity of analog neural nets. Tech. Rep., Graz University of Technology.
Macintyre, A., and Sontag, E. D. 1993. Finiteness results for sigmoidal 'neural' networks. Proc. 25th Annu. Symp. Theory Comput., pp. 325-334.
Sakurai, A. 1993. Tighter bounds on the VC-dimension of three-layer networks. World Congr. Neural Networks.
Valiant, L. G. 1984. A theory of the learnable. Commun. ACM 27(11), 1134-1143.
Received March 31, 1994; accepted August 14, 1995.
Communicated by Steven Luttrell
A Theoretical and Experimental Account of n-Tuple Classifier Performance

Richard Rohwer and Michal Morciniec
Department of Computer Science and Applied Mathematics, Aston University, Birmingham B4 7ET, UK
The n-tuple recognition method is briefly reviewed, summarizing the main theoretical results. Large-scale experiments carried out on StatLog project datasets confirm this method as a viable competitor to more popular methods due to its speed, simplicity, and accuracy on the majority of a wide variety of classification problems. A further investigation into the failure of the method on certain datasets finds the problem to be largely due to a mismatch between the scales which describe generalization and data sparseness.
1 Introduction
The n-tuple classifier, invented by Bledsoe and Browning in 1959 (Bledsoe and Browning 1959), is one of the oldest practical pattern recognition methods based on distributed computation and amenable to description in terms of neural network metaphors. Although eclipsed in popularity by methods such as multilayer perceptrons and radial basis function networks, the n-tuple method continues to offer properties that make it vastly superior for certain common purposes. First among these properties is its lightning speed. The training algorithm is a one-shot memorization task, computationally trivial compared to solving linear systems or minimizing nonlinear functions. Another advantage is the sheer simplicity of learning by memorization. If this can form the basis of a sound pattern recognition principle, then it is arguable that biological systems could make use of it. It is prudent to suspect that relatively poor performance will accompany the speed and simplicity of the n-tuple algorithm, so a large comparative experimental study was carried out. The results, reported in Section 5, are reassuringly strong for most datasets, but very poor for a few. A major contributing factor to the failures was found. This is explained in Section 6 in terms of the theory developed in Sections 3 and 4.

Neural Computation 8, 629-642 (1996)
© 1996 Massachusetts Institute of Technology
2 The n-Tuple Recognition Method
The n-tuple recognition method is also known as a type of "RAMnet"¹ or "weightless neural network." It forms the basis of a commercial product (Aleksander et al. 1984). It is a method for classifying binary patterns, which can be regarded as bit strings of some fixed length L. This is not an important restriction, because there is an efficient preprocessing method, tailored to the RAMnet's generalization properties, for converting scalar attributes into bit strings. This method is reviewed in Section 4.

2.1 Definition of the Standard n-Tuple Method. Several (let us say N) sets of n distinct² bit locations are selected randomly. These are the n-tuples. Collectively, they are called the "input mapping" η. Though random, the input mapping is a fixed property of any one classifier. The restriction of a pattern to an n-tuple can be regarded as an n-bit number that, together with the identity of the n-tuple, constitutes a "feature" of the pattern. These features, the set of N n-bit patterns defined by η, comprise all the information from the original L-bit pattern available to the recognizer. The standard n-tuple recognizer operates simply as follows:

A pattern is classified as belonging to the class for which it has the most features in common with at least 1 training pattern of that class.   (2.1)

This is the θ = 1 case of a more general rule whereby the class assigned to unclassified pattern u is

argmax_c Σ_{i=1}^{N} Θ_θ( Σ_{v ∈ D_c} δ_{α_i(u), α_i(v)} )   (2.2)
where D_c is the set of training patterns in class c, Θ_θ(x) = x for 0 ≤ x ≤ θ and Θ_θ(x) = θ for x ≥ θ, δ_{ij} is the Kronecker delta³ (δ_{ij} = 1 if i = j and 0 otherwise), and α_i(u) is the ith feature of pattern u:

α_i(u) = Σ_{j=1}^{n} 2^{j-1} u_{η_i(j)}   (2.3)
Here u_k is the kth bit of u and η_i(j) is the jth bit location of the ith n-tuple. In the standard manner, equation 2.3 assigns a number between 0 and 2^n - 1 to the ordered collection of n bit values that are selected by η from

¹RAMnets also include stochastic generalizations, pRAMs, to which the n-tuple recognition algorithm is not applied. These are not considered here.
²Relaxing the requirement that an n-tuple has n different bit locations amounts to introducing a mixture of differently sized n-tuples. Note the restriction does not disallow a single pattern bit from being shared by more than one n-tuple.
³The comma is unconventional but is used here optionally for extra clarity.
the L bits comprising the entire pattern u. Expression 2.2 simply counts the number of matching features in patterns u and v. With C classes to distinguish, the system can be implemented as a network of NC nodes, each of which is a (log₂ θ)-bit random access memory (RAM); hence the term RAMnet. [Equivalently, it is a network of N RAMs, each containing a C-dimensional vector of (log₂ θ)-bit integers.] To train the network, the memory content m_{c,i,α} at address α of the ith node allocated to class c is set to

m_{c,i,α} = Θ_θ( Σ_{v ∈ D_c} δ_{α, α_i(v)} )   (2.4)

In the usual θ = 1 case, the 1-bit content of m_{c,i,α} is set if any pattern of D_c has feature α and unset otherwise. Recognition is accomplished by summing the contents of the nodes of each class at the addresses given by the features of the unclassified pattern. That is, pattern u is assigned to class

argmax_c Σ_{i=1}^{N} m_{c,i,α_i(u)}   (2.5)
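As a concrete illustration, the θ = 1 recognizer of equations 2.3-2.5 can be sketched in a few lines of Python (class and variable names are ours; this is an illustrative sketch, not the authors' implementation):

```python
import random

class NTupleClassifier:
    """Standard n-tuple (RAMnet) recognizer with threshold theta = 1."""

    def __init__(self, L, n=8, N=100, n_classes=2, seed=0):
        rng = random.Random(seed)
        # Input mapping eta: N tuples of n distinct bit locations.
        self.eta = [rng.sample(range(L), n) for _ in range(N)]
        # One 2^n-entry 1-bit RAM per (class, tuple); a set holds the set bits.
        self.mem = [[set() for _ in range(N)] for _ in range(n_classes)]

    def features(self, u):
        # Eq. 2.3: the i-th feature is the n-bit number read off tuple i.
        return [sum(u[loc] << j for j, loc in enumerate(tup))
                for tup in self.eta]

    def train(self, u, c):
        # Eq. 2.4 with theta = 1: one-shot memorization of features.
        for i, alpha in enumerate(self.features(u)):
            self.mem[c][i].add(alpha)

    def classify(self, u):
        # Eq. 2.5: class with the highest sum of addressed RAM contents.
        scores = [sum(alpha in ram[i]
                      for i, alpha in enumerate(self.features(u)))
                  for ram in self.mem]
        return max(range(len(scores)), key=scores.__getitem__)
```

Training is a single pass of memory writes, and recognition is N table lookups per class, which is the source of the speed advantage discussed in Section 3.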
3 Theoretical Considerations
The n-tuple classifier is a memory-based method. Such methods differ from optimization-based methods, such as backpropagation of error through multilayer perceptrons, in two important ways. First, "hidden" representations (or "features") are selected randomly, and second, training is a simple one-shot memorization task involving these features. These differences give memory-based methods their impressive advantage in training speed. Radial basis functions obtain part of this speed advantage by selecting features randomly (Broomhead and Lowe 1988), and multilayer perceptrons can often be trained faster with little or no loss of performance by using fixed random weights into the hidden layers (Gallant and Smith 1987; Sutton and Whitehead 1993). However, this does not give the speed and simplicity that training by mere memorization provides. Another system that embraces the basic principles of the n-tuple recognizer is Kanerva's sparse distributed memory (Kanerva 1988), which has close formal similarities and admits similar theoretical treatment (Morciniec and Rohwer 1994). In spite of many useful advances (Austin 1994; Aleksander and Stonham 1979; Flanagan et al. 1992; Bledsoe and Bisson 1962; Ullmann and Dodd 1969), there is no theory of n-tuple networks of the standard of the sophisticated statistical techniques available with optimization-based methods (MacKay 1992). It is not particularly difficult to design new training algorithms for the n-tuple architecture to make these statistical methods applicable (Tattersall et al. 1991; Luttrell 1992; Rohwer 1995), but
this approach sidesteps the interesting questions instead of answering them. These modified methods reduce the speed and simplicity advantages as well.

3.1 Qualitative Theory and Practical Experience. Simple theoretical considerations and practical experience provide fairly strong guidance for setting the architectural parameters n, N, and θ. To begin with, the fact that the network response to an arbitrary pattern is essentially an average over the n-tuples (2.5) means that the results should become increasingly consistent with increasing N. Because a run typically takes only a few seconds, it is practical to rerandomize η a few times to check that results are consistent. Practical experience is that values of N from 100 to 1000 usually turn out to be adequate. Practical experience tends to favor small values of the threshold θ, particularly θ = 1. Many considerations apply to the choice of n-tuple size n. Experimentally it usually turns out that bigger is better, up to an impractically large size (Rohwer and Lamb 1993), but 8 is usually enough, and 3 is sometimes adequate. This can be explained qualitatively by observing that information about the correlations among up to n bits is available to the classifier. It never hurts to take account of higher-order correlations, but it is plausible that eighth-order correlations contain all that is needed for most data sets. Another intuition is that the training process should write to neither too small nor too great a proportion of the 2^n addresses at each node. If n is too large, the subpatterns occurring in the training data will be unlikely to recur in the test data, whereas if n is too small, the memory can saturate, in which case m_{c,i,α} = θ for most memory locations, so most discriminative power is lost (Ullmann 1969; Tarling and Rohwer 1993). These issues are further complicated if the class priors are highly skewed, so that one class has far more training data than another.

3.2 Tuple Distance and Hamming Distance.
The main theoretical results on the n-tuple method provide a relationship between a "tuple distance" relevant to the network's generalization properties, and the Hamming distance between training and test patterns. The tuple distance ρ(u, v) between patterns u and v is the number of tuples (of a given input mapping) on which the patterns disagree:

ρ(u, v) = N - Σ_{i=1}^{N} δ_{α_i(u), α_i(v)}   (3.1)

The number on which they agree, N - ρ(u, v), will be called the "tuple score." An elementary argument based on the random selection of the n-tuple inputs from the L bits available shows that patterns v that lie a fixed Hamming distance H(u, v) from any one pattern u are distributed
binomially in tuple distance:

P(ρ | H) = C(N, ρ) p^ρ (1 - p)^{N-ρ},   p = 1 - (1 - H/L)^n   (3.2)

More complicated expressions are available for more constrained n-tuple sampling procedures (Tattersall and Johnson 1984). The expected value of ρ under this distribution is easily shown to be about

⟨ρ | H⟩ ≈ N (1 - e^{-nH/L})   (3.3)

for nearby patterns (H << L). It is clear from 3.3 that proximity in Hamming distance plays a role in the generalization behavior of n-tuple networks. Consider a network trained on just one example v of class c, and tested on a pattern u at Hamming distance H from v. Classifications are based on the network response to pattern u, which will be about N e^{-nH/L}. Hence one could say that the network generalizes from training pattern v to all patterns within a Hamming distance of about L/n of v, the "generalization distance." A network trained on a set of patterns {v_1, ..., v_T} could respond to pattern u by any amount between N min(1, max_t e^{-nH(u, v_t)/L}) and N min(1, Σ_t e^{-nH(u, v_t)/L}), depending on the correlations between the training patterns, as manifest in "overlap effects." Unfortunately, this circumstance limits the usefulness of tuple distance for explaining the standard n-tuple method. Because of a combinatorial explosion, there is no feasible method of measuring tuple correlations. Similar problems have been reported to cripple an attempted full formal analysis of the method (Stonham 1977) for data sets of arbitrary size. However, smallness of the upper bound will be seen to be associated with cases of poor performance in the experiments reported here.
4 Preprocessing of Scalar Attributes
A RAMnet classifies bit strings, but the attributes of the patterns in the StatLog data sets are mostly real numbers or integers. Given that generalization from numerical attributes should be related to arithmetic differences, and generalization in RAMnets is related to Hamming distances, it is important to transform numbers into bit strings in such a way that numerical proximity is transformed into Hamming proximity. A memory-efficient method tailored to the generalized Hamming distance underlying RAMnet generalization has been devised (Allinson and Kolcz 1994), using a combination of CMAC and Gray coding techniques. The prescription for encoding integer x is to concatenate K bit strings, the jth of which (counting from 1) is (x + j - 1)/K, rounded down and expressed as a Gray code. The Gray code of an integer i can be obtained as the
bitwise exclusive-or of i (expressed as an ordinary base 2 number) with i/2 (rounded down). It can be shown that this provides a representation in aK bits of the integers between 0 and (2^a - 1)K inclusive, such that if integers x and y differ arithmetically by K or less, their codes differ by Hamming distance |x - y|, and if their arithmetic difference is K or more, their corresponding Hamming distance is at least K. This is illustrated in Figure 1, and more comprehensive illustrations are given by Allinson and Kolcz (1993). Because the tuple score decays exponentially with Hamming distance (3.3), there should be relatively little ill effect if a training pattern further than L/n bits away from a test pattern is replaced by another training pattern further than L/n bits away (although L/n is a very approximate estimate). Thus if there are A scalar attributes, one can expect the nonlinearity of the CMAC/Gray mapping to do little harm if K > (L/A)/n. There are aK bits per attribute, so this condition is

a < n   (4.1)

Scalar differences up to ±K fall within the linear region of the mapping. This represents a fraction 2K/[(2^a - 1)K], or about 2^{1-a}, of the largest separation allowed. With a < n, the "generalization Hamming distance" (L/A)/n = aK/n corresponds to a scalar separation of ±aK/n, which is the fraction 2a/[n(2^a - 1)] ≈ (a/n)2^{1-a} (for a > 1) of the largest possible scalar separation. For a = 1, one has K = L/A and the mapping becomes the "thermometer code," in which integer x is mapped to a bit string with the last x bits set and the remaining K - x unset. With the bit strings for each attribute concatenated, the Hamming distance between two patterns is proportional to their Manhattan distance, as long as the arithmetic differences are less than K for each attribute. A technique has also been developed to give Hamming distance more closely proportional to Euclidean distance (Kolcz and Allinson 1994).
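The CMAC/Gray prescription is compact enough to state in code (our own sketch; the function and variable names are ours):

```python
def gray(i):
    """Gray code of i: i XOR (i // 2)."""
    return i ^ (i >> 1)

def cmac_gray_encode(x, K=8, a=5):
    """Encode integer x in [0, (2**a - 1) * K] as a*K bits:
    concatenate K strings, the j-th (counting from 1) being
    gray((x + j - 1) // K) written out in a bits."""
    bits = []
    for j in range(1, K + 1):
        g = gray((x + j - 1) // K)
        bits.extend((g >> b) & 1 for b in range(a))
    return bits

def hamming(p, q):
    return sum(bp != bq for bp, bq in zip(p, q))
```

With K = 8 and a = 5 (the settings of Section 5.1) this gives the 40-bit code of integers in [0, 248]: arithmetic differences up to K map to equal Hamming distances, and larger differences map to Hamming distance at least K, as in Figure 1.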
5 The Experiments

There are many reports of satisfactory results with the n-tuple method (Aleksander and Stonham 1979; Bledsoe and Bisson 1962; Ullmann and Dodd 1969; Ullmann 1969; Tarling and Rohwer 1993; Rohwer and Lamb 1993), but few studies involving comparisons with other methods (Rohwer and Cressy 1989). Furthermore, most studies use just one or two small data sets. Therefore a large experiment was carried out in which the n-tuple method was tested on 11 of the largest real-world data sets that had previously been used by the European Community ESPRIT StatLog project (Michie et al. 1994) to test 23 other classification algorithms, including the most popular neural network methods. The data sets, algorithms, and results are summarized in Tables 1 and 2 and Figure 2.
Figure 1: Hamming distance between two CMAC/Gray-transformed integers vs. their arithmetic difference, for 3 × 10^4 randomly chosen pairs of integers, with K = 8 and a = 5. The relationship is linear for arithmetic differences up to K, and the transformed distance is bounded below by K for greater arithmetic differences.

5.1 Experimental Details. The CMAC/Gray parameters used were K = 8 and a = 5, giving 40-bit representations of the integers in [0, 248]. All scalar attributes were linearly rescaled and rounded to obtain integers in this interval. For the convenience of uniformity, this procedure was applied even to attributes that are Boolean (DNA) or small integers (Letter). The threshold θ was set to 1 in all the experiments reported here, the n-tuple size n was set to 8, and N was set to 1000 n-tuples. Using n = 6 gave similar results. The results reported in Figure 2 are averages over 10 different random input mappings η. The standard deviations are small; if plotted as error bars they would be obscured by the marks representing the means.
5.2 Time and Memory Requirements. Computation time requirements were insignificant in these experiments, which were carried out with a C++ program on a Sun SPARC workstation. For example, an 8-tuple network can be trained on the 2000 57-attribute training patterns of the BelgianII data set in about 49 sec. Sixteen of these seconds are
Table 1: Descriptions of Data Sets Used

Name      | Classes | Largest Prior | Attributes  | Training Patterns | Testing Patterns | Description
BelgianII | 2       | 0.924         | 57 real     | 2000              | 1000             | Classify measurements on a simulated large-scale power system as leading to stable or unstable behavior
Cut50     | 2       | 0.941         | 50 real     | 11220             | 7480             | 50 measurements from a candidate segmentation point in joined handwritten text; classify as a suitable cut point or not; commercially confidential data
Cut20     | 2       | 0.941         | 20 real     | 11220             | 7480             | Best 20 attributes (by stepwise regression) from Cut50
Technical | 91      | 0.230         | 56          | 4500              | 2580             | Commercially confidential; appears to be generated by a decision tree; most attribute values are 0
DNA       | 3       | 0.525         | 180 Boolean | 2000              | 1186             | Sequences of 60 nucleotides (4-valued) classified into 3 categories
SatIm     | 6       | 0.242         | 36 integer  | 4435              | 2000             | 3x3 pixel regions of Landsat images; intensities in 4 spectral bands; classified into 6 land uses at central pixel
Chromo    | 24      | 0.044         | 16          | 20000             | 20000            | Images of chromosomes, reduced to 16 features
BelgianI  | 2       | 0.5664        | 28 real     | 1250              | 1250             | As BelgianII with a smaller simulation; attributes thought to be least informative omitted from simulation
Table 1: Continued.

Name    | Classes | Largest Prior | Attributes   | Training Patterns | Testing Patterns | Description
Tsetse  | 2       | 0.508         | 14 real      | 3500              | 1499             | Classify environmental attributes for presence of tsetse flies
Letter  | 26      | 0.045         | 16 16-valued | 15000             | 5000             | Images of typed capital letters, described by 16 real numbers discretized into 16 integers
Shuttle | 7       | 0.784         | 9 real       | 43500             | 14500            | Classification problem concerning position of radiators on the Space Shuttle; noise-free data
needed just to read in the data, another 4 to do the CMAC/Gray conversion of the floating-point attributes, and the final 29 to train the RAMnet itself. Testing the same 2000 patterns takes slightly longer, 37 sec instead of 29, because a loop over classes is needed within the loop over n-tuples. Detailed timing statistics are not published for the algorithms used in the StatLog project, but it is clear that popular neural network algorithms such as backpropagation and even the relatively fast radial basis functions are slow by comparison. The algorithm is highly parallelizable, so if it were important for the RAMnet to be even faster, special-purpose parallel hardware could be designed or purchased (Aleksander et al. 1984). It would be feasible for a biological system to implement a highly parallel but otherwise trivial calculation along these lines. The storage requirements were moderate in most cases. In the most extreme case (Shuttle) 128 kB of RAM per class was needed.
6 Analysis of Results
The n-tuple method delivered competitive accuracy on 6 of the data sets tested (Shuttle, Letter, Tsetse, BelgianI, Chromo, SatIm), performed modestly on 1 (DNA), and failed entirely on the other 4 (BelgianII, Cut50, Cut20, Technical). The cases in which the method fails are conspicuous for their severity, but otherwise there is no systematic performance gap between the n-tuple method and the others. Further experimental and theoretical analysis was carried out to explain the failures.
Table 2: Synopsis of Algorithms with Symbols Used in Figure 2

RAMnets: (•) n-tuple recognizer

Discriminators: 1-hidden-layer MLP; Radial basis functions; Cascade correlation; Dipol92 (pairwise linear discriminant); Quadratic discriminant; SMART (projection pursuit); Logistic discriminant; Linear discriminant

Methods related to density estimation: CASTLE (prob. decision tree); LVQ (Kohonen); Naive Bayes (independent attributes); k-NN (k nearest neighbors); Kohonen topographic map; ALLOC80 (kernel functions)

Decision trees: (a) NewID; (b) AC2; (c) Cal5; (d) CN2; (e) C4.5; (f) CART; (g) IndCART; (h) BayesTree; (i) ITrule
Figure 2: Results for the n-tuple (•) and other algorithms. Algorithm codes appear in Table 2. Classification error rates increase from left to right, and are scaled separately for each data set so that they equal 1 at the error rate of the trivial method of always guessing the class with the highest prior probability, ignoring the input pattern. (The accuracy of this trivial method is the same as the largest prior, which is listed in Table 1.) The arrows indicate the few cases in which performance was worse than this.

Given that Hamming neighbors tend to determine the classification outcome, it seems sensible to suspect that test patterns in the 4 problematic data sets have a shortage of good neighbors. It turns out that they simply do not have enough neighbors at all, within the distance scales relevant to RAMnet generalization. To generalize properly, a test pattern must have at least 1 training pattern within a Hamming distance of about L/n. Distributed evenly over the A CMAC/Gray-mapped scalar attributes, this is a scalar difference of about (a/n)2^{1-a}, with the attributes scaled to lie between 0 and 1, as explained in Section 4. Therefore each training pattern can provide information about any test pattern that falls within a hypercube of volume roughly [(a/n)2^{1-a}]^A. The number of such cubes required to cover the region of attribute space where test data are likely to appear can be crudely estimated by approximating this region as a hyperrectangle with edge lengths given by the eigenvalues of the sample covariance matrix of the training data. Any eigenvalues smaller than (a/n)2^{1-a} should be rounded up to this value, because the covering cubes must be at least this thick. The number of "generalization hypercubes" required to cover the data region is therefore roughly

∏_{i=1}^{A} max[1, λ_i (n/a) 2^{a-1}]

for 1 < a ≤ n, where the λ_i are the eigenvalues. Figure 3 shows this lower bound on the number of training samples required, for each data set studied, taking a = 5 and n = 8 as in the experiments. Aside from Technical and DNA, the problematic data sets stand out as
several orders of magnitude more deficient in training data than the others, some of which are mildly deficient according to this crude estimate. DNA is special in that its Boolean attributes were treated as integers, so its data distribution will be highly non-gaussian and therefore poorly described by the covariance matrix. The Technical data set turned out to be coverable by just 1 hypercube, according to this estimate. Presumably then, each of its patterns looks the same to the RAMnet, and this accounts for its failure. Perhaps a nonlinear rescaling of its attributes, such as histogram equalization, would help.

It is not possible to address the data deficiencies by supplying more data, especially when several orders of magnitude more samples are needed, but it is possible to tweak the RAMnet parameters to enlarge the "generalization cubes." However, there is less room to maneuver than one would like. To enlarge the cubes, n must be decreased, but this risks degradation of performance due to loss of high-order correlation information. Decreasing n also requires decreasing a, if the constraint (4.1) is to be respected, keeping the generalization distance within the linear region of the CMAC/Gray mapping. Low a values give less memory-efficient representations of scalars at any given resolution. Systematic experiments varying the parameters did not produce significant improvements on the four problematic data sets (or the others). A more far-reaching improvement in the algorithm is required.

Figure 3: The number of hypercubes required to cover the space occupied by the data. The data sets on which the n-tuple classifier performed poorly are printed in boldface. A star denotes the existence of skewed priors.
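The covering estimate of this section can be sketched as follows (our own code; it follows the crude recipe above, treating covariance eigenvalues as hyperrectangle edge lengths):

```python
import numpy as np

def covering_estimate(X, n=8, a=5):
    """Rough number of 'generalization hypercubes' needed to cover the
    region occupied by the data; X has one row per pattern, with
    attributes scaled to [0, 1]."""
    edge = (a / n) * 2.0 ** (1 - a)   # cube edge (a/n) * 2^(1-a)
    lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    # Round eigenvalues smaller than the cube edge up to one cube.
    return float(np.prod(np.maximum(1.0, lam / edge)))
```

With the experimental settings a = 5, n = 8, a data set whose training patterns fill the unit cube in only a few dimensions needs a modest number of cubes, while a high-dimensional, widely spread data set needs astronomically many, which is the pattern reported in Figure 3.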
7 Conclusions Extensive experimental trials, on a scale uncommon for any algorithm, were carried out with the n-tuple classifier. The fact that this was possible at all testifies to the method’s lightning speed, which derives from its simple principle of learning by 1-shot memorization of random features. For the majority of the data sets tested, the n-tuple method was not systematically better or worse than the 23 other methods, including other neural network methods. In 4 of the 11 data sets the method failed entirely. The problem appears to be due to a mismatch between the scale over which the RAMnet generalizes and the sparseness of the data. In spite of its imperfections, the n-tuple method demonstrates that its underlying principle, learning by memorization of random features, is a powerful one. It should be rewarding to develop the theory further, especially by inventing improved methods that incorporate the underlying principle in a more flexible way.
Acknowledgments The authors are grateful to Louis Wehenkel of Universite de Liege for useful correspondence and permission to report results on the Belgian1 and Belgian11 data sets, Trevor Booth of the Australian CSIRO Division of Forestry for permission to report results on the Tsetse data set, and Reza Nakhaeizadeh of Daimler-Benz, Ulm, Germany for permission to report on the Technical, Cut20, and Cut50 data sets.
References

Aleksander, I., and Stonham, T. J. 1979. Guide to pattern recognition using random-access memories. Comput. Digit. Tech. 2, 29-40.
Aleksander, I., Thomas, W. V., and Bowden, P. A. 1984. WISARD: A radical step forward in image recognition. Sensor Rev. 4, 120-124.
Allinson, N. M., and Kolcz, A. 1993. Enhanced n-tuple approximators. Weightless Neural Network Workshop, 38-45.
Allinson, N. M., and Kolcz, A. 1994. Application of the CMAC input encoding scheme in the n-tuple approximation network. IEE Proc. Comput. Digit. Tech. 141(3), 177-183.
Austin, J. 1994. A review of RAM based neural networks. Proceedings of the Fourth International Conference on Microelectronics for Neural Networks and Fuzzy Systems, pp. 58-66. IEEE Computer Society Press, Turin.
Bledsoe, W. W., and Bisson, C. L. 1962. Improved memory matrices for the n-tuple pattern recognition method. IEEE Trans. Electron. Comput. 11, 414-415.
Bledsoe, W. W., and Browning, I. 1959. Pattern recognition and reading by machine. Proc. Eastern Joint Comput. Conf., 232-255.
Broomhead, D. S., and Lowe, D. 1988. Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321-355.
Flanagan, C., Rahman, M. A., and McQuade, E. 1992. A model for the behaviour of n-tuple RAM classifiers in noise. J. Intelligent Syst. 2, 187-224.
Gallant, S., and Smith, D. 1987. Random cells: An idea whose time has come and gone ... and come again? In IEEE International Conference on Neural Networks, M. Caudill and C. Butler, eds., pp. II-671-II-678. IEEE, San Diego.
Kanerva, P. 1988. Sparse Distributed Memory. MIT Press, Cambridge, MA.
Kolcz, A., and Allinson, N. 1994. Euclidean input mapping in an n-tuple approximation network. Proc. Sixth IEEE Digital Signal Process. Workshop, 285-289.
Luttrell, S. P. 1992. Gibbs distribution theory of adaptive n-tuple networks. In Artificial Neural Networks, 2, I. Aleksander and J. Taylor, eds., pp. 313-316. Elsevier, Amsterdam.
MacKay, D. 1992. Bayesian interpolation. Neural Comp. 4, 415-447.
Michie, D., Spiegelhalter, D. J., and Taylor, C. C. (eds.). 1994. Machine Learning, Neural and Statistical Classification. Prentice Hall, Englewood Cliffs, NJ.
Morciniec, M., and Rohwer, R. 1994. The theoretical and experimental status of the n-tuple classifier. Tech. Rep. NCRG/4336, CSAM, Aston University, Birmingham, UK.
Rohwer, R. 1995. Two Bayesian treatments of the n-tuple recognition method. Proc. IEEE 4th Int. Conf. Artificial Neural Networks (publication 409), 171-176.
Rohwer, R., and Cressy, D. 1989. Phoneme classification by Boolean networks. In Proceedings of the European Conference on Speech Communication and Technology, J. P. Tubach and J. J. Mariani, eds., pp. 557-560. CEP, Edinburgh, Scotland.
Rohwer, R., and Lamb, A. 1993. An exploration of the effect of super large n-tuples on single layer RAMnets. In Proceedings of the Weightless Neural Network Workshop '93, Computing with Logical Neurons, N. Allinson, ed., pp. 33-37. Department of Electronics, University of York, York, UK.
Stonham, T. J. 1977. Improved Hamming-distance analysis for digital learning networks. Electron. Lett. 13(6), 155-156.
Sutton, R. S., and Whitehead, S. D. 1993. Online learning with random representations. Tenth Int. Conf. Machine Learn. (ML 93), pp. 314-321. Morgan Kaufmann, San Mateo, CA.
Tarling, R., and Rohwer, R. 1993. Efficient use of training data in the n-tuple recognition method. Electron. Lett. 29(24), 2093-2094.
Tattersall, G. D., and Johnson, R. D. 1984. Speech recognisers based on n-tuple sampling. Proc. Inst. Acoust. Spring Conf., 405-413.
Tattersall, G. D., Foster, S., and Johnston, R. D. 1991. Single-layer lookup perceptrons. IEE Proc. F 138, 46-54.
Ullmann, J. R. 1969. Experiments with the n-tuple method of pattern recognition. IEEE Trans. Comput. 18(12), 1135-1137.
Ullmann, J. R., and Dodd, P. A. 1969. Recognition experiments with typed numerals from envelopes in the mail. Pattern Recog. 1, 273-289.
Communicated by Joshua Alspector
The Effects of Adding Noise During Backpropagation Training on a Generalization Performance

Guozhong An
Shell Research, P.O. Box 60, 2280 AB Rijswijk, The Netherlands

We study the effects of adding noise to the inputs, outputs, weight connections, and weight changes of multilayer feedforward neural networks during backpropagation training. We rigorously derive and analyze the objective functions that are minimized by the noise-affected training processes. We show that input noise and weight noise encourage the neural-network output to be a smooth function of the input or its weights, respectively. In the weak-noise limit, noise added to the output of the neural networks only changes the objective function by a constant. Hence, it cannot improve generalization. Input noise introduces penalty terms in the objective function that are related to, but distinct from, those found in the regularization approaches. Simulations have been performed on a regression and a classification problem to further substantiate our analysis. Input noise is found to be effective in improving the generalization performance for both problems. However, weight noise is found to be effective in improving the generalization performance only for the classification problem. Other forms of noise have practically no effect on generalization.

1 Introduction
A neural network that is determined solely on the basis of its training set often does not give satisfactory results when applied to new data. This is the problem of generalization, or of model selection in the context of data modeling. There are a number of approaches to solve the problem. The validation-set method is one such approach. It uses an extra set of data, the validation set, to select a neural network. On the basis of its performance on the validation set, a neural network is selected from those neural networks that have equivalent training-set performance. The shortcoming of the validation-set approach is that it is effective only if both the training set and the validation set are large and representative. Otherwise, the neural network selected will be biased by the validation and the training set, since the selected network has indirectly adapted itself to the validation set. A secondary deficiency of the validation-set approach is that it does not provide any strategy for finding the desired model. Neural Computation 8, 643-674 (1996)
© 1996 Massachusetts Institute of Technology
Regularization is another established method that addresses the problem of generalization (e.g., Poggio and Girosi 1990; Weigend et al. 1991; Guyon et al. 1992; Krogh and Hertz 1992). In the regularization approach, one adds a penalty term to the objective function of training. The penalty term serves as a constraint on the possible models. The success of the regularization approach depends on the specific form of the penalty term that is used and on the degree to which the penalty-term constraint is consistent with the underlying relation being sought. The weakness of this method is that the type of regularization is problem dependent; it is often not clear what form of regularization to use a priori. Recently, it has been observed that injecting noise into various parts of the neural network during backpropagation training can improve the generalization performance remarkably (e.g., Sietsma and Dow 1988; Weigend et al. 1991; Hanson 1990; Clay and Sequin 1992; Neuralware 1991; Murray and Edwards 1993; Rognvaldsson 1994). Most forms of introducing noise that are found in the literature belong to one of the following classes: (1) data noise that is added to the inputs and outputs of the neural network (e.g., Sietsma and Dow 1988; Weigend et al. 1991), (2) weight noise that is added to the weights of the neural networks (Hanson 1990; Neuralware 1991; Clay and Sequin 1992; Murray and Edwards 1993; Hinton and van Camp 1993), and (3) Langevin noise that is added to the weight changes (Rognvaldsson 1994). In this paper, we investigate the mechanisms that have led to the observed better generalization performance. Using results from stochastic optimization and statistical mechanics, we derive and analyze the cost functions that are minimized by the posttraining weight vectors. In particular, we aim to determine the conditions under which each type of noise addition will contribute positively to the generalization performance.
We have treated output noise and input noise on an equal footing. Both mean-square and cross-entropy error functions are considered. Our analysis of the weight noise is limited to the mean-square error function. The conclusions we reached for the Langevin noise do not depend on the specific form of the error function. Simulations have been performed on a regression and a classification problem to verify our theoretical predictions. Input noise that is added to the inputs of neural networks has been previously studied with the mean-square error function. Interpreting the addition of noise to the inputs as generating additional training data, Holmstrom and Koistinen (1992) showed that the method is asymptotically consistent, i.e., as the number of training examples increases to infinity and the variance of the noise decreases to zero, training with input noise is equivalent to minimizing the true error function. Closer to the spirit of our work is that of Reed et al. (1992), Matsuoka (1992), and Bishop (1995). Without detailed analysis, Reed et al. and Bishop assume that backpropagation training with input noise converges to a kind of average of a stochastic error function. On the basis of a Taylor expansion of the expected error function, they concluded that training with input noise
is equivalent to a form of regularization. Matsuoka proposes to minimize an objective function that is formed by adding a sensitivity measure to the standard error function. He then asserts that this objective function is minimized by the backpropagation algorithm with input noise. Using a recently proved convergence theorem on stochastic gradient descent, we have rigorously derived the objective function that is minimized by backpropagation training with input noise. It takes the form of the expectation of the noise-affected error function over the noise distribution. Performing a careful analysis of this objective function, we are led to contradict Reed et al. (1992), Matsuoka (1992), and Bishop (1995). Even in the weak-noise limit, training with input noise is not equivalent to a regularization method. The difference lies in a noise-induced term in the objective function that depends on the fitting residues. This contribution to the objective function escaped the notice of Reed et al. and Matsuoka. Although the residue-dependent contribution was found by Bishop (1995), he concluded that such a term vanishes at the end of training, which is true only in the case of an infinite training set. In our simulation, we found that the importance of this term is comparable to that of the regularization term. We do not dispute that the main mechanism for the improved generalization of input noise is the noise's smoothing effect. We merely point out that input noise in fact smooths the neural-network function in a way that is different from that of the regularization approach. Murray and Edwards (1994) attempted to analyze the improvement of generalization that they observed on two classification problems. On the basis of a heuristic analysis and the results of their case studies, they believed that adding weight noise to all training algorithms should, in general, have a positive effect on generalization.
By treating weight noise as a special type of noise added to the output of the neural network, we are able to rigorously analyze the addition of zero-mean, constant-variance gaussian and uniform noise to the weights during on-line backpropagation training. Our theoretical analysis and simulation results, however, do not demonstrate that weight noise improves generalization in all cases. In the following section, we introduce the definitions and notations that we rely on in the main sections of this paper. The effects of the data noise and the weight noise are analyzed in Sections 3 and 4, respectively. In Section 5, we study the Langevin noise. Two illustrative examples are presented in Section 6 to highlight the results obtained in Sections 3-5. We present our conclusions in Section 7.

2 Definitions
We consider multilayer feed-forward neural networks with input vector x and, without loss of generality, a scalar output f(x, w). Let $o_i$ be the output and $\theta_i$ the bias of the ith neuron, $w_{ij}$ the weight connection from neuron j to neuron i, and $h(\cdot)$ the transfer function. At the input layer, we have $o_i = x_i$. For the hidden and output layers, we have

$$o_i = h\Big(\sum_j w_{ij}\, o_j + \theta_i\Big) \qquad (2.1)$$

We shall refer to the $w_{ij}$s and $\theta_i$s collectively as weights, and denote them by the vector w. Denote the training set that consists of N input-output pairs by $\{z^\mu = (x^\mu, y^\mu) \mid \mu = 1, 2, \ldots, N\}$. Let $e(z^\mu, w)$ be the error function for the $\mu$th training example. Let

$$E(w) = \frac{1}{N} \sum_{\mu=1}^{N} e(z^\mu, w) \qquad (2.2)$$

be the total error function of the training set. The normal backpropagation training algorithm (Rumelhart et al. 1986) and its on-line version are both described by the following weight-update equation:

$$w_{t+1} = w_t + \Delta w_t \qquad (2.3)$$

where $w_t$ is the weight value and $\Delta w_t$ the weight change at iteration t. For the normal backpropagation training, the weight change at iteration t is given by $\Delta w_t = -\eta_t \nabla E(w)$, where $\eta_t$ is the learning rate. In the case of the on-line backpropagation training algorithm,

$$\Delta w_t = -\eta_t \nabla_w e(z, w) \qquad (2.4)$$

where z is randomly drawn from the training set $\{z^1, z^2, \ldots, z^\mu, \ldots, z^N\}$ at each iteration. The normal backpropagation training minimizes the error function E(w) by steepest descent. Under appropriate conditions, it can be shown that the on-line backpropagation algorithm minimizes the error function E(w) by stochastic gradient descent (White 1989). The three types of noise mentioned in the previous section are described by the following substitution rules in the computation of the weight changes $\Delta w$:

1. data noise: $z^\mu \to z^\mu + \zeta$ (2.5)
2. weight noise: $w \to w + \xi$ (2.6)
3. Langevin noise: $\Delta w \to \Delta w + \xi$ (2.7)

Here, $\zeta$ denotes a noise vector that has the same dimensions as z, whereas $\xi$ has the same dimensions as w.
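As a concrete illustration, the weight update of equation 2.3 and the three substitution rules can be sketched in code. The linear model, the helper names (`grad_e`, `online_step`), and all constants below are our own illustrative choices, not part of the text; each noise component is taken to be gaussian with variance 2T.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_e(z, w):
    # Gradient of the quadratic per-example error e(z, w) = (y - w.x)^2 / 2
    # for a linear model f(x, w) = w.x standing in for the network.
    x, y = z
    return -(y - w @ x) * x

def online_step(w, z, eta, noise=None, T=0.01):
    """One on-line update w <- w + Dw (equations 2.3-2.4), with the three
    noise substitution rules applied one at a time; each noise component
    has variance 2T, matching the text's convention."""
    sd = np.sqrt(2 * T)
    x, y = z
    if noise == 'data':                        # z -> z + zeta
        x = x + rng.normal(0.0, sd, size=x.shape)
        y = y + rng.normal(0.0, sd)
    if noise == 'weight':                      # w -> w + xi
        g = grad_e((x, y), w + rng.normal(0.0, sd, size=w.shape))
    else:
        g = grad_e((x, y), w)
    dw = -eta * g
    if noise == 'langevin':                    # Dw -> Dw + xi
        dw = dw + rng.normal(0.0, sd, size=w.shape)
    return w + dw
```

With `noise=None` this is plain on-line gradient descent; passing `'data'`, `'weight'`, or `'langevin'` applies one substitution rule at a time.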
3 Training with Data Noise

3.1 The Cost Function. The following major steps are involved in injecting data noise into the on-line backpropagation training algorithm:

1. Select a training example $z^\mu = (x^\mu, y^\mu)$ randomly out of the N training examples.
2. Draw a sample noise vector $\zeta^\mu$ from a density $p(\zeta)$.
3. Set $z = z^\mu + \zeta^\mu$.

With this procedure, the probability density of generating a particular data point $z = (x, y)$ from the $\mu$th training example $z^\mu$ is thus $p(\zeta^\mu) = p(z - z^\mu)$. The total probability density of z being generated from the complete training set is then

$$g(z) = \frac{1}{N} \sum_{\mu=1}^{N} p(z - z^\mu) \qquad (3.1)$$
Let c(u, w) be a continuous and differentiable function of w, and let u be a random vector sampled from the distribution g(u). Under general conditions, Bottou (1991) has shown that stochastic gradient descent, i.e., the iteration

$$w_{t+1} = w_t - \eta_t \nabla_w c(u, w) \qquad (3.2)$$

converges to a minimum of the expectation value of c(u, w), i.e., to

$$C(w) = \int c(u, w)\, g(u)\, du \qquad (3.3)$$

provided that $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$. The on-line backpropagation training algorithm is a special case of the stochastic gradient descent method in which the function c is given by the error function e(z, w), and the data density g is given by

$$g(z) = \sum_\mu N^{-1} \delta(z - z^\mu) \qquad (3.4)$$

The delta function $\delta$ in equation 3.4 is the Dirac delta function. It follows from equation 3.3 that the on-line backpropagation training converges to a minimum of E(w). Combining equation 2.4 with equations 3.1-3.3, it follows that training with data noise minimizes not E(w) but

$$\tilde{E}(w) = \int e(z, w)\, g(z)\, dz \qquad (3.5)$$

In passing, we remark that equation 3.1 can be viewed as a kernel estimation of the true density of the data based on the training set (Holmstrom and Koistinen 1992). The distribution of the noise serves as the kernel function. The main difference between E(w) and $\tilde{E}(w)$ is that the former measures the error over a finite number of training examples, whereas the latter measures the error over an infinite data set that is described by the continuous density g(z). Those data that are described by g(z) but are not contained in the training set can be considered as synthetic data generated by the noise. Such synthetic data supplement the training set and dilute the importance of an individual example contained in the training set. Thus, noise injection could prevent the neural network from overfitting the training set and may result in neural networks that are insensitive to noise in the data. We see that training with data noise effectively minimizes a cost function that differs from the standard error function E(w); the injected noise implicitly alters the training objective function. There are a number of methods that explicitly modify the cost function to improve generalization. A widely used method to construct a new cost function C(w) is to add a penalty function P(w) to the standard training error E(w):
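The synthetic-data reading of equation 3.1 can be checked numerically: picking $\mu$ uniformly and adding a noise draw $\zeta$ samples exactly from the mixture g(z), so the sample mean matches the training mean and the sample variance is the training variance plus the noise variance 2T. The toy one-dimensional training values below are our own.

```python
import numpy as np

rng = np.random.default_rng(1)

# 1-d training points z^mu and gaussian noise of variance 2T: sampling mu
# uniformly and adding zeta draws from g(z) = (1/N) sum_mu p(z - z^mu).
z_train = np.array([-1.0, 0.0, 0.5, 2.0])
T = 0.05
idx = rng.integers(0, z_train.size, 100_000)
samples = z_train[idx] + rng.normal(0.0, np.sqrt(2 * T), 100_000)

# Mixture moments: mean of the data; variance of the data plus 2T.
assert abs(samples.mean() - z_train.mean()) < 0.02
assert abs(samples.var() - (z_train.var() + 2 * T)) < 0.03
```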
$$C(w) = E(w) + \lambda P(w) \qquad (3.6)$$

where $\lambda$ is a positive constant. Training is then achieved by minimizing C(w). It is a standard result of variational calculus (see, e.g., Arfken 1985) that the vector that minimizes C(w) also minimizes E(w) under a constraint of the form

$$P(w) = \text{constant} \qquad (3.7)$$

where the value of the constant depends on $\lambda$. The specific form of the penalty function depends on one's assumptions. In intuitive approaches, such as weight decay (Hinton 1986) and weight elimination (Weigend et al. 1991), P represents an intuitive measure of the size of the neural network. In the maximum a posteriori probability approach, P(w) represents the negative of the logarithm of the a priori probability of finding a set of weights w (MacKay 1992; Nowlan and Hinton 1992). In an information-theoretic approach, such as the minimum description length approach, P(w) would represent the description length of the model (Rissanen 1989; Kendall and Hall 1993). In the regularization approach as applied to inverse problems (Poggio and Girosi 1990), P(w) represents a smoothness measure. Minimizing C(w) can lead to better generalization than does minimizing E(w), but only if C(w) contains a constraint that is consistent with the underlying data-generating process. To gain more insight into $\tilde{E}$, and to make connections with other approaches, we compute the difference between $\tilde{E}$ and the standard error E. This difference corresponds to the penalty function of equation 3.6.
3.2 Penalty Function Induced by the Data Noise. Let us define a penalty function P by

$$\tilde{E}(w) = E(w) + P(w) \qquad (3.8)$$

Combining equations 3.1, 3.5, and 3.8 (and for brevity suppressing the argument w in e), we have

$$P = \frac{1}{N} \sum_\mu \Big[ \int e(z)\, p(z - z^\mu)\, dz - e(z^\mu) \Big] \qquad (3.9)$$

Replacing the integration variable z by $\zeta = z - z^\mu$, we obtain

$$P = \frac{1}{N} \sum_\mu \Big[ \int e(z^\mu + \zeta)\, p(\zeta)\, d\zeta - e(z^\mu) \Big] \qquad (3.10)$$

Denoting the average over the noise distribution $p(\zeta)$ by $\langle \cdot \rangle_\zeta$, we can rewrite P as

$$P = \frac{1}{N} \sum_\mu \big[ \langle e(z^\mu + \zeta) \rangle_\zeta - e(z^\mu) \big] \qquad (3.11)$$

Expanding $e(z + \zeta)$ as a Taylor series in $\zeta$, we have

$$e(z + \zeta) = e(z) + \frac{\partial e(z)}{\partial z_i}\, \zeta_i + \frac{1}{2} \frac{\partial^2 e(z)}{\partial z_i \partial z_j}\, \zeta_i \zeta_j + \cdots + \frac{1}{n!} (\zeta_i \partial_i)^n e(z) + \cdots \qquad (3.12)$$

where $\partial_i$ denotes the partial derivative with respect to $z_i$, and summation over repeated Latin indices is implied. To proceed further, we assume the following: (1) different components of the noise are independent; (2) the noise distribution is symmetric about zero, which implies that all the odd-order moments, including the mean, vanish. Stated mathematically, $\langle \zeta_i^n \rangle = 0$ when n is odd. With these conditions, we have

$$\langle e(z + \zeta) \rangle_\zeta = e(z) + T_i\, \partial_i^2 e(z) + R \qquad (3.13)$$

where $2T_i$ represents the variance of the ith component of the noise, i.e., $\langle \zeta_i^2 \rangle = 2T_i$, and R denotes the remainder. The remainder R is given by

$$R = \sum_{n \ge 4,\ n\ \mathrm{even}}\ {\sum}' \frac{\langle \zeta_1^{n_1} \rangle \cdots \langle \zeta_m^{n_m} \rangle}{n_1! \cdots n_m!}\, \partial_1^{n_1} \cdots \partial_m^{n_m}\, e(z) \qquad (3.14)$$

where m is the dimension of z. The summation $\sum'$ is over all possible combinations of (even) $n_i$ that satisfy $\sum_i n_i = n$. Assuming that the remainder R is negligible compared to the first term, we have

$$\langle e(z + \zeta) \rangle_\zeta \approx e(z) + T_i\, \partial_i^2 e(z) \qquad (3.15)$$
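Equation 3.13 can be verified numerically for an error surface with a known Hessian; since the test function below (our own choice, not from the paper) is quadratic, the remainder R vanishes exactly and the agreement is limited only by sampling error.

```python
import numpy as np

rng = np.random.default_rng(2)

def e(z):
    # A test error surface with known diagonal Hessian entries (2, 4):
    # e(z) = z0^2 + 3*z0*z1 + 2*z1^2.  The cross term averages out since
    # the noise components are independent with zero mean.
    return z[..., 0]**2 + 3 * z[..., 0] * z[..., 1] + 2 * z[..., 1]**2

z = np.array([0.3, -0.7])
T = 0.05                                 # noise variance is 2T per component
zeta = rng.normal(0.0, np.sqrt(2 * T), size=(200_000, 2))

mc = e(z + zeta).mean() - e(z)           # Monte-Carlo <e(z+zeta)> - e(z)
prediction = T * (2 + 4)                 # sum_i T_i d^2 e / dz_i^2
assert abs(mc - prediction) < 0.01
```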
Replacing z by $z^\mu$ in equation 3.15 and substituting it into equation 3.11, we obtain

$$P \approx \frac{1}{N} \sum_\mu T_i\, \partial_i^2 e(z^\mu) \qquad (3.16)$$

In the weak-noise limit, i.e., when $T_i \ll 1$, we can show that R is indeed negligible for two popular noise distributions: the gaussian distribution and the uniform distribution. We have, for the uniform distribution that is constant within an interval $[-a, a]$ and zero outside,

$$\langle \zeta^n \rangle = \begin{cases} a^n/(n+1) & \text{if } n \text{ is even} \\ 0 & \text{if } n \text{ is odd} \end{cases} \qquad (3.17)$$

In the case of gaussian noise (see, e.g., Kendall and Stuart 1977), the moments $\langle \zeta^n \rangle$ are given by

$$\langle \zeta^n \rangle = \begin{cases} (n-1)!!\, (2T)^{n/2} & \text{if } n \text{ is even} \\ 0 & \text{if } n \text{ is odd} \end{cases} \qquad (3.18)$$

Using the above expressions for the moments, we can evaluate the higher-order terms contained in R. It is evident that they are of higher order in $T_i$ and hence negligible in the limit $T \ll 1$. Defining a Hessian matrix H by $H_{ij}(z) = \partial^2 e(z)/\partial z_i \partial z_j$, and for simplicity assuming $T_i = T$, we can rewrite equation 3.16 in a more compact form as

$$P \approx \frac{T}{N} \sum_\mu \mathrm{Tr}\, H(z^\mu) \qquad (3.19)$$

where Tr is the trace operator. Next, we apply equation 3.16 to the two most popular forms of the error functions, i.e., the quadratic and cross-entropy error functions, which are given, respectively, by

$$e_q(z, w) = \frac{1}{2} \big[ y - f(x, w) \big]^2 \qquad (3.20)$$

$$e_c(z, w) = -y \ln f(x, w) - (1 - y) \ln \big[ 1 - f(x, w) \big] \qquad (3.21)$$

Substituting the quadratic error function of equation 3.20 into equation 3.16 and denoting $f(x^\mu, w)$ by $f_\mu$, we have, after some algebraic manipulation,

$$P = P_0 + P_1 + P_2 \qquad (3.22)$$

where

$$P_0 = T \qquad (3.23)$$

$$P_1 = \frac{T}{N} \sum_\mu \sum_i \Big( \frac{\partial f_\mu}{\partial x_i} \Big)^2 \qquad (3.24)$$

$$P_2 = \frac{T}{N} \sum_\mu (f_\mu - y^\mu) \sum_i \frac{\partial^2 f_\mu}{\partial x_i^2} \qquad (3.25)$$

For simplicity, we have assumed here again that the strength of the noise is T for all components. If we take the cross-entropy error function of equation 3.21 instead of the quadratic error function, the penalty function again takes the form of equation 3.22 with

$$P_0 = 0 \qquad (3.26)$$

$$P_1 = \frac{T}{N} \sum_\mu \Big[ \frac{y^\mu}{f_\mu^2} + \frac{1 - y^\mu}{(1 - f_\mu)^2} \Big] \sum_i \Big( \frac{\partial f_\mu}{\partial x_i} \Big)^2 \qquad (3.27)$$

$$P_2 = \frac{T}{N} \sum_\mu \frac{f_\mu - y^\mu}{f_\mu (1 - f_\mu)} \sum_i \frac{\partial^2 f_\mu}{\partial x_i^2} \qquad (3.28)$$

In line with our assumption, each component of the noise contributes to P independently, as can be seen from equation 3.16. The $P_0$ term in equation 3.22 is due to the noise on the desired output; the $P_1$ and $P_2$ terms are due to noise in the input. Because the penalty induced by noise on the desired output, $P_0$, does not depend on w, it has no influence on the minimum of the cost function. Therefore, noise in the desired output does not influence the neural-network function f(x, w) obtained at the end of a training procedure. Zero-mean and constant-variance noise added to the desired output thus has no effect on generalization. In contrast to $P_0$, the penalty terms induced by the input noise, $P_1$ and $P_2$, are w dependent. The term $P_1$ of equation 3.24 is obviously positive. Because the cross-entropy error function is only used for classification problems, for which $y, f(x, w) \in [0, 1]$, it is easy to show that $P_1$ of equation 3.27 is also positive. The $P_1$ terms favor a slowly varying input-output mapping f(x, w) and penalize fast-varying ones. Reed et al. (1992) and Matsuoka (1992) previously obtained the $P_1$ term of equation 3.24. However, owing to an inconsistency in their approaches, they failed to find the $P_2$ term. Bishop (1995) also obtained the $P_2$ term, but he concluded that the $P_2$ term vanishes. However, his argument crucially depends on the fact that the conditional probability $p(y \mid x)$ of a target output y given an input x is known at an arbitrary input x, which is true only in the case of an infinite training set. His conclusion therefore does not apply to problems for which one has only a finite training set. The penalty term $P_1$ induced by the input noise closely resembles
the penalty terms that are commonly used in the regularization methods (Poggio and Girosi 1990). In particular, the $P_1$ terms of equation 3.24 and equation 3.27 take the form of $\|Lf\|^2$, where L is a differentiation operator and $\|\cdot\|$ represents a norm in a function space. $P_2$ depends both on the fitting residues $f(x, w) - y$ and the second-order derivatives of f with respect to x; it can, in general, be either positive or negative. It is the $P_2$ term that distinguishes training with input noise from a regularization approach. In general, the importance of $P_1$ should outweigh that of $P_2$ whenever the fitting residues are reasonably small or the function is very smooth. In this case, the penalty term induced by the input noise measures the sensitivity of the neural-network output against a small variation in the input. The noise strength balances the task of fitting the training examples on the one hand and the task of not overfitting them on the other hand. Noise in the inputs hence constrains the neural-network output to be a smooth function of its input. It is worthwhile to point out that Drucker and Le Cun (1992) recently proposed a procedure by the name "double backpropagation" to minimize the sum of E and the $P_1$ term in equation 3.22. It is thus related to on-line learning with input noise in the case where the variance of the noise is small and the neural network has learned sufficiently from the training set.
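For a concrete check of the quadratic-error decomposition in equations 3.22-3.25, the predicted penalty can be compared against a direct Monte-Carlo average of the noisy error for a small nonlinear model (a single tanh unit; all values below are our own illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

# Model f(x, w) = tanh(w.x), quadratic error (equation 3.20), one example.
x = np.array([0.5, -0.3])
w = np.array([1.0, 2.0])
y = 0.2
T = 0.002                                   # noise variance 2T per component

f = np.tanh(w @ x)
grad_f = (1 - f**2) * w                     # df/dx_i
lap_f = -2 * f * (1 - f**2) * (w**2).sum()  # sum_i d^2 f / dx_i^2

# Penalty predicted by equations 3.23-3.25 (single example, N = 1):
P0 = T                                      # output-noise term
P1 = T * (grad_f**2).sum()                  # regularizer-like term
P2 = T * (f - y) * lap_f                    # residue-dependent term

# Direct Monte-Carlo estimate of the penalty, equation 3.11:
n = 1_000_000
zx = rng.normal(0.0, np.sqrt(2 * T), size=(n, 2))
zy = rng.normal(0.0, np.sqrt(2 * T), size=n)
e_noisy = 0.5 * (y + zy - np.tanh((x + zx) @ w))**2
P_mc = e_noisy.mean() - 0.5 * (y - f)**2
assert abs(P_mc - (P0 + P1 + P2)) < 0.1 * (P0 + P1 + P2)
```

The 10% tolerance absorbs both sampling error and the neglected remainder R, which is of higher order in T for this weak noise level.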
4 Training with Weight Noise
We have seen that on-line backpropagation training with data noise is equivalent to enlarging the training set with synthetic data. The added synthetic data force the neural-network function to be less sensitive to input variations. Therefore, they avoid overfitting and possibly improve generalization. In this section, we study the effect of weight noise on generalization with particular reference to the quadratic error function. The convergence property of training with weight noise turns out to be difficult to analyze. In the following, we bypass the convergence analysis and directly compute the weight-noise induced penalty terms using equation 3.11. The justification for this is that weight noise, in the weak limit, can be treated as a type of output noise, albeit an unusual one. Weight noise, like data noise, affects the $\Delta w$ in the weight-update equation only through the neural-network output function f(x, w) and its derivatives with respect to w. The neural-network output f(x, w) appears in the quadratic error function only in the form of $y^\mu - f(x^\mu, w)$. Hence, any change of $f(x^\mu, w)$ can always be replaced by a change of $y^\mu$. In this way, we can treat the weight noise as a special type of noise in the desired output. In the presence of weight noise $\xi$, the neural-network output is given
by $f(x, w + \xi)$. Expanding $f(x, w + \xi)$ as a power series in $\xi$, we obtain

$$f(x, w + \xi) = f(x, w) + \frac{\partial f}{\partial w_i}\, \xi_i + \frac{1}{2} \frac{\partial^2 f}{\partial w_i \partial w_j}\, \xi_i \xi_j + \cdots \qquad (4.1)$$

where summation over repeated indices is implied. Denoting by $\zeta_0$ the noise in the desired output that is induced by the weight noise (and using again the implied summation convention), we have

$$\zeta_0 = -\frac{\partial f}{\partial w_i}\, \xi_i - \frac{1}{2} \frac{\partial^2 f}{\partial w_i \partial w_j}\, \xi_i \xi_j - \cdots \qquad (4.2)$$

Assuming the different components of the noise are independently identically distributed, and have a zero mean and variance $2T_i$, we obtain to the leading order in $T_i$,

$$\langle \zeta_0 \rangle = -T_i\, \frac{\partial^2 f}{\partial w_i^2} \qquad (4.3)$$

$$\langle \zeta_0^2 \rangle = 2T_i \Big( \frac{\partial f}{\partial w_i} \Big)^2 \qquad (4.4)$$

For gaussian noise and uniform noise, it can be shown (much as we did in the previous section) that the terms that are neglected in computing $\langle \zeta_0 \rangle$ and $\langle \zeta_0^2 \rangle$ are indeed of higher order in T. Repeating the steps in the calculations leading from equation 3.11 to equation 3.16 and noting that $\zeta_0$ now has a nonzero mean, we obtain an induced penalty term

$$P = \frac{1}{N} \sum_\mu \Big[ \langle \zeta_0 \rangle_\mu \frac{\partial e}{\partial y} + \frac{1}{2} \langle \zeta_0^2 \rangle_\mu \frac{\partial^2 e}{\partial y^2} \Big] \qquad (4.5)$$

In deriving equation 4.5 we implicitly assumed that all the higher moments of $\zeta_0$, i.e., $\{\langle \zeta_0^n \rangle : n > 2\}$, are negligible compared to the mean and variance of $\zeta_0$ in the limit $T \ll 1$. This can be verified for the first few higher moments; however, we do not have a general proof. Substituting equation 3.20 into equation 4.5, and for brevity denoting $f(x^\mu, w)$ by $f_\mu$, we have

$$P = \frac{T}{N} \sum_\mu \Big[ -(y^\mu - f_\mu) \sum_i \frac{\partial^2 f_\mu}{\partial w_i^2} + \sum_i \Big( \frac{\partial f_\mu}{\partial w_i} \Big)^2 \Big] \qquad (4.6)$$

The first term of equation 4.6 depends on the fitting residues. It does not have a definite sign. The second term in equation 4.6 is clearly always positive. In the case of relatively small fitting residues, the positive definite term outweighs the other term in equation 4.6. In such a case, the weight noise penalizes weight configurations leading to neural networks that have large sensitivity with respect to weight variations. Therefore, training with weight noise should increase the tolerance of neural networks to faults in the weight connections (Clay and Sequin 1992; Murray
and Edwards 1994). The fault-tolerance property is particularly relevant to the hardware implementation of neural-network weights using analog circuits, in which one must deal with possible electronic noise. Note the similarity between equation 4.6 and equations 3.24 and 3.25. This can be understood from the fact that both the input noise and the weight noise propagate through the neural network, affecting the $\Delta w$ through the neural-network output function f(x, w) in analogous ways. Despite such formal similarity between input noise and weight noise, their effects on the generalization can be rather different. Introducing a roughness penalty with respect to input variations is familiar to those following the regularization approach. However, applying a roughness penalty with respect to parameter (weight) variations is less usual from the viewpoint of regularization, since it is the input-output mapping, not the weight-output mapping, that directly determines the generalization performance. In particular, smoothness in the weight-output mapping is not equivalent to smoothness in the input-output mapping. To illustrate the above points, let us consider a neural network that has a linear output unit and no hidden layers. Denoting the weights by w and with the convention $x_0 = 1$, we have $f(x, w) = x \cdot w$. According to equations 3.24 and 4.6, the penalty terms that are induced, respectively, by the input noise and the weight noise take the following form:

$$P_{\mathrm{input}} = T \sum_i w_i^2, \qquad P_{\mathrm{weight}} = \frac{T}{N} \sum_\mu \sum_i (x_i^\mu)^2 \qquad (4.7)$$

The input-noise penalty term in equation 4.7 coincides with that used in weight decay and in ridge regression. In contrast to the input-noise penalty, the weight-noise penalty term is a constant; hence, it has no effect on the generalization performance. To infer how weight noise influences the input-output mapping and hence the generalization performance of a network with hidden layers, we need to compute the penalty terms of equation 4.6 in more detail. Noticing that each component of the noise contributes to the penalty P of equation 4.6 additively, their contributions may be separately computed. The weight-noise penalty can thus be expressed as a sum $P_o + P_h$, where $P_o$ represents the contribution from output-layer noise (noise added to the fan-in weights of the output layer) and $P_h$ the contribution from hidden-layer noise (noise added to the fan-in weights of the hidden layers). In the following, we compute $P_o$ and $P_h$.
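Before turning to $P_o$ and $P_h$, the linear-network observation around equation 4.7 can be verified directly: for $f(x, w) = x \cdot w$ with quadratic error, the Monte-Carlo weight-noise penalty equals $(T/N) \sum_\mu |x^\mu|^2$ at any w. The data and the two test weight vectors below are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(4)

# Linear network f(x, w) = x.w; equation 4.7 predicts a w-independent
# weight-noise penalty (T/N) sum_mu |x^mu|^2.
X = rng.normal(size=(50, 3))
Y = rng.normal(size=50)
T = 0.01
n = 50_000

def weight_noise_penalty(w):
    # Monte-Carlo average of e(z, w + xi) - e(z, w) over the noise xi
    # and the training set, with quadratic error (equation 3.20).
    xi = rng.normal(0.0, np.sqrt(2 * T), size=(n, 3))
    f_noisy = X @ (w[:, None] + xi.T)          # shape (50, n)
    e_noisy = 0.5 * (Y[:, None] - f_noisy)**2
    e_clean = 0.5 * (Y - X @ w)**2
    return e_noisy.mean() - e_clean.mean()

pred = T * (X**2).sum(axis=1).mean()           # (T/N) sum_mu |x^mu|^2
assert abs(weight_noise_penalty(np.zeros(3)) - pred) < 0.01
assert abs(weight_noise_penalty(np.array([1.0, -2.0, 0.5])) - pred) < 0.01
```

Because f is linear in w, the expansion of equation 4.1 terminates and the agreement is limited only by sampling error.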
4.1 Weight Noise Added to the Output Layer. Consider now an output unit. Let $w_0$ be its bias and $w_j$ the weight connecting it to the jth hidden unit. Let I be its sum of inputs including the bias and h(I) its transfer function. For the output unit, we have $f(x, w) = h(I)$, where $I = \sum_j a_j w_j$ and $a_0 = 1$. We thus have

$$\frac{\partial f}{\partial w_j} = h'(I)\, a_j \qquad (4.8)$$

where the prime denotes differentiation with respect to I. Notice that the outputs of the hidden neurons, the $a_j$s, do not depend on the fan-in weights of the output unit, the $w_j$s. A second differentiation therefore leads to

$$\frac{\partial^2 f}{\partial w_j^2} = h''(I)\, a_j^2 \qquad (4.9)$$

Substituting equations 4.8 and 4.9 into equation 4.6, we obtain

$$P_o = \frac{T}{N} \sum_\mu \big[ h'(I_\mu)^2 - (y^\mu - f_\mu)\, h''(I_\mu) \big] \sum_j \big[ a_j(x^\mu) \big]^2 \qquad (4.10)$$

where we have made the dependence of $a_j$ on the training patterns explicit. The j = 0 term in equation 4.10 is due to the output-bias noise. Assuming $|h''(I)(y^\mu - f_\mu)| < h'(I)^2$, it follows from equation 4.10 that the output-layer noise favors small activations at the hidden units and small
derivatives at the output unit. The penalty term in equation 4.10 charges each hidden unit a penalty in proportion to the square of its activation, $|a_j|^2$. A hidden unit with a very small activation $a_j$ would contribute little to the network output. In this way, weight noise limits the number of hidden units that can be used to fit the training data. Equation 4.10 further shows that output-layer noise favors neural networks that have small output derivatives $|h'(I)|$. For a sigmoidal output unit [$h(I) = \tanh(I)$ or logistic], the output derivative is smallest when the output unit operates close to its two saturation states, i.e., $|I| \gg 1$. The desire to have large I (small output derivatives), however, is not in harmony with the desire to have small hidden-layer activations, since $I = \sum_j a_j w_j$. If the output unit is linear, which is often the case for regression problems, the derivative factor is irrelevant since $h'(I) = 1$ and $h''(I) = 0$. The only effect of the output-layer noise is then to reduce hidden-layer activations. In such a case, the output-bias noise contributes a constant term T to $P_o$ and has no effect on generalization.

4.2 Weight Noise Added to the Hidden Layer. For simplicity, we consider neural networks that have only one hidden layer. The extension to more hidden layers is straightforward. Denote the weight connection between the ith input unit and the jth hidden unit by $w_{ji}$. In a way similar to that in which the error gradient $\partial e(w)/\partial w$ in the backpropagation algorithm (Rumelhart et al. 1986) is computed using the chain rule of differentiation, we have

$$\frac{\partial f}{\partial w_{ji}} = h'(I)\, d_j\, H'(I_j)\, x_i \qquad (4.11)$$
where $H(\cdot)$ is the transfer function in the hidden layer, $d_j$ the weight connection between hidden unit j and the output unit, and $I_j = \sum_i w_{ji} x_i$. A second differentiation yields

$$\frac{\partial^2 f}{\partial w_{ji}^2} = \big[ h''(I)\, d_j^2\, H'(I_j)^2 + h'(I)\, d_j\, H''(I_j) \big]\, x_i^2 \qquad (4.12)$$

Combining equations 4.11 and 4.12 with equation 4.6 and carrying out the summation over the inputs, we have
$$P_h = \frac{T}{N} \sum_\mu |x^\mu|^2 \sum_j \Big\{ h'(I_\mu)^2\, d_j^2\, H'(I_j^\mu)^2 - (y^\mu - f_\mu) \big[ h''(I_\mu)\, d_j^2\, H'(I_j^\mu)^2 + h'(I_\mu)\, d_j\, H''(I_j^\mu) \big] \Big\} \qquad (4.13)$$

where the summation over j is over all the hidden units, and that over $\mu$ over all the training patterns. In the following, we assume that $|y^\mu - f_\mu| \le 1$ for most of the training examples, and investigate the effect of $P_h$ accordingly. We see from equation 4.13 that hidden-layer noise penalizes large derivatives and large weights at the output layer as well as large derivatives at the hidden layer. Large derivatives at the output layer are already penalized by the output-layer noise; hidden-layer noise thus reinforces such a penalty. Penalizing large weights at the output layer is the same as favoring small outgoing weights at the hidden layer. A hidden unit that has both a small outgoing weight $d_j$ and a small activation $a_j$, which is encouraged by the output-layer noise, has a negligible contribution to the network output. Therefore, the combined effect of penalizing large output-layer weights and large hidden-layer activations is to encourage the neural network to use fewer hidden units. However, penalizing large derivatives at the hidden layer works against penalizing large activations at the hidden layer, which is imposed by the output-layer noise, since small derivatives, i.e., small $|H'(I_j)|$, are accompanied by large outputs $|a_j| = |H(I_j)|$. It is plausible that those hidden units that have a relatively large activation, and are thus essential in fitting the training data, are encouraged to operate close to their two saturation states by hidden-layer noise. Those hidden units that have a relatively small activation, and are thus less essential in fitting the training data, are likely to be made redundant. We note also that the hidden-layer penalty is weighted by the squared length of the input vector, $|x^\mu|^2$. Therefore, the effects of the hidden-layer noise could be more pronounced at large inputs than at small inputs.
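The layer-wise penalties admit the same kind of numerical check as the data-noise terms. Here is one for the output-layer expression of equation 4.10, with a single training example; the small network and all constants below are our own construction, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)

# One hidden layer, tanh output unit; noise is added only to the
# output-layer fan-in weights w (the bias enters via a_0 = 1).
x = np.array([0.5, -1.0])
V = np.array([[1.0, 0.5], [-0.5, 1.0]])       # hidden-layer weights (fixed)
a = np.concatenate(([1.0], np.tanh(V @ x)))   # a_0 = 1, then hidden outputs
w = np.array([0.2, 1.0, -0.5])                # output bias + output weights
y = 0.4
T = 0.002                                     # noise variance 2T per weight

I = w @ a
f = np.tanh(I)
h1 = 1 - f**2                                 # h'(I) for h = tanh
h2 = -2 * f * h1                              # h''(I)

# Output-layer penalty of equation 4.10 (single example, N = 1):
pred = T * ((h1**2 - (y - f) * h2) * a**2).sum()

# Direct Monte-Carlo average of e(z, w + xi) - e(z, w):
n = 1_000_000
xi = rng.normal(0.0, np.sqrt(2 * T), size=(n, 3))
e_noisy = 0.5 * (y - np.tanh((w + xi) @ a))**2
P_mc = e_noisy.mean() - 0.5 * (y - f)**2
assert abs(P_mc - pred) < 0.15 * pred
```

The tolerance again covers sampling error plus the higher-order-in-T terms dropped in equations 4.3-4.5.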
In summary, the main effects of weight noise are (1) reducing the number of hidden units, and (2) encouraging sigmoidal units, especially in the output layer, to operate in the saturation states, i.e., firmly on or off. The first effect limits the number of hidden units in a network and hence can prevent overfitting. This way of preventing overfitting is rather different from the way in which input noise prevents overfitting; it bears a resemblance to weight-decay methods (e.g., Chauvin 1989). The
Adding Noise During Backpropagation
second effect is related to the effect of input noise, since a small derivative at the output unit contributes to a small |∂f(x, w)/∂x| = |h'(I)||∂I/∂x| and hence a smoother input-output mapping (except at the class boundary). If these two mechanisms to prevent overfitting do not interfere with learning the essential input-output mapping from the training set, one can expect improved generalization performance. The degree of improvement, however, is likely to be problem dependent. Although we performed the analysis of the effects of data and weight noise for the case of on-line backpropagation training, essentially the same type of analysis may be applied to study the effects of noise in normal backpropagation training. That application is given in the Appendix; the results show that adding data noise to the batch (normal) version of the backpropagation algorithm has the same effect on generalization as adding it to the on-line backpropagation algorithm.

5 Training with Langevin Noise
Both data noise and weight noise affect the dynamics of training through e(z, w) during the evaluation of Δw. In contrast, Langevin noise bypasses the neural network and the error function; it is injected directly into the weight changes, as indicated in equation 2.7. Because the effect of Langevin noise on training differs fundamentally from that of data and weight noise, a completely different analysis is required. We analyze Langevin noise using methods of statistical mechanics. The weight update rule for training with Langevin noise reads (Hertz et al. 1989; Rognvaldsson 1994; Guillerm and Cotter 1991)

Δw = −η∇E(w) + ε√(2Tη)   (5.1)
where the noise ε is a gaussian random variable with mean zero and unit standard deviation. Let Δt = η be small and constant. Equation 5.1 can then be viewed as a discretised version of the following continuous-time Langevin equation (Gillespie 1992; Seung et al. 1992):

dw = −∇E(w) dt + ε√(2T dt)   (5.2)
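The discretized rule of equation 5.1 can be checked on a toy error surface. The sketch below assumes a quadratic E(w) = w²/2 and constants of our own choosing (not the experiments reported here), and verifies that the long-run samples follow the stationary Gibbs density exp[−E(w)/T], which for this E is a gaussian with variance T:

```python
import math
import random

random.seed(1)

# Assumed toy error surface E(w) = w^2 / 2, so grad E(w) = w.
T, dt = 0.5, 0.01
w, samples = 0.0, []
for step in range(200000):
    # Discretized Langevin rule: dw = -grad E(w) dt + eps * sqrt(2 T dt), eps ~ N(0, 1).
    w += -w * dt + random.gauss(0.0, 1.0) * math.sqrt(2.0 * T * dt)
    if step >= 20000:          # discard the transient, then record samples
        samples.append(w)

# The stationary Gibbs density exp[-E(w)/T] is here a gaussian with variance T.
var = sum(s * s for s in samples) / len(samples)
assert abs(var - T) < 0.1
```

Reducing T would narrow this stationary distribution around the minimum of E(w), which is the annealing procedure discussed below.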
If one regards w as the coordinate vector of a Brownian particle, then equation 5.2 describes the dynamics of the particle at a temperature T. The particle starts near the origin of the weight space, since w is initialized to a small value at the beginning of training. It then diffuses randomly away from the origin, while the gradient of the training error biases the walk toward a minimum of E. Imagine an ensemble of identical networks that all start from the same initial weight and evolve according to equation 5.1. Owing to the presence of noise, each network in the ensemble will have a different
Guozhong An
w(t) at a given time t. The state of the whole ensemble at any moment can be described by a distribution G(w, t). It is a standard result (van Kampen 1981; Gillespie 1992) that the Langevin equation 5.2 is equivalent to a Fokker-Planck equation for G(w, t). The stationary solution of the Fokker-Planck equation is a Gibbs distribution

G(w) = (1/Z) exp[−E(w)/T]   (5.3)
where Z = ∫ dw exp[−E(w)/T]. It is common to start training with a relatively large amount of Langevin noise and then to reduce it gradually to zero. When the noise is slowly reduced, the Brownian dynamics will lead to a new Gibbs distribution at a lower temperature T. At a very low temperature T, the Gibbs distribution of equation 5.3 is highly peaked at the global minimum of E(w). Thus, Brownian dynamics with annealing minimizes E(w) as the temperature is gradually reduced, the end configuration of the training being the global minimum of E(w). Therefore, annealed Langevin noise has no regularization effect. Langevin noise can lead to improved generalization only if the network is not oversized with respect to the amount of training data. In fact, the minimization procedure is generally applicable to problems where gradient information is available. Geman and Hwang (1986) and Kushner (1987) proved the asymptotic convergence to the global minimum of E(w), provided that the temperature is reduced slowly. Global minimization by Brownian dynamics shares the same annealing ideas as the simulated-annealing algorithm of Kirkpatrick et al. (1983). The simulated-annealing algorithm, however, uses the Metropolis algorithm rather than Brownian dynamics to generate a Gibbs distribution. Performing simulated annealing by the Brownian-dynamics method instead of by the Metropolis method can have advantages, because the Brownian-dynamics method uses the gradient information; it can therefore be expected to be more efficient than the Monte Carlo method. Note, however, that the Metropolis algorithm can handle both continuous and discrete variables, whereas the Brownian-dynamics method is suitable only for continuous variables.

6 Experimental Results
In this section, we describe experiments that we have performed on two test problems, to investigate the following points: (1) whether input noise constrains the neural-network function to be smooth and hence improves generalization, while output noise has no effect on generalization; (2) how weight noise affects generalization; and (3) whether Langevin noise helps to avoid local minima.
6.1 A Regression Problem. Consider a one-dimensional scalar function defined by
y = sin(3(x + 0.8)²) + ε   (6.1)
Equation 6.1 specifies a data-generation process in which we include gaussian noise ε with zero mean and a standard deviation of 0.4 to simulate measurement errors on y. A training set consisting of 15 data points, {(x^μ, y^μ) : μ = 1, 2, ..., 15}, was generated using equation 6.1, with the x^μ uniformly spaced in the interval [−1, 1]. A neural network with one input unit, 15 hidden units, and one output unit, denoted by N1-15-1, was used to fit a curve through the points. The transfer function of the hidden units was taken to be tanh(x). Since we are dealing with a regression problem, the transfer function of the output unit was taken to be linear, i.e., x. Neural networks with this architecture were trained using noise-injection training algorithms and compared with neural networks trained by minimizing the quadratic error E(w). We define a generalization error E_g (equation 6.2). In the limit of infinite training examples, the training error E(w) approaches E_g.

6.1.1 Data Noise. One practical issue in carrying out experiments on data noise is that the on-line backpropagation training algorithm converges very slowly. (With noise injection, we need more than five million iterations to obtain convergence.) Another issue is the need to tune the training parameters to obtain convergence. The conjugate-gradient algorithm was therefore found to be a better training alternative. In that case, the Ẽ(w) of equation 3.5 is approximated by a sum over a finite number of synthetic data points that are generated according to the density g(z) of equation 3.1. This sum essentially represents a Monte Carlo integration of Ẽ. To limit the approximation error, a large synthetic data set is needed. We found that 1000 × N synthetic data points are sufficient for the present problem, in which N = 15 is the number of training examples. To avoid local minima, each training session was repeated four times with a different starting point.
The one that had the smallest training error was used. The results of training with input noise are shown in Figure 1. The dashed line represents the neural-network function f(x, w) determined by minimizing E(w) using the conjugate-gradient algorithm. The gray line corresponds to the neural-network function obtained by training with input noise (T = 0.005). The data-generation model (y) is represented by the solid line. Notice that the dashed line passes through
Figure 1: Smoothing effect of input noise. The dots represent the training examples. The underlying model (y) that generates the training set is given by the solid line. The gray and dashed lines represent the neural-network functions obtained from training with and without input noise, respectively. See the text for a full explanation.

almost all the training patterns; it overfits the training set. The gray line only globally follows the training set and avoids fitting the noise present in the training data. The distance between the gray line and the solid line is clearly smaller than the distance between the dashed line and the solid line. Therefore, the neural network trained with input noise generalizes better than the neural network trained without noise. We found that training with input noise reduces the generalization error for this problem by as much as 61% (Table 1). Notice that training with input noise forces the neural-network function to be smooth. A smoother neural-network function avoids overfitting the training set and leads to better generalization, as predicted in Section 3. It follows from equation 3.22 that the variance of the injected input noise controls the degree of smoothing of the neural-network function. The larger the variance (2T), the smoother the neural-network function will be. This is confirmed by our simulations. We have trained the neural network with input noise levels corresponding to T = 0.002, 0.005, and 0.009. We found that the neural-network functions become progressively smoother with increasing T. The generalization error decreases at first with increasing noise level, but then increases. A value of T = 0.005 gives the best generalization results for
Table 1: Generalization Performance of Training with Noise for the Regression Problem

Noise type       Training error E (10⁻²)   Generalization error E_g (10⁻²)   Reduction in generalization error (%)   Variance 2T (10⁻²)
No noise         0.25                      12.0                              -                                       0
Input noise      1.9                       6.22                              48                                      0.4
Input noise      7.1                       4.69                              61                                      1
Input noise      11                        5.54                              54                                      1.8
Langevin noise   3.4 × 10⁻⁷                11.4                              5                                       -
the present problem. For this value of T, the standard deviation of the noise is 0.75Δx, where Δx = 2/15 is the average interdatum distance in the training set. To investigate the relation between the new cost function Ẽ, the standard error function E, and the noise-induced penalty functions P1 and P2, we measured their values as a function of time (number of iterations) during the minimization of Ẽ using the conjugate-gradient method. Results obtained for a single training process with T = 0.005 are plotted in Figure 2a and b. At the start of training, the neural network is initialized with small weights. In that case, the neural-network function is quasilinear with a small slope. This is reflected by the very small values of TP1 and TP2 at the start of training (Fig. 2a). As training proceeds, the standard error E decreases, while the penalty terms TP1 and TP2 increase. At some stage of training (70 iterations in this case), the reduction of E is counterbalanced by the increase in the penalty terms. This explains the ability of input noise to avoid overfitting. In Figure 2b, we have plotted the new cost function Ẽ, together with E + T(P1 + P2) and the generalization error E_g, as a function of the training time. The figure shows that E_g decreases almost monotonically with training time. It also indicates that the difference between Ẽ and E is well approximated by T(P1 + P2). In contrast to the input-noise case, we find practically no difference between the neural-network functions obtained from training with and without the output noise. This result is in agreement with our predictions as given in Section 3. It is instructive to examine the synthetic data generated by the input noise (Fig. 3a) and by the output noise (Fig. 3b). The synthetic data generated by the input noise force the neural-network function to vary slowly with x, since an interval of x corresponds to a few values of y, as can be seen from Figure 3a.

In comparison, the synthetic data generated by the output noise spread vertically about the original data points. These synthetic data share the same best fit with the original training set. That is why the output noise does not regularize and hence has no effect on generalization.
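The geometric difference between the two kinds of synthetic data can be reproduced in a few lines. The sketch below uses an assumed straight-line least-squares fit purely as a stand-in for the network, with made-up jitter sizes; it builds both kinds of synthetic data for the training set of equation 6.1 and checks that vertically jittered (output-noise) copies leave the best fit essentially unchanged:

```python
import math
import random

random.seed(2)

# Original training set from equation 6.1: y = sin(3 (x + 0.8)^2) + measurement noise.
N = 15
xs = [-1.0 + 2.0 * i / (N - 1) for i in range(N)]
ys = [math.sin(3.0 * (x + 0.8) ** 2) + random.gauss(0.0, 0.4) for x in xs]

def lsq_slope(pts):
    # Least-squares slope of a line fit y = a + b x (a stand-in for f(x, w)).
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    den = sum((x - mx) ** 2 for x, _ in pts)
    return sum((x - mx) * (y - my) for x, y in pts) / den

base = lsq_slope(list(zip(xs, ys)))

# Output-noise synthetic data (Fig. 3b): jitter y only; the best fit is unchanged on average.
K, sigma = 2000, 0.1
out_noise = [(x, y + random.gauss(0.0, sigma)) for _ in range(K) for x, y in zip(xs, ys)]
assert abs(lsq_slope(out_noise) - base) < 0.02

# Input-noise synthetic data (Fig. 3a): jitter x only; this biases the fit toward
# smaller slopes (the smoothing effect), so the best fit does change.
in_noise = [(x + random.gauss(0.0, sigma), y) for _ in range(K) for x, y in zip(xs, ys)]
attenuated = lsq_slope(in_noise)
```

Because the least-squares fit is linear in the targets, zero-mean vertical jitter cancels in expectation, while horizontal jitter enters the fit nonlinearly and shrinks the slope; the same contrast holds for the network fit.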
Figure 2: Time evolution of the standard training error E and the penalty terms induced by the input noise. See the text for a full explanation.

6.1.2 Weight Noise. To investigate the effects of weight noise, we have trained neural networks using the on-line backpropagation algorithm with noise added to all the weights, including the biases. We varied T from 2 × 10⁻⁴ to 0.02. The learning rate η was set to 0.01. Without noise addition, the training error decreases to a small value after 50,000 × 15 weight updates. It eventually decreases further
Figure 3: (a) Synthetic data generated by the input noise; (b) synthetic data generated by the output noise.

after very long iterations (500,000 × 15 weight updates). With noise addition, the convergence is even slower. The slowness of convergence of backpropagation training implies that one cannot be sure that complete convergence has been reached. To minimize the potential regularization effect associated with early stopping (Sjoberg and Ljung 1992), all the neural networks were trained with the same number of iterations (50,000 × 15 weight updates) instead of waiting for each to reach complete
convergence. We found that fluctuations in the training error exist even after a very long training time. To reduce the dependence of our results on these fluctuations, the training and generalization errors were averaged over 10 weight configurations sampled at intervals of 100 weight updates just before training was stopped. We found smaller improvements in the generalization performance than in the case of input noise. The largest reduction in the generalization error was found to be 25% at T = 0.005, which is significantly less than the 61% achieved using input noise. At low noise levels (T ≤ 2 × 10⁻⁴), the noise has very little effect. At high noise levels (T > 0.02), the neural-network function attains a quasilinear form and the generalization performance deteriorates. To verify the predicted mechanisms by which weight noise improves generalization, we examined the hidden-layer activations and derivatives. With no noise injection, it was found that all 15 hidden units contribute to the network output. At T = 1.25 × 10⁻³, on average only five hidden units out of 15 had nonnegligible contributions. When the noise level was increased to T = 0.01, the number of active hidden units was further reduced to three. This result stands in contrast to training with input noise, where most of the hidden units contribute to the network output. We further found that the active hidden units had activations close to either −1 or 1. The foregoing observations are in agreement with our predictions that weight noise reduces the number of hidden units and encourages small derivatives at sigmoidal units.
6.1.3 Langevin Noise. The neural-network function obtained using Brownian dynamics with a gradually decreasing T is shown as the dashed curve in Figure 4. The temperature T was decreased exponentially with the number of iterations during training. The solid curve represents the neural-network function obtained using the conjugate-gradient algorithm. We see that the Langevin noise indeed has no regularization effect, although it is more effective in finding the global minimum of E(w), as manifested by the perfect fit of the training set (shown by the dots). The training errors and generalization errors for the various neural-network functions are given in Table 1.
6.2 A Classification Problem. Our second example is a two-class classification problem. Through this example, we demonstrate how input noise and weight noise affect the decision boundaries and how they improve a neural network's generalization performance for classification problems. The classification problem under consideration is defined by two overlapping bivariate normal distributions. Let N(μ, Σ) be a bivariate normal distribution with mean μ and covariance matrix Σ. The joint probability
Figure 4: Training with Langevin noise. The dashed and solid curves represent the neural-network functions obtained using Brownian dynamics and the conjugate-gradient algorithm, respectively. Note that the neural-network function obtained using Brownian dynamics (the dashed curve) fits the training set perfectly. This suggests that the dashed curve is at the global minimum of E(w), whereas the solid curve is at a local minimum.

that x will be found and that it belongs to the first class, Class A, is given by

P(A, x) = (1/2) N(μ_A, Σ)   (6.3)

The joint probability that x will be found and that it belongs to the second class, Class B, is given by

P(B, x) = (1/2) N(μ_B, Σ)   (6.4)
The means and covariances appearing in equations 6.3 and 6.4 are given by

μ_A = (1, −1)   (6.5)
μ_B = (−1, 1)   (6.6)
Σ = I   (6.7)
where I is the two-dimensional identity matrix. Assuming the cost of misclassifying Class A is the same as that of misclassifying Class B, one can derive the optimal classification rule for a two-class classification problem (e.g., Pao 1989): classify x as
Class A Class B
if P(A,x) > P ( B , x) otherwise
(6.8)
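With equal class priors and a shared identity covariance, the rule of equation 6.8 amounts to assigning x to the class with the nearer mean. A minimal sketch (the test points are our own):

```python
import math

# Bayes rule of equation 6.8 for the two-gaussian problem:
# mu_A = (1, -1), mu_B = (-1, 1), Sigma = I, equal priors of 1/2.
def density(x, mu):
    # Bivariate normal density with identity covariance (the 1/2 prior cancels).
    d2 = (x[0] - mu[0]) ** 2 + (x[1] - mu[1]) ** 2
    return math.exp(-d2 / 2.0) / (2.0 * math.pi)

def classify(x):
    return "A" if density(x, (1.0, -1.0)) > density(x, (-1.0, 1.0)) else "B"

# The two densities are equal on the line x1 = x2; points with x1 > x2 are
# closer to mu_A, points with x1 < x2 are closer to mu_B.
assert classify((2.0, 0.0)) == "A"
assert classify((0.0, 2.0)) == "B"
assert classify((1.0, -3.0)) == "A"
```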
The decision boundary implied by equation 6.8 is thus described by the equation P(A, x) = P(B, x). Using equations 6.3 and 6.4, we find that the ideal boundary for the present problem is described by the equation x₁ = x₂. A total of 40 training examples were generated, 20 from each class. We assigned a desired output of −0.9 to Class A and 0.9 to Class B. The architecture of the neural network was chosen to be N2-10-1. Both the hidden and the output units were assigned a tanh(x) transfer function. The neural network had 40 parameters (weights). With the chosen architecture, the number of weights in the neural network matches the number of training examples. Training was performed using the conjugate-gradient algorithm in all cases. The decision boundary of the neural network is taken to be f(x, w) = 0 for the present classification problem. Gaussian noise with zero mean was used in all cases.

6.2.1 Data Noise. In agreement with our results on the regression problem and with our theoretical analysis, input noise also makes the neural-network function smoother for classification problems. With an appropriately chosen noise strength T, the neural-network decision boundary comes close to the ideal one. As a consequence, the generalization performance is found to be significantly improved. The results of training without noise injection are presented in Figure 5. The neural-network function f(x, w) is plotted in Figure 5a, and the training set and the decision boundary are shown in Figure 5b. The diamonds and the stars represent the training examples from Class A and Class B, respectively. The ideal and the neural-network classification boundaries are represented by the thin and thick lines, respectively. The figure shows that the neural network adapts so much to the training set that it classifies the training set perfectly. It does so, however, by forming complex boundaries.
Poor generalization is evident, since the neural-network decision boundary differs considerably from the ideal one, as can be seen from Figure 5b. The results of training with input noise (T = 0.125) are presented in Figure 6. It is clear from a comparison of Figures 5 and 6 that input noise smoothes the neural-network function. Input noise causes the decision boundary to be less well adapted to the training data and to be closer to the ideal one. Improved generalization is apparent. To quantify the generalization performance, we measured the classification error on a large test set (10,000 data points) that was generated in the same way as the training set. The generalization performance of training with input noise is summarized in Table 2. As can be seen from the table, training with noise reduces the misclassification rate from 27 to 9%.

6.2.2 Weight Noise. We found, in agreement with previously reported results (Murray and Edwards 1994), that weight noise improves the generalization performance for classification problems. The lowest misclas-
Figure 5: Training without noise injection for a classification problem: (a) the neural-network function; (b) the classification boundaries and the training data. The thick line and the thin line represent, respectively, the neural-network classification boundary and the ideal classification boundary. The diamonds and stars denote the training examples from Class A and Class B, respectively. Note that the neural network makes a perfect classification on the training set by forming complex classification boundaries.
Table 2: Generalization Performance of Training with Input Noise for the Classification Problem^a

Noise level T   Misclassification on training set (%)   Misclassification on test set (%)
0               0                                       27
0.02            3                                       17
0.125           10                                      9

^a The Bayes classification error for this problem is 8%. Note that, with a noise strength T = 0.125, the generalization performance differs from the optimal classification by only 1%.
sification rate that we found on the test set was 11%, which is comparable to the 9% misclassification rate achieved using input noise (Tables 2 and 3). For T > 0.018, the network function on one side of the classification boundary had an almost constant −1 value, which changed abruptly to a constant 1 value on the other side of the boundary. This is in line with our prediction that weight noise forces a small derivative |h'(I)| at the output unit, which in turn leads to flat network outputs. In contrast to the regression problem, we found that even with noise addition, all 10 hidden units contribute to the network output. This finding is also in line with the theory that penalizing large derivatives at the sigmoidal units weakens the effect of penalizing large hidden-layer activations. For the present problem, we may conclude that the smoothing effect, i.e., the small-derivative effect, is the primary mechanism in improving generalization. To verify this, we applied noise only to the output bias, since output-bias noise generates a penalty P = T Σ_μ [h″(I)(y^μ − f^μ) + h'(I)²]/N (cf. equation 4.10), which favors small |h'(I)| without affecting the hidden activations. As shown in Table 3, applying noise only to the output bias indeed improved the generalization as much as applying noise to all the weights. The best generalization was achieved at a temperature of T_o = 0.125.

7 Conclusions
In this article, we have studied three types of noise injection: data noise, weight noise, and Langevin noise. A rigorous analysis was presented of how the various types of noise affect the learning cost function. The noise-induced penalty functions were computed and analyzed in the weak-noise limit, and their properties were related to the generalization performance of training with noise. Experiments were performed on a regression problem and on a classification problem to illustrate and test the formal analysis.
Figure 6: Same as Figure 5 except that the neural network is trained with input noise (T = 0.125): (a) the neural-network function; (b) the classification boundaries and the training set. Note that the neural-network function is smoother than that of Figure 5, and that the decision boundary is closer to the ideal one. Input noise is shown to add two penalty terms to the standard error function. One is identical to a smoothing term as found in regularization theory, while a second term depends on the fitting residuals. The
Table 3: Generalization Performance of Training with Weight Noise for the Classification Problem^a

Noise type    Misclassification on training set (%)   Misclassification on test set (%)   Noise level T_o   Noise level T_h
Output bias   0                                       23.3                                0.005             0
Output bias   7.5                                     11.1                                0.125             0
All weights   0                                       24                                  0.011             0.011
All weights   7.5                                     10.8                                0.018             0.018
All weights   7.5                                     11                                  0.045             0.045

^a T_o and T_h denote the temperature at the output and the hidden layer, respectively.
main effect of the input-noise-induced penalty terms is to constrain the neural network to be less sensitive to variations in the input data. This smoothing effect can be beneficial to the generalization performance. In contrast to input noise, output noise results only in a constant term in the cost function, and hence does not improve the generalization performance. We showed that weight noise induces changes in the cost function that are formally similar to those of input noise. The penalty functions induced respectively by the input noise and the weight noise would be identical if differentiations with respect to the inputs and differentiations with respect to the weights were interchanged. Despite such formal similarities, we argued that weight noise has a different effect on the generalization performance than input noise. Weight noise constrains the neural network to be less sensitive to variations in the weights instead of variations in the inputs. We further demonstrated that weight noise encourages sigmoidal units to operate close to saturation and discourages both large weights in the output layer and large activations in the hidden layer. On both test problems, input noise significantly reduces the generalization error. On our classification problem, weight noise improves the generalization as much as input noise. In the regression problem, however, weight noise improves the generalization substantially less than input noise. Owing to the limited scale of the test cases, these observations may not hold in general. We argued that training with annealed Langevin noise results in a global minimization similar to that of simulated annealing. Therefore, training with annealed Langevin noise has no regularization effect, although it can be effective in finding the global minimum, as demonstrated by our experiment.
Appendix: Noise Injection During Normal Backpropagation Training

In this appendix, we show how the analysis presented in Section 3 can be extended to normal backpropagation training. Define a vector that represents the whole training set by Z₀ = (z¹, z², ..., z^μ, ..., z^N), and make the dependency of E(w) on the training set explicit by writing E(Z₀, w) = E(w). Let Z be a vector in the same space as Z₀. The standard error E(w) can again be written in the form of equation 3.3, with u corresponding to Z, e(u, w) to E(Z, w), and g(Z) = δ(Z − Z₀). By applying the stochastic gradient-descent algorithm to this form of E(w), we rederive the backpropagation training algorithm. Therefore, both the on-line backpropagation and the normal backpropagation training algorithm can be treated as special cases of the stochastic gradient-descent algorithm. Consider now a set of N data points represented by Z = (Z₁, Z₂, ..., Z_N), where Z_μ denotes a data point that we have until now denoted by z^μ. In the vector space of Z, the training set is represented by a single vector Z₀ = (z¹, z², ..., z^N). Recall that backpropagation training is described by the following weight update equation:

w_{t+1} = w_t − η_t ∇_w E(Z₀, w)   (A.1)
This training algorithm, although deterministic in nature, may be treated as a special case of the stochastic gradient-descent algorithm by interpreting Z₀ as a realization of the following distribution:

g(Z) = δ(Z − Z₀)

Bearing this in mind, it is clear that with the data-noise injection procedure detailed in equation 2.5, the training algorithm is now described by

w_{t+1} = w_t − η_t ∇_w E(Z, w)   (A.2)
where Z is randomly drawn from the density given in equation A.3. Therefore, the cost function that is minimized by equation A.2 has the form

C(w) = ∫ E(Z, w) g(Z) dZ   (A.4)
Substituting equations 2.2 and A.3 into equation A.4 and simplifying the integration over Z, we obtain equation A.5.
Renaming the dummy integration variable Z_μ in equation A.5 as z, we obtain

C(w) = (1/N) Σ_{μ=1}^{N} ∫ e(z, w) φ(z − z^μ) dz   (A.6)
This is exactly the Ẽ(w) given by equation 3.5, with g(z) given by equation 3.1.
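This equivalence can be spot-checked numerically. For an assumed quadratic loss e(z, w) = (y − wx)² with gaussian input jitter of width σ (a stand-in for the network loss, not the paper's experiment), the smoothed per-pattern cost has the closed form (y − wx)² + w²σ²; averaging the loss over many jittered copies of the pattern, as in equation A.6, reproduces it:

```python
import random

random.seed(3)

# One pattern z = (x, y), quadratic loss e(z, w) = (y - w x)^2, gaussian input jitter.
x, y, w, sigma = 0.5, 1.0, 2.0, 0.3

# Gaussian-smoothed cost for this loss: E[(y - w (x + zeta))^2] = (y - w x)^2 + w^2 sigma^2.
exact = (y - w * x) ** 2 + (w * sigma) ** 2

# Monte Carlo estimate: average the loss over many jittered copies of the input.
K = 200000
mc = sum((y - w * (x + random.gauss(0.0, sigma))) ** 2 for _ in range(K)) / K
assert abs(mc - exact) < 0.01
```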
Acknowledgments
I thank Wim Schinkel for many helpful suggestions that have led to numerous improvements in the presentation of this work.

References

Arfken, G. 1985. Mathematical Methods for Physicists. Academic Press, New York.
Bishop, C. M. 1995. Training with noise is equivalent to Tikhonov regularization. Neural Comp. 7, 108-116.
Bottou, L. 1991. Stochastic gradient learning in neural networks. Proc. NEURO-NIMES'91, EC2, Nanterre, France.
Chauvin, Y. 1989. A back-propagation algorithm with optimal use of hidden units. In Advances in Neural Information Processing Systems 1, D. Touretzky, ed., pp. 519-526. Morgan Kaufmann, San Mateo, CA.
Clay, R., and Sequin, C. 1992. Fault tolerance training improves generalization and robustness. Proc. Int. Joint Conf. Neural Networks, IEEE Neural Council, Baltimore, I-769-774.
Drucker, H., and Le Cun, Y. 1992. Improving generalisation performance using double back-propagation. IEEE Trans. Neural Networks 3, 991-997.
Geman, S., and Hwang, C. 1986. Diffusion for global optimization. SIAM J. Control Optim. 25, 1031-1043.
Gillespie, D. 1992. Markov Processes. Academic Press, London.
Guillerm, T., and Cotter, N. 1991. A diffusion process for global optimization in neural networks. Proc. Int. Joint Conf. Neural Networks, I-335.
Guyon, I., Vapnik, V., Boser, B., Bottou, L., and Solla, S. A. 1992. Structural risk minimization for character recognition. In Advances in Neural Information Processing Systems 4 (NIPS 91), J. Moody et al., eds., pp. 471-479. Morgan Kaufmann, San Mateo, CA.
Hanson, S. J. 1990. A stochastic version of the delta rule. Physica D 42, 265-272.
Hertz, J., Krogh, A., and Thorbergsson, G. 1989. Phase transitions in simple learning. J. Phys. A: Math. Gen. 22, 2133-2150.
Hinton, G. E. 1986. Learning distributed representations of concepts. Proc. Eighth Annu. Conf. Cog. Sci. Soc., Amherst, 1-12.
Hinton, G. E., and van Camp, D. 1993. Keeping neural networks simple by minimizing the description length of the weights.
Sixth ACM Conf. Comp. Learning Theory (Santa Cruz), 5-13.
Holmstrom, L., and Koistinen, P. 1992. Using additive noise in back-propagation training. IEEE Trans. Neural Networks 3, 24-38.
Kendall, G. D., and Hall, T. J. 1993. Optimal network construction by minimum description length. Neural Comp. 5, 210-212.
Kendall, M., and Stuart, A. 1977. The Advanced Theory of Statistics, Vol. 1, 4th ed. Charles Griffin, London.
Kirkpatrick, S., Gelatt, C., and Vecchi, M. 1983. Optimization by simulated annealing. Science 220, 671-680.
Krogh, A., and Hertz, J. A. 1992. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems 4 (NIPS 91), J. E. Moody et al., eds., pp. 950-957. Morgan Kaufmann, San Mateo, CA.
Kushner, H. 1987. Asymptotic global behavior for stochastic approximation and diffusions with slowly decreasing noise effects: Global minimization via Monte Carlo. SIAM J. Appl. Math. 47, 169-185.
MacKay, D. J. C. 1992. Bayesian interpolation. Neural Comp. 4, 415-447.
Matsuoka, K. 1992. Noise injection into inputs in back-propagation learning. IEEE Trans. Sys. Man Cybern. 22, 436-440.
Murray, A. F., and Edwards, P. J. 1993. Synaptic weight noise during multilayer perceptron training: Fault tolerance and training improvements. IEEE Trans. Neural Networks 4, 722-725.
Murray, A. F., and Edwards, P. J. 1994. Enhanced MLP performance and fault tolerance resulting from synaptic weight noise during training. IEEE Trans. Neural Networks 5, 792-802.
Neuralware. 1991. Reference Guide: NeuralWorks Professional II/Plus and NeuralWorks Explorer. Neuralware, Inc., Pittsburgh.
Nowlan, S. J., and Hinton, G. E. 1992. Simplifying neural networks by soft weight-sharing. Neural Comp. 4, 473-493.
Pao, Y. H. 1989. Adaptive Pattern Recognition and Neural Networks, chap. 2, pp. 25-35. Addison-Wesley, Reading, MA.
Poggio, T., and Girosi, F. 1990. Networks for approximation and learning. Proc. IEEE 78, 1481-1497.
Reed, R., Oh, S., and Marks, R. J., II. 1992. Regularization using jittered training data. Proc. Int. Joint Conf. Neural Networks, IEEE Neural Council, Baltimore, III-147.
Rissanen, J. 1989. Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore.
Rognvaldsson, T. 1994. On Langevin updating in multilayer perceptrons. Neural Comp. 6, 916-926.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536.
Seung, H., Sompolinsky, H., and Tishby, N. 1992. Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056-6091.
Sietsma, J., and Dow, R. 1988. Neural network pruning-why and how. Proc. IEEE Int. Conf. Neural Networks I, 325-333.
Sjoberg, J., and Ljung, L. 1992. Overtraining, regularisation, and searching for minimum in neural networks. Proc. Symp. Adaptive Systems Control Signal Process., Grenoble, France.
674
Guozhong An
van Kampen, N. G. 1981. Stochastic Processes in Physics arid Cheniistry. NorthHolland, Amsterdam. Weigend, A., Rumelhart, D., and Huberman, B. 1991. Generalization by weightelimination with application to forecasting. In Advances in Neurnl Znformation Processing System 3 (NIPS 90), J. M. R. P. Lippmann and D. Touretsky, eds., pp. 875-882. Morgan Kaufmann, San Mateo, CA. White, H. 1989. Learning in artificial neural networks: A statistical perspective. Neural Comp. I, 425-464.
Received January 10, 1995; accepted July 14, 1995.
Communicated by Don R. Hush
ARTICLE
Stable Encoding of Large Finite-State Automata in Recurrent Neural Networks with Sigmoid Discriminants

Christian W. Omlin
NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 USA

C. Lee Giles
NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 USA, and UMIACS, University of Maryland, College Park, MD 20742 USA
We propose an algorithm for encoding deterministic finite-state automata (DFAs) in second-order recurrent neural networks with sigmoidal discriminant functions, and we prove that the languages accepted by the constructed network and the DFA are identical. The desired finite-state network dynamics is achieved by programming a small subset of all weights. A worst case analysis reveals a relationship between the weight strength and the maximum allowed network size that guarantees finite-state behavior of the constructed network. We illustrate the method by encoding random DFAs with 10, 100, and 1000 states. While the theory predicts that the weight strength scales with the DFA size, we empirically find the weight strength to be almost constant for all the random DFAs. These results can be explained by noting that the generated DFAs represent average cases. We empirically demonstrate the existence of extreme DFAs for which the weight strength scales with the DFA size.

1 Introduction
It is possible to train recurrent neural networks to behave like deterministic finite-state automata (Elman 1990; Frasconi et al. 1991; Giles et al. 1992; Pollack 1991; Servan-Schreiber et al. 1991; Watrous and Kuhn 1992). The internal representation of learned DFA states can deteriorate due to the dynamic nature of recurrent networks, making predictions about the generalization performance of trained recurrent networks difficult (Zeng et al. 1993). Methods for constructing DFAs in recurrent networks with hard-limiting discriminant functions have been proposed (Alon et al. 1991; Horne and Hush 1994; Minsky 1967); methods for constructing networks with sigmoidal and radial-basis discriminant functions have also been discussed (Frasconi et al. 1993; Gori et al. 1994; Giles and Omlin 1993). We prove that recurrent networks with continuous sigmoidal discriminant functions can be constructed such that the encoded finite-state dynamics

Neural Computation 8, 675-696 (1996)

© 1996 Massachusetts Institute of Technology
remains stable indefinitely. Notice that we do not claim that such a stable representation can be learned.

2 Encoding DFA Dynamics

2.1 Finite-State Automata. A deterministic finite-state automaton (DFA) is an acceptor for a regular language L(M). Formally, a DFA M is a 5-tuple M = (Σ, Q, R, F, δ), where Σ = {a_1, ..., a_m} is the alphabet of the language L, Q = {q_1, ..., q_n} is a set of states, R ∈ Q is the start state, F ⊆ Q is a set of accepting states, and δ: Q × Σ → Q defines the state transitions in M. A string is accepted by the DFA M if, after the entire string has been read, M is in an accepting state; otherwise, the string is rejected.
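To make the definition concrete, a DFA can be written down directly as the 5-tuple above (a minimal sketch; the example machine, which accepts strings over {0, 1} containing an even number of 1s, is ours and not from the paper):

```python
# DFA M = (Sigma, Q, R, F, delta), following the definition above.
# Example machine (ours): accepts strings with an even number of 1s.
Sigma = ["0", "1"]
Q = ["q1", "q2"]
R = "q1"                      # start state
F = {"q1"}                    # accepting states
delta = {                     # delta : Q x Sigma -> Q
    ("q1", "0"): "q1", ("q1", "1"): "q2",
    ("q2", "0"): "q2", ("q2", "1"): "q1",
}

def accepts(string):
    """Accept iff the DFA ends in an accepting state after reading string."""
    state = R
    for symbol in string:
        state = delta[(state, symbol)]
    return state in F

print(accepts("0110"), accepts("010"))  # → True False
```

This representation is what the encoding algorithm of the next section consumes: one transition entry per (state, symbol) pair.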
2.2 Recurrent Network. We implement DFAs in discrete-time recurrent networks with second-order weights W_ijk. The continuous network dynamics are described by the following equation:

S_i^(t+1) = h( b_i + Σ_{j,k} W_ijk S_j^t I_k^t ),   h(x) = 1 / (1 + e^(−x))   (2.1)

where b_i is the bias associated with hidden recurrent state neuron S_i, and I_k denotes input neuron k. The product S_j^t I_k^t directly corresponds to the state transition δ(q_j, a_k) = q_i. After a string has been processed, the output of a designated neuron S_0 decides whether the network accepts or rejects it: the network accepts a given string if the value of the output neuron S_0 at the end of the string is greater than 0.5, and rejects the string otherwise. For the remainder of this paper, we assume a one-hot encoding for input symbols a_k, i.e., I_k^t ∈ {0, 1}.

2.3 Encoding Algorithm. The encoding algorithm achieves a nearly orthonormal internal representation of the desired DFA dynamics; from a DFA with n states and m input symbols, it constructs a network with n + 1 recurrent state neurons (including the output neuron) and m input neurons. There is a one-to-one correspondence between state neurons S_i and DFA states q_i. For each DFA state transition δ(q_j, a_k) = q_i, we set W_ijk to a large positive value +H. The self-connection W_jjk is set to −H (i.e., neuron S_j changes its state from high to low), except for state transitions δ(q_j, a_k) = q_j (self-loops), where W_jjk is set to +H (i.e., the state of neuron S_j remains high). Furthermore, if state q_i is an accepting state, then we program the weight W_0jk to +H; otherwise, we set W_0jk to −H. We set the bias terms b_i of all state neurons S_i to −H/2. For each DFA state transition, at most three weights of the network have to be programmed. The initial state S^0 of the network is S^0 = (S_0^0, 1, 0, 0, ..., 0). The value of the response neuron S_0^0 is 0 if the DFA's initial state q_0 is a rejecting state and 1 otherwise. All weights that are not set to −H, −H/2, or +H are set to zero. The question this paper addresses is whether the value of H can be chosen such that the finite-state dynamics in a recurrent network remains indefinitely stable.

3 Analysis
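Before analyzing stability, the construction of Section 2.3 can be exercised end to end (our illustrative sketch, not the authors' code; the two-state example DFA, the value H = 8, and all variable names are ours). It programs the weights W_ijk ∈ {−H, 0, +H} and biases −H/2 as described, then runs the dynamics of equation 2.1:

```python
import math

H = 8.0
n, m = 2, 2                       # DFA states q_1..q_n, symbols a_1..a_m
accepting = {1}                   # toy DFA (ours): q_1 accepts (even number of 1s)
start = 1
# delta[(j, k)] = i  means  delta(q_j, a_k) = q_i
delta = {(1, 0): 1, (1, 1): 2, (2, 0): 2, (2, 1): 1}

# Second-order weights W[i][j][k] over n+1 state neurons (neuron 0 = output S_0).
W = [[[0.0] * m for _ in range(n + 1)] for _ in range(n + 1)]
b = [-H / 2.0] * (n + 1)
for (j, k), i in delta.items():
    W[i][j][k] = +H                            # drive the target neuron high
    W[j][j][k] = +H if i == j else -H          # self-loop keeps source high
    W[0][j][k] = +H if i in accepting else -H  # program the response neuron

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def network_accepts(string):
    S = [0.0] * (n + 1)
    S[0] = 1.0 if start in accepting else 0.0
    S[start] = 1.0
    for ch in string:
        k = int(ch)                            # one-hot input: I_k = 1
        S = [sigmoid(b[i] + sum(W[i][j][k] * S[j] for j in range(n + 1)))
             for i in range(n + 1)]
    return S[0] > 0.5

for s in ("", "0110", "010", "111"):
    print(repr(s), network_accepts(s))
```

On the strings tested here, the network's accept/reject decisions agree with the DFA's, and at most three weights are programmed per transition, in line with the count of at most 3mn nonzero second-order weights stated later in Theorem 3.4.1.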
We prove the stability of DFA encodings in recurrent neural networks for strings of arbitrary length. Due to space limitations, we give only the proofs of the theorems that establish our results; for proofs of the auxiliary lemmas, see Omlin and Giles (1994).

3.1 Fixed Point Analysis for the Sigmoidal Discriminant Function. Recall that the recurrent network changes its state according to equation 2.1. Our DFA encoding algorithm yields a special form of that equation describing the dynamics of a constructed network:

S_i^(t+1) = h(x_i, H) = 1 / (1 + e^((H/2)(1 − 2 x_i)))   (3.1)

The bias term −H/2 is common to all state neurons; H x_i is the weighted sum feeding into neuron S_i^(t+1). Under certain conditions, the discriminant function h(·) has fixed points that allow a stable internal representation of DFA states.

3.2 Network Dynamics as Iterated Functions. When a network processes a string, the state neurons go through a sequence of state changes. The network state at time t + 1 is computed from the network state at time t, its current input, and its weights. Since the discriminant function h(·) is the same for all state neurons, these network state changes can be represented as iterations of h(·) for each state neuron.

A network will correctly classify strings of arbitrary length only if its internal DFA state representation remains sufficiently stable. Stability can be guaranteed only if the neurons are shown to operate near their saturation regions, which requires a sufficiently high gain of the sigmoidal discriminant function h(·). One way to achieve stability is thus to show that the iteration of the discriminant function h(·) converges toward fixed points in these regions, i.e., points x for which h(x, H) = x. This observation is the basis for a quantitative analysis that establishes bounds on the network size and the weight strength H which guarantee a stable internal representation for arbitrary DFAs.
Given a stable DFA encoding in which neurons operate near their saturation regions, each neuron can send two kinds of signals to other neurons:

1. High signals: If neuron S_i represents the current DFA state q_i, then S_i^t will be high.

2. Low signals: Neurons S_i^t that do not represent the current DFA state have a low output.
Recall that the argument of the discriminant function h(x, H) is the sum x of unweighted signals, together with the weight strength H. We now expand the term x to account for the different kinds of signals that are present in a neural DFA. From the DFA encoding algorithm, we can derive four different types of neuron state changes, low → high, high → high, high → low, and low → low, whose net inputs are given in equations 3.3-3.8. The residual inputs to neuron S_i^(t+1) on input symbol a_k come from the set

C_{j,k} = { S_l | W_ilk = H, l ≠ i, l ≠ j }   (3.9)

The inputs I_k are not shown explicitly, since we assume that each input symbol is assigned a separate input neuron in a one-hot encoding. The DFA state transitions corresponding to these types of neuron state changes are shown in Figure 1. The signals S_j^t and S_i^t represent the principal contributions to the input of neuron S_i^(t+1), i.e., those responsible for driving the output of neuron S_i^(t+1) low or
Figure 1: Neuron state changes and corresponding DFA state transitions: Panels (a)-(f) illustrate the DFA state transitions corresponding to all possible state changes of neuron S_i; the DFA state(s) participating in the current transition are marked with t and t + 1. (a) low → high (with self-loop on q_j), (b) low → high (with self-loop on q_i), (c) high → high (necessarily a self-loop on q_i), (d) high → low (necessarily no self-loop on q_i), (e) low → low (with self-loop on q_j), (f) low → low (no self-loop on q_j). Notice that even though state q_i is neither the source nor the target of the current state transition in cases (e) and (f), the corresponding state neuron S_i still receives residual inputs from state neurons S_{l_1}, ..., S_{l_r}.
high. All other terms are the residual contributions to the input of neuron S_i^(t+1). The term Σ_{S_l ∈ C_{j,k}} S_l^t contributes to the total input of state neuron S_i^(t+1) if there are other transitions δ(q_l, a_k) = q_i in the DFA from which the recurrent network is constructed. Since there is a one-to-one correspondence between state neurons and DFA states, there will always be a negative contribution −S_i^t for the current DFA state transition δ(q_j, a_k) = q_i (assuming q_i ≠ q_j); i.e., only S_i^t can drive the signal S_i^(t+1) low. The above equations account for all possible contributions to the net input of all state neurons, because the encoding algorithm constructs a sparse recurrent network. For a worst case analysis, it suffices to investigate the cases of maximum and minimum neuron inputs for low and high signals, respectively. Equations 3.3-3.9 condense to the following two equations:

low → low:   S_i^(t+1) ≤ h(r · S̃^t, H)   (3.10)

low → high:   S_i^(t+1) ≥ h(S_j^t − S_i^t, H)   (3.11)

where S̃^t denotes the largest low signal at time t and r is the maximum number of residual inputs any neuron receives. We now define a new function h̃(x_i, H) that takes the residual inputs into consideration. Let Δx_i^t denote the residual neuron inputs to neuron S_i^(t+1). Then the function h̃(x_i, H) is recursively defined as

h̃(x_i^(t+1), H) = h(x_i^t + Δx_i^t, H)   (3.12)

The initial values for low and high signals are x_i^0 = 0 and x_i^0 = 1, respectively. The magnitude of the residual input Δx_i depends on the coupling between recurrent state neurons. Neurons that are connected to a large number of other neurons receive a larger residual input than neurons that are connected to only a few. Consider the neuron S_{i_0} that receives residual input from the largest number r of neurons, i.e., Δx_i ≤ Δx_{i_0}. To show network stability, it suffices to assume the worst case in which all neurons receive the same amount of residual input at a given time index t. This assumption is valid since the initial value of all neurons except the neuron corresponding to the DFA's start state is 0.

3.3 Fixed Point Analysis for the Sigmoidal Discriminant Function. In order to guarantee that low signals remain low, we have to give a tight upper bound on low signals that remains valid for an arbitrary number of time steps.
Figure 2: Fixed points of the sigmoidal discriminant function: Shown are the graphs of the function f(x, r) = 1/(1 + e^((H/2)(1 − 2rx))) (dashed graphs) for H = 8 and r = {1, 2, 4, 10}, and of the function p(x, u) = 1/(1 + e^((H/2)(1 − 2(x − u)))) (dotted graphs) for H = 8 and u = {0.0, 0.1, 0.4, 0.9}. Their intersections with the function y = x show the existence and location of fixed points. In this example, f(x, r) has three fixed points for r = {1, 2} but only one fixed point for r = {4, 10}, and p(x, u) has three fixed points for u = {0.0, 0.1} but only one fixed point for u = {0.4, 0.9}.

Lemma 3.3.1. The low signals are bounded from above by the fixed point φ_f^− of the function

f^0 = 0,   f^(t+1) = h(r · f^t, H)   (3.13)

i.e., we have Δx_i^(t+1) = r · f^t, since x_i^0 = 0 for low signals in equation 3.12. This lemma can easily be proven by induction on t. It is easy to see that the function to be iterated in equation 3.13 is

f(x, r) = 1 / (1 + e^((H/2)(1 − 2rx)))

The graphs of this function are shown in Figure 2 for different values of the parameter r. The function f(x, r) has some desirable properties (Omlin and Giles 1995):
Lemma 3.3.2. For any H > 0, the function f(x, r) has at least one fixed point φ_f.

Lemma 3.3.3. There exists a value H_0(r) such that for any H > H_0(r), f(x, r) has three fixed points 0 < φ_f^− < φ_f^0 < φ_f^+ < 1.
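The dichotomy in Lemma 3.3.3 is easy to observe numerically (a small sketch of ours; H = 8 as in Figure 2). Iterating f from the two extreme seeds x = 0 and x = 1 settles on distinct low and high stable fixed points when three fixed points exist, and on one common value otherwise:

```python
import math

def f(x, r, H=8.0):
    # f(x, r) = 1 / (1 + exp((H/2) * (1 - 2*r*x)))
    return 1.0 / (1.0 + math.exp((H / 2.0) * (1.0 - 2.0 * r * x)))

def limit(x0, r, steps=2000):
    """Iterate f from seed x0 until (numerical) convergence."""
    x = x0
    for _ in range(steps):
        x = f(x, r)
    return x

for r in (1, 2, 4, 10):
    lo, hi = limit(0.0, r), limit(1.0, r)
    tag = "three fixed points" if hi - lo > 0.5 else "one fixed point"
    print(f"r = {r:2d}: low seed -> {lo:.4f}, high seed -> {hi:.4f} ({tag})")
```

For r = 1 and r = 2 the two seeds separate (three fixed points), while for r = 4 and r = 10 both seeds collapse onto a single high fixed point, matching the description of Figure 2.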
Lemma 3.3.4. If f(x, r) has three fixed points φ_f^−, φ_f^0, and φ_f^+, then the iteration converges to one of the stable fixed points:

lim_{t→∞} f^t = φ_f^− if f^0 < φ_f^0,   lim_{t→∞} f^t = φ_f^+ if f^0 > φ_f^0   (3.14)

The above lemma can be shown by defining an appropriate Lyapunov function L, showing that L has minima at φ_f^− and φ_f^+, and showing that f^t converges toward one of these minima. Notice that the fixed point φ_f^0 is unstable.

Lemma 3.3.5. Let f^0, f^1, f^2, ... denote the sequence computed by successive iteration of the function f. Then we have f^0 < f^1 < ... < φ_f, where φ_f is one of the stable fixed points of f(x, r).

With these properties, we can quantify the value H_0(r) such that for any H > H_0(r), f(x, r) has three fixed points. The low and high fixed points φ_f^− and φ_f^+ will be the bounds for low and high signals, respectively. The larger r is, the larger H must be chosen in order to guarantee the existence of three fixed points. If H is not chosen sufficiently large, then f^t converges to a unique fixed point 0.5 < φ_f < 1. The following lemma expresses a quantitative condition that guarantees the existence of three fixed points:

Lemma 3.3.6. The function f(x, r) = 1/(1 + e^((H/2)(1 − 2rx))) has three fixed points 0 < φ_f^− < φ_f^0 < φ_f^+ < 1 if H is chosen such that

H > H_0(r) = (2 (1 + (1 − x) log((1 − x)/x))) / (1 − x)

where x satisfies the equation

r = 1 / (2x (1 + (1 − x) log((1 − x)/x)))
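Lemma 3.3.6 gives a closed form for H_0(r); independently of that formula, the threshold can also be estimated numerically (our sketch; the 0.5 separation cutoff and the bisection ranges are heuristic choices of ours, not from the paper):

```python
import math

def f(x, r, H):
    return 1.0 / (1.0 + math.exp((H / 2.0) * (1.0 - 2.0 * r * x)))

def gap(r, H, steps=5000):
    # Distance between the limits of iteration started at x = 0 and x = 1.
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        lo, hi = f(lo, r, H), f(hi, r, H)
    return hi - lo

def H0(r, lo=0.1, hi=50.0, iters=50):
    # Bisect for the smallest H with two well-separated stable fixed points.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(r, mid) > 0.5:
            hi = mid
        else:
            lo = mid
    return hi

thresholds = {r: H0(r) for r in (1.5, 2, 4, 10)}
for r, h0 in thresholds.items():
    print(f"r = {r}: H0(r) ≈ {h0:.2f}")
```

The estimates grow with r, as the lemma predicts. For r = 1.5, the "average" indegree observed later in Section 5, the estimate comes out close to 6, roughly consistent with the empirically observed stability threshold H ≈ 6.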
The contour plots in Figure 3 show the relationship between H and x for various values of r. If H is chosen such that H > H_0(r), then three fixed points exist; otherwise, only a single fixed point exists. The number and location of the fixed points depend on the values of r and H. Thus, we write φ_f^−(r, H), φ_f^0(r, H), and φ_f^+(r, H) to denote the stable low, the unstable, and the stable high fixed point, respectively. We will use φ as a generic name for any fixed point of a function f. Similarly, we can quantify high signals in a constructed network:
Figure 3: Existence of fixed points: The contour plots of the equation h(x, r) = x (dotted graphs) show the relationship between H and x for various values of r. If H is chosen such that H > H_0(r) (solid graph), then a line parallel to the x-axis intersects the surface satisfying h(x, r) = x in three points, which are the fixed points of h(x, r).
Lemma 3.3.7. The high signals are bounded from below by the fixed point φ_g^+ of the function

g^0 = 1,   g^(t+1) = h(g^t − f^t, H)   (3.15)

This lemma is easily proven by induction if we assume the worst case for neuron state transitions low → high, in which the neuron receives no residual inputs that would strengthen the high signal. Notice that the above recurrence relation couples f^t and g^t, which makes it difficult, if not impossible, to find a function g(x, r) that, when iterated, gives the same values as g^t. However, we can bound the sequence g^0, g^1, g^2, ... from below by a recursively defined function p^t, i.e., ∀t: p^t ≤ g^t, which decouples g^t from f^t.
Lemma 3.3.8. Let φ_f denote the fixed point of the recursive function f, i.e., lim_{t→∞} f^t = φ_f. Then the recursively defined function p

p^0 = 1,   p^(t+1) = h(p^t − φ_f, H)   (3.16)

has the property that ∀t: p^t ≤ g^t.

This can be proven by induction on t. The graphs of the function p(x, u) for some values of u are shown in Figure 2. Lemmas 3.3.2 through 3.3.5 also apply to the function p(x, u).
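The bounding argument can be checked numerically (our sketch, using the recurrences as given in equations 3.13, 3.15, and 3.16; the parameters H = 8 and r = 2 are our choice of example):

```python
import math

H, r = 8.0, 2.0

def h(x):
    # h(x, H) = 1 / (1 + exp((H/2) * (1 - 2x)))
    return 1.0 / (1.0 + math.exp((H / 2.0) * (1.0 - 2.0 * x)))

# phi_f: fixed point of f^{t+1} = h(r * f^t), starting from f^0 = 0.
phi_f = 0.0
for _ in range(2000):
    phi_f = h(r * phi_f)

# Coupled iteration g^{t+1} = h(g^t - f^t) versus decoupled p^{t+1} = h(p^t - phi_f).
f_t, g_t, p_t = 0.0, 1.0, 1.0
for t in range(200):
    assert p_t <= g_t + 1e-12          # Lemma 3.3.8: p^t bounds g^t from below
    f_t, g_t, p_t = h(r * f_t), h(g_t - f_t), h(p_t - phi_f)

print(f"phi_f ≈ {phi_f:.4f}, g^200 ≈ {g_t:.4f}, p^200 ≈ {p_t:.4f}")
```

With H = 8 and r = 2, φ_f ≈ 0.03 and both g^t and p^t settle near 0.97, with p^t ≤ g^t throughout; the high signals stay well saturated.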
Lemma 3.3.9. Let the function p(x, u) have two stable fixed points, and let ∀t: p^t ≤ g^t. Then the function g(x, r) also has two stable fixed points.

Since we have decoupled the iterated function g^t from the iterated function f^t by introducing the iterated function p^t, we can apply the same technique for finding conditions for the existence of fixed points of p(x, u) as in the case of f^t. In fact, the function that, when iterated, generates the sequence p^0, p^1, p^2, ... is defined by

p(x, u) = 1 / (1 + e^((H/2)(1 − 2(x − u)))),   u = φ_f^−   (3.17)

Rewriting p(x, u) in the form of f(x, r) with parameters H′ and r′ gives

H′ = H (1 + 2 φ_f^−),   r′ = 1 / (1 + 2 φ_f^−)   (3.18)
Since we can iteratively compute the value of φ_f^− for given parameters H and r, we can repeat the original argument with H′ and r′ in place of H and r to find the conditions under which p(x, u), and thus g(x, r), has three fixed points. This results in the following lemma:

Lemma 3.3.10. The function p(x, u) = 1/(1 + e^((H/2)(1 − 2(x − u)))) has three fixed points 0 < φ_p^− < φ_p^0 < φ_p^+ < 1 if H is chosen such that

H > H_0^+(r) = (2 (1 + (1 − x) log((1 − x)/x))) / ((1 − x)(1 + 2 φ_f^−))

where x satisfies the equation

1 / (1 + 2 φ_f^−) = 1 / (2x (1 + (1 − x) log((1 − x)/x)))
3.4 Network Stability. We now define stability of recurrent networks constructed from DFAs:

Definition 3.4.1. An encoding of DFA states in a second-order recurrent neural network is called stable if all the low signals are less than φ_f^0(r, H) and all the high signals are greater than φ_g^0(r, H).
Consider equation 3.10. In order for a low signal to remain less than φ_f^−, the argument of h(·) must remain below the unstable fixed point for all values of t. Thus, we require the following invariant property of the residual inputs:

−H/2 + H r φ_f^− < −H/2 + H φ_f^0,   i.e.,   r φ_f^− < φ_f^0   (3.19)

where we assumed that all low signals have the same value and that their maximum value is the fixed point φ_f^−. This assumption is justified since the output of all state neurons with low values is initialized to zero. A similar analysis can be carried out for the state transitions of equation 3.11. The following inequality must be satisfied for stable high signals:

−H/2 + H φ_g^+ − H φ_f^− > −H/2 + H φ_g^0,   i.e.,   φ_g^+ − φ_f^− > φ_g^0   (3.20)

where we assumed that there is only one DFA transition δ(q_j, a_k) = q_i for the chosen q_i and a_k, and thus Σ_{S_l ∈ C_{j,k}} S_l^t = 0. Solving inequalities 3.19 and 3.20 for φ_f^− and φ_g^+, respectively, we obtain conditions under which a constructed recurrent network implements a given DFA. These conditions are expressed in the following theorem:

Theorem 3.4.1. For some given DFA M with n states and m input symbols, let r denote the maximum number of transitions to any state over all input symbols of M. Then a sparse recurrent neural network with n + 1 sigmoidal state neurons and m input neurons can be constructed from M such that the internal state representation remains stable if the following three conditions are satisfied:

(1) r φ_f^−(r, H) < φ_f^0(r, H)
(2) φ_g^+(r, H) − φ_f^−(r, H) > φ_g^0(r, H)   (3.21)
(3) H > max(H_0(r), H_0^+(r))

Furthermore, the constructed network has at most 3mn second-order weights with alphabet Σ_w = {−H, 0, +H}, n + 1 biases with alphabet Σ_b = {−H/2}, and maximum fan-out 3m. The number of weights and the maximum fan-out follow directly from the DFA encoding algorithm. Stable encoding of DFA states is a necessary condition for a neural network to implement a given DFA. The network must also correctly classify all strings. The conditions for correct string classification are expressed in the following corollary:
Corollary 3.1. Let L(M_DFA) denote the regular language accepted by a DFA M with n states, and let L(M_RNN) be the language accepted by the recurrent network constructed from M. Then we have L(M_RNN) = L(M_DFA) if

(1) φ_f^−(r, H) < 1/(2n)
(2) H > max(H_0(r), H_0^+(r))

Proof. For the case of an ungrammatical string, the net input to the response neuron S_0 must stay negative so that its output remains below 1/2:

−H/2 − H φ_g^+ + (n − 1) H φ_f^− < 0   (3.22)

where we have made the usual simplification about the convergence of the outputs to the fixed points φ_f^− and φ_g^+. Furthermore, we assume that the state q_i of the state transition δ(q_j, a_k) = q_i is the only rejecting state; then the output neuron's residual inputs from all other state neurons are positive, weakening the intended low signal for the network's output neuron. Notice that the output neuron is the only neuron that can be forced toward a low signal by neurons other than itself. A similar condition can be formulated for grammatical strings:

−H/2 + H φ_g^+ − (n − 1) H φ_f^− > 0   (3.23)

The above two inequalities can be simplified into a single inequality:

−2 H φ_g^+ + 2 (n − 1) H φ_f^− < 0   (3.24)

Observing that φ_g^+ > 1/2 and solving for φ_f^−, we get the following condition for the correct output of a network:

φ_f^− < 1 / (2(n − 1))   (3.25)

Thus, we have the following conditions for stable low signals and correct string classification:

r φ_f^− < φ_f^0 (stability),   φ_f^− < 1/(2(n − 1)) (classification)   (3.26)

We observe that choosing φ_f^− < 1/(2n) implies the classification condition 3.25. Substituting 1/(2n) for φ_f^− in inequality 3.26 yields condition (1) of the corollary. □
Figure 4: Randomly generated 100-state DFA: The minimal DFA has 100 states and alphabet Σ = {0, 1}. State 1 is the start state. States with and without double circles are accepting and rejecting states, respectively.

4 Experiments

4.1 Simulation Results. To empirically validate our analysis, we constructed networks from randomly generated DFAs with 10, 100, and 1000 states. For each of the three DFAs, we randomly generated different test sets, each consisting of 1000 strings of length 10, 100, and 1000, respectively. The randomly generated, minimized 100-state DFA with alphabet Σ = {0, 1} that we encoded into a recurrent network with 101 state neurons is shown in Figure 4. The networks' generalization performance on these test sets for rule strengths H = {0.0, 0.1, 0.2, ..., 7.0} is shown in Figures 5-7. A misclassification of these long strings indicates a network's failure to maintain the stable finite-state dynamics that was encoded. However, we observe that the networks can implement stable DFAs, as indicated by the perfect generalization performance for some
[Figure 5 plot: RNN Encoding of 10-state DFA]
Figure 5: Performance of 10-state DFA: The network classification performance on three randomly generated data sets consisting of 1000 strings of length 10, 100, and 1000, respectively, as a function of the rule strength H (in 0.1 increments) is shown. The network achieves perfect classification on the strings of length 1000 for H > 6.0.

choice of the rule strength H and chosen test set. Thus, we have empirical evidence that supports our analysis. All three networks achieve perfect generalization for all three test sets at approximately the same value of H. Apparently, network size plays an insignificant role in determining the value of H at which the internal DFA representation becomes stable, at least across the three orders of magnitude of network sizes considered.
4.2 Discussion. In our simulations, few neurons ever exceeded or fell below the fixed points φ⁻ and φ⁺, respectively. Furthermore, the network has a built-in reset mechanism that allows low and high signals to be strengthened. Low signals S_i^t are strengthened to h(0, H) when there exists no state transition δ(·, a_k) = q_i. In that case, the neuron S_i^t receives no inputs from any of the other neurons; its output becomes less than φ⁻, since h(0, H) < φ⁻ for H > 4. Similarly, high signals S_i^t get strengthened if either the low signals feeding into neuron S_i on a current state transition δ(q_j, a_k) = q_i have been strengthened during the previous time step, or the number of positive residual inputs to neuron S_i compensates for a weak high signal from neuron S_j. Thus, only a small number of neurons will have S_i^t > φ⁻ or S_i^t < φ⁺. For the majority of neurons, we have S_i^t ≤ φ⁻ or S_i^t ≥ φ⁺. Since constructed networks are able to regenerate their internal signals, and since typical DFAs do not have the worst case properties assumed in this analysis, the conditions guaranteeing stable low and high signals are generally much too strong for a given DFA.

[Figure 6 plot: RNN Encoding of 100-state DFA]

Figure 6: Performance of 100-state DFA: The network classification performance on three randomly generated data sets consisting of 1000 strings of length 10, 100, and 1000, respectively, as a function of the rule strength H (in 0.1 increments) is shown. The network achieves perfect classification on the strings of length 1000 for H > 6.2.

5 Scaling Issues

5.1 Preliminaries. The worst case analysis in Section 3 supports the following predictions about the implementation of arbitrary DFAs:
[Figure 7 plot: RNN Encoding of 1000-state DFA]
Figure 7: Performance of 1000-state DFA: The network classification performance on three randomly generated data sets consisting of 1000 strings of length 10, 100, and 1000, respectively, as a function of the rule strength H (in 0.1 increments) is shown. The network achieves perfect classification on the strings of length 1000 for H > 6.1.

1. Neural DFAs can be constructed that are stable for arbitrary string lengths for a finite value of the weight strength H.

2. For most neural DFA implementations, network stability is achieved for values of H that are smaller than the values required by the conditions in Theorem 3.4.1.

3. The value of H scales with the DFA size; i.e., the larger the DFA and thus the network, the larger H must be for guaranteed stability.
Predictions (1) and (2) are supported by our experiments. However, when we compare the values of H in the above experiments for DFAs of different sizes, we find that H ≈ 6 for all three DFAs. This observation seems inconsistent with the theory. The reason for this inconsistency lies in the assumption of a worst case in the analysis, whereas the DFAs we implemented represent average cases. For the randomly generated 100-state DFA, we found correct classification of strings of length 1000 for H = 6.3. This value corresponds to a DFA whose states
[Figure 8 plot: minimum weight strength H vs. maximum indegree]
Figure 8: Scaling weight strength: An accepting state q,, in 10 randomly generated 100-stateDFAs was selected. The number of states q1 for which 6(q,. 0) = qp was gradually increased in increments of 5% of all DFA states. The graph shows the minimum value of H for correct classification of 100 strings of length 100. H increases up to p = 75%; for p > 75%, the DFA becomes degenerated causing H to decrease again. have 'average' indegree r = 1.5. [The magic value 6 also seems to occur for networks which are trained. Consider a neuron S;; then, the weight that causes transitions between dynamical attractors often has a value = 6 (Tino 1994).] However, there exist DFAs that exhibit the scaling behavior that is predicted by the theory. We will briefly discuss such DFAs. That discussion will be followed by an analysis of the condition for stable DFA encodings for asymptotically large DFAs. 5.2 DFA States with Large Indegree. We can approximate the worst case analysis by considering an extreme case of a DFA:
(1) Select an arbitrary DFA state q_p; (2) select a fraction p of states q_j and set δ(q_j, a_k) = q_p.
Christian W. Omlin and C. Lee Giles
(3) For low values of p, a constructed network behaves similarly to a randomly generated DFA. (4) As the number of states q_j for which δ(q_j, a_k) = q_p increases, the behavior gradually moves toward the worst case analysis, where one neuron receives a large number of residual inputs for a designated input symbol a_k.

We constructed a network from a randomly generated DFA M_0 with 100 states and two input symbols. We derived DFAs M_{p_1}, M_{p_2}, ..., where the fraction of DFA states q_j with δ(q_j, a_k) = q_p increased by Δp from one DFA to the next; for our experiments, we chose Δp = 0.05. Obviously, the languages L(M_{p_i}) change for different values of p_i. The graph in Figure 8 shows, for 10 randomly generated DFAs with 100 states, the minimum weight strength H necessary to correctly classify 100 strings of length 100 (a new data set was randomly generated for each DFA) as a function of p in 5% increments. We observe that H generally increases with increasing values of p; in all cases, the hint strength H sharply declines for some percentage value p. As the number of connections +H to a single state neuron S_i increases, the number of residual inputs that can cause unstable internal DFA representation and incorrect classification decreases. Let us assume that the extreme DFA state q_p is an accepting state. Then, the input to the output neuron is

(5.1)

For correct classification, the net input must be larger than 0.5. As the value of p increases, the numbers of terms in the first and second sums increase and decrease, respectively. Thus, smaller values of H lead to correct string classification. A similar argument can be made if q_p is a rejecting state. We observe that there are two runs where outliers occur, i.e., H_{p_i} > H_{p_{i+1}} even though p_i < p_{i+1}. Since the value H_p depends on the randomly generated DFA, the choice of q_p, and the test set, we can expect such uncharacteristic behavior to occur in some cases.

5.3 Asymptotic Case Analysis.
We are interested in finding an expression for the average number of residual inputs to a neuron in large DFAs. Since we are dealing with a second-order network architecture, disjoint parts of the network participate in the computation of the next state for any given input symbol. Thus, we can limit our analysis to DFAs with a single input symbol. Consider a DFA M and its underlying graph G(V, E) whose vertices V and directed edges E are the DFA states Q and state transitions δ, respectively. We assume that G(V, E) is randomly generated: for any given vertex v_i, a directed edge e_i is drawn to another vertex v_j with equal probability 1/n for all vertices of G. The number of directed edges entering
any given vertex v_i from other vertices v_m is the number of residual inputs state neuron S_i receives from other state neurons S_m. Thus we need to compute only the expected number of incoming edges ("in-degree") for a DFA generated according to the above probability distribution. The probability p(d = k) for a vertex to have in-degree k follows a binomial distribution; thus, the average in-degree is given by the expected value of k, which can be written as

E{d} = Σ_{k=1}^{n} k (n choose k) p^k (1 - p)^{n-k}

For n → ∞ and λ = np, where p is the probability that an event occurs (in our case we have p = 1/n and thus λ = 1) and p → 0, the binomial distribution asymptotically converges toward the Poisson distribution:

E{d} = Σ_{k=1}^{∞} k e^{-λ} λ^k / k!

With λ = 1, we conclude

E{d} = e^{-1} Σ_{k=1}^{∞} k / k! = e^{-1} Σ_{k=0}^{∞} 1 / k! = e^{-1} · e = 1
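A quick Monte Carlo check of this Poisson limit (a sketch; the DFA size, trial count, and random seed below are arbitrary choices, not taken from the paper):

```python
import math
import random

def indegree_distribution(n=1000, trials=200, kmax=5, seed=0):
    """Empirical in-degree distribution of a random single-symbol DFA:
    each of the n states draws its one outgoing transition uniformly,
    so a state's in-degree is Binomial(n, 1/n), approximately Poisson(1)."""
    rng = random.Random(seed)
    counts = [0] * (kmax + 1)
    for _ in range(trials):
        indeg = [0] * n
        for _ in range(n):                 # one transition per state
            indeg[rng.randrange(n)] += 1
        for d in indeg:
            if d <= kmax:
                counts[d] += 1
    return [c / (n * trials) for c in counts]

empirical = indegree_distribution()
poisson = [math.exp(-1) / math.factorial(k) for k in range(6)]
for k, (e, p) in enumerate(zip(empirical, poisson)):
    print(f"k={k}: empirical {e:.4f}  Poisson {p:.4f}")
```

The empirical frequencies match e^{-1} λ^k / k! with λ = 1, and the mean in-degree is 1.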
The lemmas of section 3.3 simplify for the case r = 1 as follows (Omlin and Giles 1995):

Lemma 5.3.1. For 0 < H < 4, h(x, H) has the following fixed point: φ⁰ = 0.5. Furthermore, h(x, H) converges to φ⁰ for any choice of a start value x_0.

Lemma 5.3.2. For H ≥ 4, h(x, H) has three fixed points φ⁰ = 0.5, φ⁻, and φ⁺.

Lemma 5.3.3. For x < φ⁰ (φ⁰ < x), h(x, H) (with H > 4) converges to φ⁻ (φ⁺).

Lemma 5.3.4. For arbitrary H > 4, the two fixed points φ⁻ and φ⁺ are related as follows: φ⁻ = 1 - φ⁺.
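The fixed-point structure stated in the lemmas can be illustrated numerically. The sketch below assumes the sigmoid discriminant h(x, H) = 1/(1 + e^{-H(x - 1/2)}), a form consistent with the lemmas (one fixed point for H < 4, three for H > 4, symmetry of φ⁻ and φ⁺ about 1/2); the exact h of section 3.3 is not reproduced in this excerpt:

```python
import math

def h(x, H):
    """Assumed sigmoid discriminant with gain H and midpoint 1/2."""
    return 1.0 / (1.0 + math.exp(-H * (x - 0.5)))

def fixed_point(H, x0, iters=200):
    """Iterate x -> h(x, H) to its attracting fixed point."""
    x = x0
    for _ in range(iters):
        x = h(x, H)
    return x

# H < 4: every start value is attracted to the single fixed point 0.5
print(round(fixed_point(2.0, 0.01), 6))   # -> 0.5
print(round(fixed_point(2.0, 0.99), 6))   # -> 0.5

# H > 4: low and high start values converge to phi- and phi+
lo = fixed_point(8.0, 0.01)
hi = fixed_point(8.0, 0.99)
print(round(lo, 4), round(hi, 4))         # phi- and phi+
print(round(lo + hi, 6))                  # -> 1.0, i.e. phi- = 1 - phi+
```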
We can now state the following asymptotic result for the construction of large DFAs:

Theorem 5.3.1. Let n denote the number of states in a DFA. For n → ∞, the languages accepted by the DFA M and the constructed neural network are identical only for H → ∞.
Proof. Recall the worst case equations 3.10 and 3.11 for state transitions of type low → low and low → high. For the asymptotic case n → ∞, these equations simplify. The worst case equations of section 3.2 apply here also; however, the residual inputs are zero. Thus the following two conditions for stable low and high signals, respectively, must be satisfied:

-H/2 + H φ⁻ < 1/2    (5.2)

and

-H/2 + H φ⁺ > 1/2    (5.3)

Combining these two inequalities and solving for φ⁻ leads to the condition φ⁻ < 1/2. Thus, we have the following conditions for stability of the finite-state dynamics and correct string classification in asymptotically large DFAs:

stability of signals:  φ⁻(H) < 1/2

correct string classification:  φ⁻(H) < 1/(2n)

Unlike in the worst case analysis for partially recurrent networks, the condition for correct string classification dominates the condition for stable finite-state dynamics. As a matter of fact, stable finite-state dynamics alone does not require H → ∞; however, correct string classification requires φ⁻ → 0 and thus H → ∞ for n → ∞.
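A numerical sketch of the theorem, under the same assumed sigmoid discriminant h(x, H) = 1/(1 + e^{-H(x - 1/2)}) as above: the smallest H satisfying the classification condition φ⁻(H) < 1/(2n) grows without bound as n grows (the bisection bounds and iteration counts are arbitrary choices):

```python
import math

def h(x, H):
    # assumed sigmoid discriminant, gain H, midpoint 1/2
    return 1.0 / (1.0 + math.exp(-H * (x - 0.5)))

def phi_minus(H, iters=500):
    """Low fixed point of h for H > 4 (iterate from a low start value)."""
    x = 0.0
    for _ in range(iters):
        x = h(x, H)
    return x

def min_H_for(n, hi=100.0):
    """Bisect for the smallest H with phi-(H) < 1/(2n), the asymptotic
    correct-string-classification condition."""
    lo = 4.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if phi_minus(mid) < 1.0 / (2 * n):
            hi = mid
        else:
            lo = mid
    return hi

for n in (10, 100, 1000, 10000):
    print(n, round(min_H_for(n), 2))   # required H grows with n
```

Since φ⁻(H) ≈ e^{-H/2} for large H, the required H grows roughly like 2 ln(2n); for small n this lands near the empirically observed H ≈ 6.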
6 Conclusion
We investigated how deterministic finite-state automata (DFAs) can be encoded into sparse second-order recurrent neural networks. The operation performed by the second-order architecture is akin to DFA state transitions, making DFA encoding a straightforward operation. We have proven that our algorithm can construct a sparse recurrent network with O(n) state neurons, O(mn) weights, and limited fan-out of size O(m) from any DFA with n states and m input symbols such that the DFA and the constructed network accept the same regular language. The DFA dynamics is achieved by programming some of the weights to values +H or -H. A worst case analysis has revealed a quantitative relationship between the rule strength H and the maximum allowed network size such that the network dynamics remains robust for arbitrary string length. This is only a proof of existence, i.e., we do not make any claims that such a solution can be learned.
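The flavor of such an encoding can be sketched as follows. This is an illustrative toy construction, not the paper's exact algorithm: the weight placement (+H into the target state neuron, -H into the source state neuron, a -H/2 bias on every neuron) and all names are assumptions made for this example:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def encode_dfa(delta, H=8.0):
    """Toy sparse second-order encoding: for each transition
    delta[(j, k)] = i, weight W[(i, j, k)] = +H drives the target state
    neuron high and W[(j, j, k)] = -H drives the source neuron low."""
    W = {}
    for (j, k), i in delta.items():
        W[(i, j, k)] = W.get((i, j, k), 0.0) + H
        if i != j:
            W[(j, j, k)] = W.get((j, j, k), 0.0) - H
    return W

def step(state_vec, symbol, W, n_states, H=8.0):
    """One second-order update: S_i <- g(-H/2 + sum_j W[i,j,k] S_j)."""
    nxt = []
    for i in range(n_states):
        net = -H / 2
        for j in range(n_states):
            net += W.get((i, j, symbol), 0.0) * state_vec[j]
        nxt.append(sigmoid(net))
    return nxt

# toy 2-state DFA over {0, 1}: the state flips on symbol 1, stays on 0
delta = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
W = encode_dfa(delta)
s = [1.0, 0.0]                      # start in state 0 (high/low coding)
for sym in [1, 1, 0, 1]:            # odd number of 1s
    s = step(s, sym, W, 2)
print([round(v, 2) for v in s])     # state neuron 1 high, neuron 0 low
```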
Our empirical results suggest that the weight strength H ≈ 6 is independent of the network size for typical DFAs. Extreme DFAs can be constructed for which the weight strength scales with the network size. The stability analysis presented in this paper can be extended to the case where DFAs are embedded into fully recurrent networks. This may be desirable in the case where a network is initialized with partial prior knowledge to be refined through learning on training data. In that case, the weights that are not programmed to -H, -H/2, and +H are initialized to small random values drawn according to some distribution from an interval [-W, W]. Then, the significance of stable DFA encoding is that the parameters H and W can be chosen such that the prior knowledge is not destroyed by the presence of the randomly initialized weights. It would be an interesting question to investigate whether a denser binary representation of DFA states is possible, thus requiring fewer than n + 1 state neurons to encode a DFA with n states in a second-order recurrent neural network. We hypothesize that dense neural DFA construction is an NP-complete problem (Hopcroft and Ullman 1979).
Acknowledgments We would like to acknowledge useful discussions with D. Handscomb (Oxford University Computing Laboratory), B. G. Horne (NEC Research Institute), and helpful comments from the reviewers.
References
Alon, N., Dewdney, A., and Ott, T. 1991. Efficient simulation of finite automata by neural nets. J. Assoc. Computing Mach. 38(2), 495-514.
Alquezar, R., and Sanfeliu, A. 1995. An algebraic framework to represent finite state machines in single-layer recurrent neural networks. Neural Comp. 7(5), 931.
Elman, J. 1990. Finding structure in time. Cog. Sci. 14, 179-211.
Frasconi, P., Gori, M., Maggini, M., and Soda, G. 1991. A unified approach for integrating explicit knowledge and learning by example in recurrent networks. Proc. Int. Joint Conf. Neural Networks 1, 811. IEEE 91CH3049-4.
Frasconi, P., Gori, M., and Soda, G. 1993. Injecting Nondeterministic Finite State Automata into Recurrent Networks. Tech. Rep., Dipartimento di Sistemi e Informatica, Universita di Firenze, Florence, Italy.
Giles, C., and Omlin, C. 1993. Extraction, insertion and refinement of symbolic rules in dynamically driven recurrent neural networks. Connect. Sci. 5(3 & 4), 307-337.
Giles, C., Miller, C., Chen, D., Chen, H., Sun, G., and Lee, Y. 1992. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Comp. 4(3), 380.
Gori, M., Maggini, M., and Soda, G. 1996. Insertion of finite state automata in recurrent radial basis function networks. Machine Learning, in press.
Hopcroft, J., and Ullman, J. 1979. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, MA.
Horne, B., and Hush, D. 1994. Bounds on the complexity of recurrent neural network implementations of finite state machines. In Advances in Neural Information Processing Systems 6, J. Cowan, G. Tesauro, and J. Alspector, eds., pp. 359-366. Morgan Kaufmann, San Mateo, CA.
Minsky, M. 1967. Computation: Finite and Infinite Machines, Chap. 3, pp. 32-66. Prentice-Hall, Englewood Cliffs, NJ.
Omlin, C., and Giles, C. 1995. Constructing Deterministic Finite-State Automata in Sparse Recurrent Neural Networks. Technical Report UMIACS-TR-90 and CS-TR-3460, University of Maryland, College Park, MD.
Pollack, J. 1991. The induction of dynamical recognizers. Machine Learning 7, 227-252.
Servan-Schreiber, D., Cleeremans, A., and McClelland, J. 1991. Graded state machine: The representation of temporal contingencies in simple recurrent networks. Machine Learning 7, 161.
Tino, P. 1994. Private communication.
Watrous, R., and Kuhn, G. 1992. Induction of finite-state languages using second-order recurrent networks. Neural Comp. 4(3), 406.
Zeng, Z., Goodman, R., and Smyth, P. 1993. Learning finite state machines with self-clustering recurrent networks. Neural Comp. 5(6), 976-990.
Received September 26, 1994; accepted March 15, 1995
Communicated by Steven Nowlan
A Theory of the Visual Motion Coding in the Primary Visual Cortex Zhaoping Li Computer Science Department, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
This paper demonstrates that much of visual motion coding in the primary visual cortex can be understood from a theory of efficient motion coding in a multiscale representation. The theory predicts that cortical cells can have a spectrum of directional indices, be tuned to different directions of motion, and have spatiotemporally separable or inseparable receptive fields (RF). The predictions also include the following correlations between motion coding and spatial, chromatic, and stereo codings: the preferred speed is greater when the cell receptive field size is larger, the color channel prefers lower speed than the luminance channel, and both the optimal speeds and the preferred directions of motion can be different for inputs from different eyes to the same neuron. These predictions agree with experimental observations. In addition, this theory makes predictions that have not been experimentally investigated systematically and provides a testing ground for an efficient multiscale coding framework. These predictions are as follows: (1) if nearby cortical cells of a given preferred orientation and scale prefer opposite directions of motion and have a quadrature RF phase relationship with each other, then they will have the same directional index, (2) a single neuron can have different optimal motion speeds for opposite motion directions of monocular stimuli, and (3) a neuron’s ocular dominance may change with motion direction if the neuron prefers opposite directions for inputs from different eyes. 1 Introduction
Neural Computation 8, 705-730 (1996)
© 1996 Massachusetts Institute of Technology

Primary visual cortical cells sensitive to motion and selective to motion directions have been observed physiologically since the works of Hubel and Wiesel (1959, 1962). Simple cells are found to be tuned to directions of motion to various degrees in addition to their selectivities to orientation, spatial frequency, ocular origin, etc. (Holub and Morton-Gibson 1981; Foster et al. 1985; Reid et al. 1991; DeAngelis et al. 1994). This paper demonstrates that many of the motion-sensitive/directional-selective properties in cortical simple cells can be understood as consequences of efficient coding of visual inputs in a multiscale framework. Such an understanding provides detailed predictions of the simple cell spatiotemporal receptive field (RF) properties. These predictions can be compared with known observations or experimentally tested.

Efficiency of information representation has long been advocated as the coding principle for early stages of sensory processing (Barlow 1961). This is because natural signals have structures and regularities. Visual inputs, for example, have correlated signals in image pixels, making some input signals largely predictable from others. Such regularities make pixel-by-pixel input representation highly redundant or inefficient, in the sense that the same information is signaled wastefully many times through different neural channels. An efficient code with reduced redundancy not only gives coding and neural implementation economy, but also arguably provides cognitive advantages (Barlow 1961) due to the knowledge of input statistics, which has to be inherent in the code to reduce the redundancy. One of the most noticeable visual input redundancies is the pairwise pixel-pixel correlations. Concentrating on such redundancy, several recent works have formulated efficient coding in the language of information theory or decorrelation/factorial codes modified appropriately under noise (Srinivasan et al. 1982; Linsker 1989; Atick and Redlich 1990; Bialek et al. 1991; Nadal and Parga 1993). In particular, efficient coding has provided a theory of retinal processing and predicted the spatiochromatic receptive fields of the retinal ganglion cells agreeing with those observed physiologically (Srinivasan et al. 1982; Atick and Redlich 1990; Atick et al. 1992). There are other types of regularities in natural images that we believe the visual system beyond the retina takes advantage of. One such regularity is translation and scale invariance, namely, the image of an object at one location or distance can predict much of the image of the same object at another location or distance.
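The decorrelation idea behind these works can be sketched numerically. Below, an assumed Gaussian pixel-pixel correlation matrix R stands in for natural-image statistics (the correlation scale, dimension, and sample count are arbitrary choices); the whitening transform K = R^{-1/2} produces outputs whose covariance is the identity, i.e., a decorrelated and hence more efficient code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed pixel-pixel correlation matrix R: nearby pixels are more
# correlated (Gaussian falloff; the scale 2.0 is an arbitrary choice)
n = 8
x = np.arange(n)
R = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 2.0**2)

# Whitening (decorrelating) transform K = R^(-1/2), built from the
# symmetric eigendecomposition of R
evals, evecs = np.linalg.eigh(R)
K = evecs @ np.diag(evals ** -0.5) @ evecs.T

S = rng.multivariate_normal(np.zeros(n), R, size=200_000)  # input samples
O = S @ K.T                                                # efficient code
cov_O = np.cov(O.T)
print(np.round(cov_O, 2))   # close to the identity: outputs decorrelated
```

Any further unitary rotation U K of this transform yields an equally decorrelated code, which is the freedom exploited later in the paper.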
It has recently been proposed that one of the preprocessing goals of the early visual cortex is, without compromising coding efficiency, to produce a representation where actions of translation and scaling are manifested or factored out toward object invariance (Li and Atick 1994a). The resulting, so-called multiscale, representation remaps the visual field into multiple retinotopic maps identical in all respects except for the densities and RF sizes of their sampling nodes. This representation is also a step toward redundancy reduction when it is followed by attentional mechanisms that compensate the manifested translation and scaling changes to produce object-invariant neural activity patterns (Li and Atick 1994a,b). Efficient coding in the multiscale representation has predicted many of the simple cell RF properties in the spatial, chromatic, and stereo domains (Li and Atick 1994a,b). These predictions include the simple cell selectivities to orientation, spatial frequency, color, ocular origin, and disparity, as well as the particular frequency tuning bandwidth, phase quadrature structure between neighboring cells, and spatiochromatic-stereo interactions in cell
Visual Motion Coding in Visual Cortex
selectivities observed experimentally. The theoretical understanding further aided the study of the visual system by motivating experimental tests of some predictions that had not been investigated experimentally (Li 1995; Anzai et al. 1994). However, the temporal input dimension was ignored in these earlier theoretical works (Li and Atick 1994a,b; Li 1995). The current work demonstrates that including the temporal dimension enables the same framework to additionally predict simple cell motion sensitivities and directional selectivities that have been observed or can be tested experimentally. The primary visual cortex is likely to have other functions in mind in addition to the goal of efficiency and invariance. Previous works (Li and Atick 1994a,b; Li 1995) did not take into account other possible cortical functions and were limited to only linear coding mechanisms as approximations. This necessarily led to unexplained cortical phenomena and quantitative disagreements between reality and theoretical predictions (see discussion). Extending the previous approach to motion coding, the current work has the same limitations. However, it helps to explore the potential and limitations of the efficient coding framework and provide a testing ground for it with additional predictions that have not yet been experimentally investigated. Various neurophysiological, psychophysical, and computational motion models have been proposed (e.g., Reichardt 1961; Torre and Poggio 1978; Marr and Ullman 1981; van Santen and Sperling 1984; Adelson and Bergen 1985; Watson and Ahumada 1985). They are mostly designed to model the neuronal mechanisms underlying directionality or to provide computational algorithms for visual motion detection and computation. Some of them (e.g., Reichardt 1961; Torre and Poggio 1978; Marr and Ullman 1981; van Santen and Sperling 1984) have highly nonlinear components at an early stage, either to ensure directionality or to compute motion velocity. 
Physiological observations, however, reveal essentially linear mechanisms underlying directionality in simple cells (Reid et al. 1991; Jagadeesh et al. 1993). The motion models of Adelson and Bergen (1985) and Watson and Ahumada (1985) do include linear components before a later-stage nonlinearity and are designed for motion sensing or detection within the constraints of known physiological and psychophysical observations. The current work derives the motion coding using a linear mechanism from the requirement of an efficient multiscale representation, without a priori specifying the purpose of visual motion computation or selectivity. Its predictions include some that have not been experimentally investigated, in addition to ones that agree with known observations. A special case from the derivations will be shown to resemble the linear components in the models of Adelson and Bergen (1985) and Watson and Ahumada (1985). The next section presents the theoretical formulation of the efficient motion coding in the multiscale representation. Section 3 explores the predicted RFs and correlations between motion coding and codings in
the space, color, and stereo domains, to compare them with experimental observations. Section 4 summarizes the results and discusses the limitations and desired experimental tests of the theory. 2 Efficient Motion Coding in a Multiscale Representation
Visual input is inefficient because the input S(x, t), assumed to be of zero mean for simplicity, at retina location x and time t is correlated with S(x', t') by the amount

R_{x,t;x',t'} ≡ ⟨S(x, t) S(x', t')⟩    (2.1)

where ⟨·⟩ denotes average over inputs of the visual environment. Without loss of generality, the retina is taken as one-dimensional. Visual inputs are assumed to be statistically translation invariant and reflection symmetric, such that R_{x,t;x',t'} = R_{x+a,t+τ;x'+a,t'+τ} = R(x - x', t - t') = R[±(x - x'), ±(t - t')]. Then R can also be characterized by its Fourier transform¹

R(f, ω) = (1/2πN) Σ_x ∫_{-∞}^{∞} dt R(x, t) e^{-ifx - iωt} = R(±f, ±ω),

which is also the average input power at frequency (f, ω). Here N is the total number of input units covering a visual space x ∈ (0, N) with unit grid spacing. Under noiseless conditions, a more efficient code O(j, t) can be constructed within the linear coding scheme by a transform

O(j, t) = Σ_x ∫_{-∞}^{∞} dt' K(j, t; x, t') S(x, t')

such that the outputs are decorrelated, ⟨O(j, t) O(j', t')⟩ = δ_{jj'} δ(t - t'). If higher order input correlations are ignored, such decorrelated outputs O(j, t) imply that no information is redundantly sent through different output units or at different times. The code O(j, t) is thus efficient. One should note that we require both spatial and temporal decorrelation, in contrast to the mere spatial decorrelation when the temporal dimension was ignored (Li and Atick 1994a). The temporal dimension cannot be treated like the spatial dimension because of causality. In addition, visual object scale invariance does not extend from space to time. Accordingly, the multiscale coding, which is necessitated in the spatial domain by the scale invariance (Li and Atick 1994a), has no a priori reason to be applied temporally.² A special efficient code O_j is obtained by passing S(x, t) through a spatial filter K^s_{f_j}, to achieve spatial decorrelation, and a temporal filter K^t_{f_j},

¹The same symbol R is used for the correlation function R(x, t) as well as its Fourier transform R(f, ω). The arguments (x, t) or (f, ω) specify the actual function concerned. Such practices are used throughout the paper for some other functions and variables as well, to avoid proliferation of notation.

²There may be a posteriori reasons for multiscale in time, for instance, to compute motion velocity (Grzywacz and Yuille 1990; see Section 3). However, at least in the primary visual cortex, the selectivity to temporal scale is much poorer (Holub and Morton-Gibson 1981; Foster et al. 1985) than that to spatial scale.
to achieve temporal decorrelation:

S(x, t) → S(f_j, t) = Σ_x K^s_{f_j}(x) S(x, t) → O(j, t) = ∫_{-∞}^{∞} dt' K^t_{f_j}(t - t') S(f_j, t')    (2.2)

with

K^s_{f_j}(x) = (1/√N) e^{-i f_j x}    (each j has a different spatial frequency f_j)    (2.3)

K^t_{f_j}(t - t') = (1/2π) ∫_{-∞}^{∞} dω R^{-1/2}(f_j, ω) e^{-iω(t - t') - iφ(f_j, ω)}    (2.4)

K_{f_j} ≡ K(j, t; x, t') = K^t_{f_j}(t - t') K^s_{f_j}(x)    (2.5)

where φ(f_j, ω) = -φ(f_j, -ω) is chosen such that the temporal filter K^t_{f_j}(t - t') is causal [K^t_{f_j}(t < 0) = 0] and has minimum temporal spread³ for each j. Decorrelation in O(j, t) can be verified as follows. The signal S(f_j, t) is the spatial Fourier transform of S(x, t) for spatial frequency f_j. Accordingly, S(f_j, t) and S*(f_{j'}, t'), for f_j ≠ f_{j'}, are decorrelated from each other in a translationally invariant system:

⟨S(f_j, t) S*(f_{j'}, t')⟩ = R(f_j, t - t') δ_{jj'}    (2.6)

where the superscript asterisk denotes complex conjugation. Each S(f_j, t) is a temporally correlated signal, which is temporally decorrelated by the transform S(f_j, t) → O(j, t) = ∫_{-∞}^{∞} dt' K^t_{f_j}(t - t') S(f_j, t') with the temporal whitening filter K^t_{f_j}, whose Fourier transform has an amplitude R^{-1/2}(f_j, ω):

⟨O(j, t) O*(j, t')⟩ = (1/2π) ∫_{-∞}^{∞} dω e^{iω(t - t')} ∝ δ(t - t')    (2.7)

The spatiotemporal RF K_{f_j} = K^s_{f_j} K^t_{f_j} for this efficient code O_j = K_{f_j} S is, however, not spatially local or retinotopic, simply because K^s_{f_j} contains a

³Define A_{f_j}(t) = (1/2π) ∫_{-∞}^{∞} dω R^{-1/2}(f_j, ω) e^{-iωt - iφ(f_j, ω)}, and take envelope(t) and phase(t) as the amplitude and phase of A_{f_j}(t); then K^t_{f_j}(t) = envelope(t) cos[phase(t)]. The minimum temporal spread of K^t_{f_j} is attained when ∫ dt (t - t̄)² envelope(t), where t̄ = ∫ dt t envelope(t), is minimum.
spatial Fourier wave K^s_{f_j}, which is nonlocal.⁴ In addition, K^s_{f_j} is unique for each output j with unique frequency f_j, requiring a unique RF for each output cell. However, other efficient codes can be constructed from this one (Li and Atick 1994a) by any unitary transform U [where the boldfaced U denotes a matrix, U U† = 1, and U† = (U*)ᵀ], with O_j → Σ_{j'} U_{jj'} O_{j'} and K_{f_j} → Σ_{j'} U_{jj'} K_{f_{j'}}. Decorrelation is preserved in the new code since the unitarity of U carries ⟨O_j O*_{j'}⟩ ∝ δ_{jj'} over to the transformed outputs. As noted in the introduction, it was proposed that the goal of the cortex is to construct a multiscale representation, which is also spatially local, retinotopic, and translationally invariant, in the sense that the RF of each cell is the same as that of many other cells in the same scale except for the RF center locations. An efficient code of this multiscale nature is achieved (see Li and Atick 1994a for details) by combining the original filters K_{f_j} or outputs O(j, t) within each frequency band f^a < |f_j| ≤ f^{a+1} = 3f^a by a unitary transform U^a in that band: (2.8) (2.9) where a indicates the spatial scale or frequency band, and K^a_n is the spatiotemporal RF for the nth output unit in that scale, n = 1, 2, ..., N^a ∝ (f^{a+1} - f^a). As is shown in the Appendix, the general spatiotemporal receptive fields of such nature are
where R indicates the spatial scale or frequency band, and KY,is the spatiotemporal RF for the nth output unit in that scale, n = 1,2,. . . .No K Cfn+’ -f”). As is shown in the Appendix, the general spatiotemporal receptive fields of such nature are K:(x;f
-
t’)
LrndwKCf. LJ)(A+COS[~(X) + d(f)]
K f”
+ A- cos[4(x) =
-
d(t)l)
(2.10)
C J , d d K ( f ,w){(A+ + A-) cos[$(x)]cos[@(f)]
f “
+ (A-
-
A’) sin[@(x)] sin[d(t)]}
(2.11)
with
4(x)
=
f(x%- x) - 7rrz/2
+
+@ +4
$ ( t ) = w(t- f’) 4(f,w) z w ( t - f’ - fP) df
(A+&,
&)
=
+
(2.12)
(A:, A,, @, 4:,) if n is even (A;,A:, @, &,) if n is odd
(2.13) (2.14)
⁴The RFs for this code are not real, but this representation is used for mathematical convenience and it does not affect the final results.
where Kcf. d ) = Xp1/*cf. d) denotes the spatiotemporal sensitivity of the filters, (A+)'+ (A-)2 = 1, xf, = (N/N")iz or xf,= (N/N')(!z IZ mod 2 ) is the RF center' in the unit of the visual input grid size, and t,, > 0 approximates the filter latency, which is determined by p(j.d). The five parameters (A: .At:. o*.($,.c$) specify the RFs for all neural units IZ = 1 . 2 . . . . . N",and different choices of them give different, but equivalently efficient, coding representations. The RF centers xr, of the neural units IZ = 1 . 2 . . . . . N" in this scale II are distributed over the whole visual space x E (0.N) with the Nyquist sampling rate-the number of neurons N" in this scale is proportional to the bandwidth f"+'- f", with two neurons covering every two sampling periods 2N/N"(Fig. 1). Let us examine K::(x;t - f'), the spatiotemporal RF of the nth unit in the 0th scale. It is selective to spatial frequencies f E Cf".f"+') and all temporal frequencies w' with a sensitivity proportional to KCf. d).The spatial and temporal part of the filter are embodied in {cos[c5(x)].sin[c~(x)]} and {cos[cl(t)].sin[o(t)]),respectively. When x = x;,, the RF spatial phase is ch(x) = -7rn/2 + byfor all f-phase coherence-implying x: as the RF center. The RF amplitude reaches its peak at x = xyl and quickly decays as x moves away from it. This phase coherence and a finite bandwidth f E Cf".f"+') ensures the filter locality with a spatial spread Ax l/(f"+'- f " ) by the uncertainty principle. Similarly, the temporal phase coherence, o ( t )= constant for all w,is achieved at t - f' = t,,, the temporal latency, as implied by the temporal locality of the filter.6 Translation invariance is achieved since the RF for the nth neural unit is the same as that for the (n+2)nd unit, K%(x;t-t') = -Kfr+2[x-(~y1+2-#l);f-f/], except for a shift in the RF centers and up to a polarity change. 
[It is not possible to have the same RF for every neuron in an efficient code when the spatial frequency bandwidth in the cortex is larger than one octave (Li and Atick 1994a).] Note that a drifting grating cos[fx+spatial phase& (wt+temporal phase )] has a drifting velocity 'u = d w / f . Our neurons then respond to the two motion directions with relative amplitudes A+ and A-, respectively, and have a directional index D.I. = JJA+J - IA-lI/(JA+J+ IA-I). Variations of (A:. 4". 4;.4;) generate a whole family of spatiotemporal RFs of various directionality and RF phases. At one extreme when
+
N
5Here both choices, $ = ( N / N " ) n and $, = ( N / N " ) ( n+ n mod 2), are valid. In Li and Atick (1994a), however, only the first choice is given. hAs we stated earlier, d(f,ul) is chosen to make K[(t) causal and have minimum temporal spread, implying the temporal coherence U T ~ + , 4Cf.w) z constant for all w given an f . A change (pcf3 w ) + d ( f 3w ) + a still satisfies the requirement; and 4(f>w ) 4 U . w ) - W T for T > 0 merely prolongs the filter latency T ~ , T&,+ r. Although the depends on f, it is possible to choose t, as the largest minimum latency T~ = within a limited band y%fu+'), such that fxp + wtp + $(f. w ) x constant can be satisfied for xy = 0 or any x y without compromising causality. Since K l f , w ) varies very little within the limited band (see later), t, can be very close to the shortest latency for every spatial frequency component f in the band. Similar temporal phase structures have been observed in experiments (Hamilton et al. 1989).
Zhaoping Li
K_n^a(x; t − t′) ∝ K_x(x)K_t(t − t′)   (2.15)

where K_x and K_t are the purely spatial and temporal parts of the filter, with preferred spatial frequency f_peak. This neuron is not at all directionally selective, as intuitively expected from equation 2.15, which approximates the filters as spatiotemporally separable.^7 From here on, such filters will be viewed as separable. The other extreme case is when A^+ = 1 and A^− = 0:

K_n^a(x; t − t′) ∝ K_x(x)K_t(t − t′) − K̂_x(x)K̂_t(t − t′)

where K̂_x and K̂_t are the quadrature (sine-phase) counterparts of K_x and K_t (see Fig. 2), which is selective to only one motion direction. This neuron has a spatiotemporally inseparable RF composed of a pair of spatiotemporally separable filters.
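The directional index defined above and the two extremes just described can be contrasted numerically. In this sketch, pure sinusoids stand in for the band-limited profiles: the separable product cos(fx)cos(ωt) responds equally to the two drift directions (D.I. = 0), while the quadrature-pair combination cos(fx)cos(ωt) − sin(fx)sin(ωt) = cos(fx + ωt) responds to one direction only (D.I. = 1). Grid size, frequencies, and phase sampling are illustrative choices.

```python
import numpy as np

# Directional index from the two directional amplitudes, as defined in the text.
def directional_index_from_amplitudes(A_plus, A_minus):
    return abs(abs(A_plus) - abs(A_minus)) / (abs(A_plus) + abs(A_minus))

# Contrast the two extremes with idealized space-time filters.
n = 64
f = w = 2 * np.pi * 4 / n                     # 4 spatial and temporal cycles
X, T = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")

separable = np.cos(f * X) * np.cos(w * T)         # Kx*Kt: the A+ = A- extreme
inseparable = (np.cos(f * X) * np.cos(w * T)
               - np.sin(f * X) * np.sin(w * T))   # = cos(fX + wT): A- = 0 extreme

def response(filt, direction):
    """Peak response to drifting gratings cos(fx + direction*wt + p)."""
    phases = np.linspace(0, 2 * np.pi, 16, endpoint=False)
    return max(abs(np.sum(filt * np.cos(f * X + direction * w * T + p)))
               for p in phases)

def directional_index(filt):
    r_plus, r_minus = response(filt, +1), response(filt, -1)
    return abs(r_plus - r_minus) / (r_plus + r_minus)

di_sep = directional_index(separable)         # ~0: responds to both directions
di_insep = directional_index(inseparable)     # ~1: responds to one direction only
```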
^7 This approximation is valid when K(f, ω) is a smooth function of f and ω and changes little within the limited frequency range (f^a, f^{a+1}) (see the next section).
Visual Motion Coding in Visual Cortex
[Figure 1 panel labels: nth through (n+3)th units in the ath and (a+1)th scales, alternately tuned to left and right motion; neighboring units show the phase quadrature relationship, opposite preferred directions, and the same directional index; horizontal axis: space.]
Figure 1: Schematic illustration of the efficient motion coding in the multiscale representation. The RFs for two neighboring scales are shown. Space and time run in the horizontal and vertical directions, respectively. A perfectly oriented bar or edge in space-time implies complete cell directionality, as is used in this figure for illustration. The slope and sign of the orientation correspond to the preferred speed and direction of motion. Note that (1) the neighboring units have the quadrature phase relationship, opposite preferred directions of motion, and the same directional index, and (2) the preferred motion speed decreases as the cell RF size decreases (see Section 3). The RF centers of the neighboring units are displaced by a distance comparable to the RF sizes; this displacement is exaggerated in the figure to avoid RF overlap for clear illustration.

The efficient code implies the following relationships between neighboring units [i.e., the nth and (n + 1)th units, see Fig. 1]:

(A^+/A^−)_n = (A^−/A^+)_{n+1}    opposite motion direction preferences   (2.16)
(D.I.)_n = (D.I.)_{n+1}    same directional index   (2.17)
φ_n(x_n^a) = φ_{n+1}(x_{n+1}^a) + π/2    quadrature relationship between spatial phases   (2.18)
(see equations 2.14 and 2.12). When testing this neighbor relationship in the cortex, one should choose neurons that (1) have the same optimal spatial frequency, since they belong to the same scale a, (2) are tuned to the same orientation, as implied by our one-dimensional mathematical treatment, and (3) have RF centers x_n^a = (N/N^a)n or (N/N^a)(n + n mod 2),
which are displaced by x_{n+1}^a − x_n^a = 0, N/N^a, or 2N/N^a, comparable to the efficient Nyquist sampling period N/N^a, which is roughly half a grating period of the optimal spatial frequency. Hence one should distinguish the neighboring units mentioned here from anatomically neighboring cells in the cortex, which could be tuned to different optimal spatial frequencies and orientations, etc. (see Discussion). Furthermore, the two cells observed in an experiment may be neighboring units [nth and (n + 1)th units] as well as second neighboring units [nth and (n + 2)th units] to each other if their RF center displacements are not carefully monitored. In such cases, the two cells' directional preferences are likely to be the same as well as opposite. This is observed physiologically, where neighboring cells tend to prefer the same, sometimes opposite, but less often orthogonal, directions of motion (Berman et al. 1987). It is desirable to test whether those preferring opposite directions and having quadrature phase relationships also have the same directional index. In addition, one has the following observation from the theory. Given any scale and orientation, there is no need to have two filters tuned to opposite directions at each spatial location to have a complete representation. An efficient code needs an average of only one filter tuned to one direction per sampling interval: a pair of quadrature filters tuned to opposite directions for every two Nyquist sampling intervals (see Discussion). To illustrate the spatiotemporal RF, we need knowledge of K(f, ω), which depends on R(f, ω) and should be modified under noise. Different measurements have suggested that in natural scenes R(f, ω = 0) ∝ 1/f^2 (Field 1987) and R(f = 0, ω) ∝ 1/ω^2 (Dong and Atick 1994).
Without additional knowledge, this paper models R(f, ω) ∝ (f^2 + ζ^2ω^2)^{−1}, where ζ = 0.4 cycle·sec/degree is chosen to give a final contrast sensitivity K(f, ω) peaking around 8 Hz for low f, as observed psychophysically (Nakayama 1985). As one will see below, the qualitative results in this paper depend only on R(f, ω) decaying with increasing f and ω; hence the exact form of R(f, ω) is not crucial. Complete decorrelation requires K(f, ω) = [R(f, ω)]^{−1/2} (see equation 2.10), which increases with (f, ω) to amplify the lower signal power R(f, ω) at higher (f, ω). This leads to undesirable noise amplification when, at high (f, ω), the weak signal R(f, ω) is overwhelmed by the noise R_N, which is assumed to be white and, therefore, R_N = constant over (f, ω). A noise smoothing strategy^8 is employed to lower K(f, ω) whenever the signal-to-noise ratio R(f, ω)/R_N is small. The generic features of K(f, ω) are (Fig. 2):
K(f, ω) increases with f, ω at small (f, ω), where R(f, ω) >> R_N   (2.19)

^8 Noise smoothing gives K(f, ω) = M(f, ω) K(f, ω)|_noiseless (following Atick and Redlich 1992; Li and Atick 1994b; noise smoothing also follows from information theoretic arguments), where M ∝ R(f, ω)/[R(f, ω) + R_N] is a low-pass smoothing filter. In detail, this paper uses M(f, ω) = {R/[R + R_N]} exp[−(f/f_c)^4] with R = 16.0/(f^2 + ζ^2ω^2 + f_0^2), f_0 = 0.3 c/deg, and f_c = 22 c/deg.
[Figure 2, top panel: the construction of the filter sensitivity function.]
Figure 2: At the top is the filter sensitivity K, which deviates from the noiseless case K_noiseless = R^{−1/2} at high frequency, where the signal R is weak, to smooth out noise. K^a is the sensitivity for scale a, centered around f = 1 cycle (c)/deg. The temporal dimension is ignored for clarity in the plot. The lower plots are RF examples for pairs of neighboring units, shown next to each other with the left one as the even unit of the pair, each under the parameter value set (D.I., φ^t, φ_e^x, φ_o^x) that generates it. Space and time run in the horizontal and vertical directions, respectively, and each RF is centered at the RF center (x, t) = (x_n^a, t′ + t_p). The gray levels depict the filter amplitudes: gray for near-zero amplitudes, white and black for larger positive and negative amplitudes, respectively. The preferred spatial frequency is f_peak = 1 c/deg. A perfectly oriented bar or edge in space-time implies complete directionality, and spatiotemporal separability implies nondirectionality. Note (1) the neighboring units have the quadrature phase relationship, opposite preferred directions, and the same directional index D.I., (2) the changes in the RFs as D.I. decreases in the left column, and (3) the differences in RF phases between left and right column pairs of the same directionality. These RFs are obtained by the approximation K_n^a(x, t) ≈ (A^+ + A^−)K_x(x)K_t(t) + (A^− − A^+)K̂_x(x)K̂_t(t), where K_t(t) ∝ ∫_0^∞ dω K^a(f_peak, ω) cos[φ(t)], K̂_t(t) ∝ ∫_0^∞ dω K^a(f_peak, ω) sin[φ(t)], K_x(x) ∝ ∫_0^∞ df K^a(f, ω_peak) cos[φ(x)], and K̂_x(x) ∝ ∫_0^∞ df K^a(f, ω_peak) sin[φ(x)]. This approximation and figure format are used in other figures of this paper as well.
K(f, ω) decreases with f, ω at large (f, ω), where R(f, ω)/R_N is small   (2.20)

K(f, ω) peaks at some intermediate (f, ω), where the signal R(f, ω) starts to be overwhelmed by noise. Hence, if R = S^2/(f^2 + ζ^2ω^2), then K(f, ω) peaks at lower (f, ω) for smaller S^2. In the multiscale representation, we further model K(f, ω) within f^a ≤ f ≤ f^{a+1} by K^a(f, ω) (Fig. 2), where K^a(f, ω) = K(f, ω) exp{−[log(f/f_peak)/σ]^2/2} and σ = log(√2) models a 1.6 octave bandwidth (Li and Atick 1994a) of the frequency selective channel with optimal frequency f_peak. Figure 2 illustrates some examples of the spatiotemporal RFs of neighboring cells using these models.
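The generic features 2.19-2.20 can be sketched numerically. Assuming an illustrative power law R ∝ S^2/f^2 (temporal frequency held fixed) and the noise-smoothed sensitivity K = M·R^{−1/2} with M = R/(R + R_N), K rises where R >> R_N, falls where R is comparable to R_N, and its peak moves to lower frequency when the signal power S^2 is smaller. The constants below are illustrative, not the paper's fitted values.

```python
import numpy as np

# Noise-smoothed sensitivity K(f) = M * R^{-1/2}, with M = R/(R + R_N).
# For R = S^2/f^2 this simplifies to K = S*f/(S^2 + R_N*f^2),
# which peaks at f = S/sqrt(R_N).
f = np.linspace(0.05, 30.0, 3000)    # spatial frequency; temporal freq. fixed
R_N = 0.01                            # white noise power (illustrative)

def sensitivity(S2):
    R = S2 / f**2                     # signal power at this fixed temporal freq.
    M = R / (R + R_N)                 # low-pass noise-smoothing factor
    return M / np.sqrt(R)             # K = M * R^{-1/2}

f_peak_strong = f[np.argmax(sensitivity(1.0))]    # S^2 = 1    -> peak near f = 10
f_peak_weak = f[np.argmax(sensitivity(0.01))]     # S^2 = 0.01 -> peak near f = 1
```

The band-pass shape and the downward peak shift for weaker signals are exactly the features used in the cross-channel predictions of the next section.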
3 Correlation between Motion Coding and Visual Codings in Space, Color, and Stereo
We explore additional predictions from the motion coding theory to compare them with experimental observations or subject them to experimental tests. This can be carried out by studying the correlations between motion coding and the codings in space, color, and stereo. It was shown in Section 2 that for signal power R = S^2/(f^2 + ζ^2ω^2), the peak sensitivity K(f_peak, ω_peak) will occur at a lower frequency (f_peak, ω_peak) when the signal power R, or S^2, is smaller. In particular, the temporal sensitivity curve K_f^t(ω) = K(f, ω) for each spatial frequency f also peaks at some ω = ω_peak(f). Hence ω_peak(f) decreases as S^2 decreases or f increases. This has immediate consequences for cross-channel coding correlations when one notices that the signal power magnitude R depends on the frequency f, on whether the signal is achromatic or ocularly opponent, etc., as will be shown below.

3.1 Correlation between Spatial Coding and Motion Coding. The theory thus predicts that the cell optimal speed decreases with increasing optimal spatial frequency (see Fig. 3), as observed in experiments (Holub and Morton-Gibson 1981; Foster et al. 1985). This is because for a neuron with optimal spatiotemporal frequency (f_peak, ω_peak), the preferred motion speed is roughly v ≈ ω_peak/f_peak. The prediction follows since both 1/f_peak and, from the argument above, ω_peak decrease with increasing f. The model R(f, ω) = S^2/(f^2 + ζ^2ω^2) gives a slowly decreasing or roughly constant ω_peak(f) for a range of low spatial frequencies f (Fig. 3), suggesting a roughly inverse relationship v ∝ 1/f_peak. At a higher f, whose exact value depends on the signal-to-noise ratio or S^2, v_peak(f) starts to decrease sharply with f, the temporal sensitivity K_f^t(ω) becomes significantly low-pass, and v approaches zero. The same trend of v_peak(f) is observed physiologically (Holub and Morton-Gibson 1981) and psychophysically (Kelly 1979).
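This speed prediction can be sketched numerically with the model spectrum R(f, ω) = S^2/(f^2 + ζ^2ω^2) and a noise-smoothed sensitivity K = √R/(R + R_N); the values of S^2, ζ, and R_N below are illustrative, not fitted. The preferred temporal frequency ω_peak(f), and hence the preferred speed v = ω_peak(f)/f, decreases with f, and K becomes low-pass (ω_peak = 0) at high f.

```python
import numpy as np

# Preferred temporal frequency and speed versus optimal spatial frequency.
S2, zeta, R_N = 1.0, 0.4, 1e-3       # illustrative model constants
w = np.linspace(0.0, 120.0, 6000)    # temporal frequency grid

def w_peak(f):
    R = S2 / (f**2 + zeta**2 * w**2)      # model signal power R(f, w)
    K = np.sqrt(R) / (R + R_N)            # noise-smoothed temporal sensitivity
    return w[np.argmax(K)]

# Preferred speed v ~ w_peak(f)/f decreases with increasing f ...
speeds = {f: w_peak(f) / f for f in (2.0, 10.0, 25.0)}
# ... and at high f the sensitivity becomes low-pass: the peak sits at w = 0.
lowpass_peak = w_peak(35.0)
```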
The physiologically measured v_peak varies from cell to cell by up to a factor of around 10 for a given cell optimal f
[Figure 3, top panel: temporal sensitivity curves for different f.]
Figure 3: Changes of temporal sensitivities and spatiotemporal receptive fields with the optimal spatial frequency f. The filter orientation in space-time has a steeper slope as f increases, implying decreasing preferred motion speeds. Parameters used: D.I. = 1, φ^x = 0, and φ^t = 90°.

(Holub and Morton-Gibson 1981; Foster et al. 1985). Such variations cannot be accounted for by the present theory and may serve other computational purposes; e.g., Grzywacz and Yuille (1990) have used them for velocity computation. However, cortical cells have a wide temporal frequency bandwidth of around 3 octaves (Holub and Morton-Gibson 1981; Foster et al. 1985). This width is comparable to, and likely contributes to, the measured spread in v_peak(f). Another prediction is

(best sensitivity to contrast reversal grating)/(best sensitivity to drifting grating of preferred direction) = 1 if D.I. = 0, → 0.5 as D.I. → 1   (3.1)
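Equation 3.1 can be checked numerically for an idealized linear filter with directional amplitudes A^+ and A^−: the best response to a contrast reversal (counterphase) grating is proportional to (A^+ + A^−)/2, while the best response to a drifting grating of the preferred direction is proportional to the larger amplitude. Grid size, frequencies, and phase sampling are illustrative choices.

```python
import numpy as np

# Ratio of peak sensitivities: counterphase grating vs. preferred drifting grating.
n = 64
f = w = 2 * np.pi * 4 / n
X, T = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
phases = np.linspace(0, 2 * np.pi, 16, endpoint=False)

def peak(filt, stimuli):
    return max(abs(np.sum(filt * s)) for s in stimuli)

def ratio(A_plus, A_minus):
    filt = (A_plus * np.cos(f * X - w * T)       # component preferring one direction
            + A_minus * np.cos(f * X + w * T))   # component preferring the other
    drifting = [np.cos(f * X - w * T + p) for p in phases]
    counterphase = [np.cos(f * X + px) * np.cos(w * T + pt)
                    for px in phases for pt in phases]
    return peak(filt, counterphase) / peak(filt, drifting)

r_nondirectional = ratio(1.0, 1.0)   # D.I. = 0 -> ratio 1
r_directional = ratio(1.0, 0.0)      # D.I. = 1 -> ratio 0.5
```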
This stems from equations 2.10 and 2.11, which suggest gains proportional to (|A^+| + |A^−|)/2 and to A^±, respectively, for the two grating types. Psychophysically, the detection threshold for counterphased gratings is almost twice that for drifting gratings over a wide spatiotemporal frequency range (Levinson and Sekuler 1975; Watson et al. 1980). These
[Figure 4, top panel: temporal sensitivity curves.]
Figure 4: Temporal sensitivity and spatiotemporal RFs for the luminance and chrominance channels. Parameters used: D.I. = 1, φ^x = 0, φ^t = 90°, f_peak = 1 c/deg, and the signal power in the chromatic channel is 4% of that in the luminance channel. A smaller optimal motion speed in the chromatic channel is apparent.

observations were explained by noting that two half-contrast drifting gratings of opposite directions sum to a full-contrast reversal grating (see Burr 1991). The current prediction, however, is at the single-cell level and relies on the assumed linear mechanisms. Significant cortical nonlinearity (Reid et al. 1991) should make reality quantitatively different, but the trend of the above ratio decreasing with increasing directional index should still hold and can be tested.

3.2 Correlation between Color and Motion Coding. This theory also predicts a smaller optimal speed for the chromatic channel (see Fig. 4), since the chromatic signal power [hence ω_peak^chromatic(f)] is smaller than the luminance signal power [or ω_peak^luminance(f)]. This is consistent with the observation that perceived motion slows down dramatically at isoluminance (Cavanagh et al. 1984). The color channel is traditionally viewed as insensitive to motion (see Nakayama 1985). However, there is recent psychophysical and physiological evidence of chromatic contributions to motion detection (Dobkins and Albright 1993, 1994). At the level of a single striate cortical cell, chromatic and luminance signals are multiplexed (see Li and Atick 1994a). Accordingly, the actual motion sensitivity of a single color-selective cell is complicated and should depend on whether stimuli are isoluminant or not.

3.3 Correlation between Stereo and Motion Coding. Stereo coding (Li and Atick 1994b; Li 1995) is composed of ocular summation (the input summation from the two eyes) and ocular opponency (the input difference between the two eyes) channels. Let K_sum(x, t) and K_opp(x, t)
be the RFs for the summation and opponency channels, respectively. We have
K_c(x, t) ∝ ∫ df ∫ dω K_c(f, ω) [A_c^+ cos(fx + ωt + φ_c^+) + A_c^− cos(fx − ωt + φ_c^−)]   (3.2)
for c = sum, opp. Here all phase contributions, e.g., fx_n^a and φ(f, ω) (see equation 2.10), that do not depend on (x, t) are absorbed into the variables φ_c^±. (The subscript n for the neuron and the superscript a for the scale are omitted for clarity.) The binocular RFs in a cortical cell are (Li and Atick 1994b; Li 1995) K_l(x, t) = K_sum(x, t) + K_opp(x, t) for the left eye and K_r(x, t) = K_sum(x, t) − K_opp(x, t) for the right eye:
K_eye(x, t) = ∫_0^∞ df ∫_0^∞ dω [K_eye^+(f, ω) cos(fx + ωt + φ_eye^+)   (3.3)
              + K_eye^−(f, ω) cos(fx − ωt + φ_eye^−)]   (3.4)
with eye = l, r. Here K_eye^+(f, ω) and K_eye^−(f, ω) are the monocular sensitivities to stimuli of opposite motion directions. The questions are: what are these monocular sensitivities, and what are the corresponding optimal speeds v_eye^± ≈ ω_peak^±/f? From K_{l,r}(x, t) = K_sum(x, t) ± K_opp(x, t) and equation 3.2,

K_eye^±(f, ω) e^{iφ_eye^±} = A_sum^± K_sum(f, ω) e^{iφ_sum^±} ± A_opp^± K_opp(f, ω) e^{iφ_opp^±}   (3.5)

with + for eye = l and − for eye = r. The signal powers for ocular summation and opponency are R_sum = (1 + r)R(f, ω) and R_opp = (1 − r)R(f, ω), respectively, where 0 < r < 1 is the input ocular correlation normalized by the self-correlation within a single eye.^9 The inequality R_sum > R_opp immediately gives a larger optimal speed in the summation channel, v_sum > v_opp; in addition, the channel sensitivities K_sum(f, ω) and K_opp(f, ω) should differ, and they are not simply related by a gain factor: K_sum(f, ω) is not proportional to K_opp(f, ω). Consequently, by equation 3.5, K_eye^+(f, ω) is not proportional to K_eye^−(f, ω), and K_l(f, ω) is not proportional to K_r(f, ω). This means that, in a single neuron, the RFs for the two eyes can differ in detailed form as well as in overall sensitivity, and the contrast sensitivity curves K_eye^±(f, ω) for the two directions also differ by more than a gain factor. Accordingly, this theory predicts that (1) the optimal speed v_eye^± ≈ ω_peak^±/f_peak can vary with eye origin and motion direction; (2) some neurons change their preferred motion direction with ocular origin or change
^9 The temporal dimension of visual inputs was ignored in the earlier works (Li and Atick 1994b; Li 1995), and the ocular signal powers were denoted as [1 ± r(f)]R(f). Here we simply assume that the generalization [1 ± r(f)]R(f) → [1 ± r(f, ω)]R(f, ω) holds approximately.
their ocular dominance with motion direction; and (3) the directional index ||K_eye^+| − |K_eye^−||/(|K_eye^+| + |K_eye^−|) for monocular stimuli can vary with the frequency (f, ω) of the drifting grating presented, as observed physiologically (Reid et al. 1991), since K_eye^+/K_eye^− is not a constant of (f, ω). To illustrate the predictions, consider first the example when A_sum^+ = A_opp^+ = A_sum^− = −A_opp^− = A (Fig. 5A and C). Then
K_l^±(f, ω) = A[K_sum(f, ω) ± K_opp(f, ω)]   (3.6)
K_r^±(f, ω) = A[K_sum(f, ω) ∓ K_opp(f, ω)] = K_l^∓(f, ω)   (3.7)
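A scalar sketch of equations 3.6-3.7 at a single (f, ω), with illustrative sensitivity values K_sum > K_opp > 0: the preferred direction flips between the eyes, the ocular dominance flips with motion direction, and the directional index is the same for the two eyes.

```python
# Monocular sensitivities from the summation and opponency channels
# (equations 3.6-3.7); A, K_sum, K_opp are illustrative numbers.
A, K_sum, K_opp = 1.0, 3.0, 1.0
K_l = {'+': A * (K_sum + K_opp), '-': A * (K_sum - K_opp)}   # left eye
K_r = {'+': A * (K_sum - K_opp), '-': A * (K_sum + K_opp)}   # right eye = left, swapped

def DI(K):
    """Directional index of one eye's sensitivities."""
    return abs(K['+'] - K['-']) / (K['+'] + K['-'])

assert K_l['+'] > K_l['-'] and K_r['-'] > K_r['+']   # opposite preferred directions
assert K_l['+'] > K_r['+'] and K_l['-'] < K_r['-']   # ocular dominance flips with direction
assert DI(K_l) == DI(K_r)                            # same directional index per eye
```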
Although both the summation and opponency channels are nondirectional, this cell has a directional RF when considering either eye alone, since K_l^+ > K_l^− and K_r^− > K_r^+, but the preferred direction changes with the eye. In addition, the ocular dominance changes with motion direction, since K_l^+ > K_r^+ (left-dominant) but K_l^− < K_r^− (right-dominant) by equation 3.7, which implies that a direction change is equivalent to an ocular origin change for this cell. Furthermore, the optimal motion speed for the left eye, for example, is larger in the negative direction, v_l^− > v_l^+. This is because the temporal sensitivity curve is low-pass in the positive direction, K_l^+(ω) = K_sum(ω) + K_opp(ω), but band-pass in the negative direction, K_l^−(ω) = K_sum(ω) − K_opp(ω) (Fig. 5A), giving ω_peak^− > ω_peak^+, although the preferred spatial frequencies f ∈ (f^a, f^{a+1}) are roughly the same for the two directions. Another example (Fig. 5A and B) is when A_sum^+ = A_opp^+ = 1 and A_sum^− = A_opp^− = 0. The ocular summation/opponency channels, and hence the left/right eye channels, are completely directional. The monocular RFs have sensitivities K_{l,r}^+ = K_sum ± K_opp by equation 3.5. This cell thus changes its optimal speed with ocular origin, just as the cell in the previous example does with motion direction (within a single eye). The predicted ocular differences in preferred speeds and directions of motion have been observed physiologically (Beverley and Regan 1973; Poggio 1992; DeAngelis et al. 1994), and such neurons can sense object motion in depth. The predicted changes in the monocular optimal speed with motion direction, as well as the ocular dominance changes with direction, can be experimentally tested.

4 Summary and Discussion
This paper demonstrates that efficient coding in the multiscale representation can account for many experimental observations of motion and directional sensitivity in simple cells of the primary visual cortex. A whole spectrum of neural directional indices and different degrees of RF spatiotemporal separability are predicted. In addition, the cortical motion coding is predicted to correlate with the codings in the space, color, and stereo domains. Explicitly, the theory predicts that the cell preferred speeds decrease with increasing optimal spatial frequencies, can
Figure 5: Interaction between motion and stereo coding. (A) Temporal sensitivity functions for the ocular summation K_sum, ocular opponency K_opp, K_sum + K_opp, and K_sum − K_opp channels for spatial frequency f_peak = 2 cycles/degree, which is used in B and C. Here the binocular correlation used is r(f_peak) = 0.96 e^{−f_peak/(15 c/deg)}. (B) An example of different preferred velocities for the two eyes (see text). (C) An example of different preferred directions of motion for the two eyes (see text). It is not difficult to see that the optimal speeds for opposite directions of motion in the same eye are also different, and that the ocular dominance of this neuron changes with motion direction.
differ for the two eyes in the same neuron, are much slower in the color-sensitive channel, and that the two eyes of the same neuron can prefer opposite directions of motion. These predictions agree with physiological or psychophysical observations (Beverley and Regan 1973; Holub and Morton-Gibson 1981; Cavanagh et al. 1984; Dobkins and Albright 1993; Foster et al. 1985; Reid et al. 1991; Poggio 1992; DeAngelis et al. 1994). Furthermore, the theory gives testable predictions that have not been systematically investigated experimentally. These predictions are as follows: (1) if two nearby neurons prefer the same optimal spatial frequency, the same orientation, and opposite motion directions, and have a quadrature RF phase relationship, then they should have the same directional index; (2) a single neuron can have different optimal speeds for opposite directions of motion presented monocularly; and (3) a neuron's ocular dominance may change with motion direction when opposite directional preferences occur for inputs from different eyes. A special class of neurons predicted by this theory resembles the linear units in the motion models of Adelson and Bergen (1985) and Watson and Ahumada (1985). While these computational models are constructed with the goal of motion or velocity computation within the constraints of physiology and psychophysics, the present theory derives from efficient coding in the multiscale representation without a priori requiring motion sensing or computation. The efficient coding framework provides the following additional features not present in the previous models: (1) given a spatial orientation and scale, a requirement of only one pair of phase quadrature filters preferring opposite directions for every two Nyquist sampling intervals in the visual field; and (2) a mechanism relating RF properties to input signal powers, leading to additional predictions on the correlation between the motion coding and the spatial, chromatic, and stereo codings.
The formulation in equation 2.11 is similar to the model of Hamilton et al. (1989), except that the former is derived from efficient coding principles, while the latter is constructed to fit experimental data. The current theory uses a linear approximation for cortical coding mechanisms. The significant cortical nonlinearities, such as those that facilitate the neural responses to preferred motion directions and inhibit the responses to nonpreferred directions, as observed by Reid et al. (1991), will lead to quantitative discrepancies between the theory and experiments. However, one can make the following observations. First, this work focuses on efficiency by reducing redundancy between different neural units. Efficiency can be enhanced by using a proper nonlinear transfer function at the single-neuron level to achieve maximum information within a limited dynamic range, i.e., histogram equalization, as was done by Laughlin (1981). Such nonlinearity would be within a single cell and may be similar to the action potential generation mechanism. It does not affect the cell's directional preference and receptive field significantly, and it is still within the goal of efficiency. Experimental evidence (Jagadeesh et al. 1993) also suggested that most of the nonlinearity in simple cell motion selectivity originates from action potential generation. Second, coding efficiency is always with respect to a particular visual environment, characterized by, e.g., signal-to-noise ratios or adaptation levels. To maintain efficiency, environmental changes should lead to coding changes that necessarily involve nonlinear mechanisms, e.g., gain control or normalization (e.g., Heeger 1993), and may involve interactions between output neurons. The current work does not include such nonlinearity because it focuses on what an efficient code should be, not on how it is developed or adapted. Third, the RF characterizes only the effective transform from visual inputs to cortical outputs; it does not exclude possible contributions from cortical feedback interactions (Douglas and Martin 1992), which could play a significant role in the actual receptive field construction. This argument is apparent in the linear approximation, although the reality is most likely nonlinear. Having said this, one should note that other visual functions beyond efficiency are likely to contribute to the cortical nonlinearity, which cannot be understood within the current framework. A lack of precise knowledge of the natural input power spectrum in the temporal domain makes most theoretical predictions nonquantitative. In any case, the quantitative predictions would also depend on the signal-to-noise ratios used in particular experiments. This theory has further considerable limitations. The derivations in Section 2 and the Appendix implicitly assume that there are as many input units (retinal ganglion cells) as output units (primary visual cortical cells). Under that assumption, an efficient representation (equation 2.10) should have only one particular parameter set (A^±, φ_e^x, φ_o^x,
φ^t), permitting only one directional index for all cells and two receptive field phase values in quadrature with each other (at least when considering cells preferring the same orientation and scale). In fact, when spatial and stereo codings are also included (Li and Atick 1994a,b), it then follows that there should be only two (orthogonal) choices of preferred orientation, as well as one ocular dominance index and two optimal disparity values, for each spatial scale and orientation. In reality, however, there are about 40 times as many cortical cells in V1 as retinal ganglion cells (Barlow 1981), and a spectrum of directional indices, preferred orientations and disparities, and ocular dominance indices in a single cortex (Hubel and Wiesel 1974; Berardi et al. 1982; Berman et al. 1987). The reasons for the cortical cell proliferation and its extent are beyond the scope of the efficient coding theory. However, given a larger cortical cell population compared to that of the retinal ganglion cells, the efficient coding theory can be generalized and still applied. Essentially the same cell RF properties can be obtained, either in a statistical mechanics framework by Nadal and Parga (1993, and Nadal private communication 1995) or nonstatistically (Li, in preparation). Briefly, when the output units are many times more numerous than the sensory input units, efficient coding will
produce many different copies of codes like the ones in Section 2. Each copy carries less information than it would if the cell population were smaller, and the representations in different copies are not decorrelated, but the overall representation is still the most efficient given the larger population. However, different copies can have different code parameters (e.g., A^±, φ^t, φ_{e,o}^x, if considering only the motion coding) and can thus generate the whole spectrum of RF properties observed in the cortex. Many of the predictions from the efficient coding framework, such as the cell quadrature phase structures, the spatial frequency bandwidth, the color-selective blob cells, and the correlation between spatial and stereo coding, some of which rely heavily on the efficient coding assumption, agree with experimental observations (see Li and Atick 1994a,b and references therein). In addition, the theoretical framework has already provided predictions that had not been experimentally investigated and have subsequently been confirmed in experiments (Li 1995; Anzai et al. 1994). These facts give credibility to efficient coding as a useful framework for understanding at least some of the primary visual cortical processing. The current work, with some of its predictions not yet experimentally investigated, provides more testing grounds to explore the strengths and limitations of the efficient coding framework. In particular, the test of the prediction about neighboring motion-sensitive units is crucial to the theory. This is because confirmation of this prediction requires the neighboring cells to (1) have the same directional index if (2) they are in a quadrature phase relationship, (3) have the same optimal spatial frequency and orientation, and (4) prefer opposite motion directions. To simultaneously satisfy these conditions would be difficult if the neural properties were randomly assigned.
Note that conditions (2) and (3) serve to reduce or eliminate the probability that the two neighboring cells might be from different efficient copies in the output cell population. This is because different efficient copies are likely to have different orientation preferences and center spatial frequencies, and, in addition, the output neurons in different copies may be anatomically close but have no a priori reason for a fixed relationship between their RF properties.
Appendix

This appendix shows that equations 2.10-2.14 depict the general spatiotemporal RFs in a multiscale linear efficient code that is translation invariant, temporally causal, and spatiotemporally local (i.e., the RFs have finite and minimum spatiotemporal span to ensure finite synaptic connection length, retinotopy, and minimum delay in information extraction).
It was shown (Li and Atick 1994a) that the unitary matrix required to combine the nonlocal filters in a spatial frequency band in equation 2.9 is

U_{nj}^a ∝ e^{i(f_j x_n^a − πn/2 + φ)}  if f_j > 0,   U_{nj}^a = (U_{n,−j}^a)^*  if f_j < 0   (A.1)

for n = 1, 2, ..., N^a and arbitrary φ. Combining equations A.1, 2.9, 2.3, 2.4, and 2.5, we have

K_n^a(x; t − t′) ∝ ∫_{f^a}^{f^{a+1}} df ∫_0^∞ dω K(f, ω) cos[f(x_n^a − x) − πn/2 + φ] cos[ω(t − t′) + φ(f, ω)]   (A.2)
This is exactly the spatiotemporally separable filter in equation 2.11 when A^+ = A^−. This RF is the most local spatially, as implied by the phase coherence at x = x_n^a and a finite bandwidth f ∈ (f^a, f^{a+1}). There is translation invariance K_n^a(x, t − t′) = −K_{n+2}^a[x − (x_{n+2}^a − x_n^a), t − t′] between every second unit, and a quadrature RF phase relationship between neighbors K_n^a and K_{n+1}^a. These RF similarities give the best translation invariance possible in a scale of more than 1 octave bandwidth in the cortex (Li and Atick 1994a). The RF centers x_n^a = (N/N^a)n [or x_n^a = (N/N^a)(n + n mod 2)], for n = 1, 2, ..., N^a, are distributed over the input visual field x ∈ (0, N). To obtain the general RF, we note that any changes in the spatial phase φ in equation A.2 and in the temporal phase, φ(f, ω) → φ(f, ω) + β for any β, will not compromise efficiency, spatial locality, translation invariance, causality, or the minimum temporal latency and spread. A code of such kind is denoted by K_n^a(· | φ, β). Equation 2.9 states that a desired RF has to be a linear combination of the filters K_n^a of equation 2.5, with φ(f, ω) → φ(f, ω) + β for any β and f ∈ (f^a, f^{a+1}). In particular, K_n^a(· | φ, β) is so constructed. The desired causality and locality additionally require the general RF to be composed of only causal and most local filters: K_n^a = Σ_{φ,β} w(φ, β) K_n^a(· | φ, β), where w(φ, β) is a weight function. Note that equation A.2 can also be written as
K_n^a(x; t − t′) ∝ ∫_{f^a}^{f^{a+1}} df ∫_0^∞ dω K(f, ω)
  × {cos{[f(x_n^a − x) − πn/2] + [ω(t − t′) + φ(f, ω)] + φ^+}
  + cos{[f(x_n^a − x) − πn/2] − [ω(t − t′) + φ(f, ω)] + φ^−}}   (A.3)

with φ^+ = φ + β and φ^− = φ − β. Then the general RF K_n^a = Σ_{φ,β} w(φ, β) K_n^a(· | φ, β)
is

K_n^a ∝ ∫_{f^a}^{f^{a+1}} df ∫_0^∞ dω K(f, ω)
  × {A^+ cos{[f(x_n^a − x) − πn/2] + [ω(t − t′) + φ(f, ω)] + φ^+}
  + A^− cos{[f(x_n^a − x) − πn/2] − [ω(t − t′) + φ(f, ω)] + φ^−}}   (A.4)
with A^± e^{iφ^±} = Σ_{φ,β} w(φ, β) e^{i(φ±β)}. The best possible translation invariance, K_n^a(x, t − t′) = (±)K_{n+2}^a[x − (x_{n+2}^a − x_n^a), t − t′], requires the same (A^±, φ^±) = (A_e^±, φ_e^±) for all the even units and (A^±, φ^±) = (A_o^±, φ_o^±) for all the odd units. Hence the general RFs are within this class of filters, and decorrelation between the outputs O_n^a = ∫_{−∞}^{∞} dx ∫_{−∞}^{t} dt′ K_n^a(x_n^a − x, t − t′) S(x, t′), i.e., ⟨O_n^a(t)O_m^a(t′)⟩ = δ_{tt′}δ_{nm}, will restrict the choices of the parameters (A_e^±, A_o^±, φ_e^±, φ_o^±). The output correlation is
Hence, for f^a < |f| ≤ f^{a+1},

where (A_n^±, φ_n^±) = (A_e^±, φ_e^±) or (A_o^±, φ_o^±) when n is even or odd. Since K(f, ω) = R^{−1/2}(f, ω), denoting the complex conjugate by c.c., we have

(A.6)
where sgn(f) = 1 or −1 when f > 0 or f < 0, respectively. To continue, we note that U^a, as given in equation A.1, is a unitary matrix; hence Σ_j U_{nj}^a (U_{mj}^a)^* = δ_{nm}. Then

Σ_{f^a < |f| ≤ f^{a+1}} e^{i[f(x_n^a − x_m^a) − sgn(f)π(n−m)/2]} + c.c. ∝ δ_{nm}   (A.7)

Hence, Σ_{f^a < f ≤ f^{a+1}} e^{i[f(x_n^a − x_m^a) − π(n−m)/2]} = ip is a pure imaginary number when n ≠ m. Similarly, ∫_0^∞ dω e^{−iω(t−t′)} = iq is a pure imaginary number when t ≠ t′. By the definition (Li and Atick 1994a) of U^a, f_j = 2πj/N (radian/grid) with j = j̄ + 1, j̄ + 2, ..., j̄′, and N^a = 2(j̄′ − j̄). Then
N"/2-1
0
if n = m + 21 # n, since qI= ( N / N a ) n or (N/N"j(n + n mod 2)
1/2 (up to a normalization constant), if n = m ip otherwise This gives decorrelation ( 0 : , ( t 1 ) O f n ( t 2 ) ) = 0 in equation A.6 for all m = n + 2k # n for any (A*%@).While for n = m, we have the temporal decorrelation: ( O " , ( t 1 ) 0 : , ( f 2 ) ) K J-", dwe'w('1-t2)K S,,,,. When m = n + 2k + 1, we differentiate two situations: (1) t l = t 2 and (2) tl # f 2 . Their respective decorrelation requires {0~l(t)0~l+2,+1(t)) = (AiA:ipei(f-f2)
+ c.c.)
+ ( ~ - ~ - i ~ ~ i ( 4 -+ 4 c,c.) ;) II
=
(@,(f1)0:l+2k+l(t2
# tl))
=
in
-P[A;tA: sin(& - d:) + A,A, sin($, - &)I = 0 0
ix
= ( A , i , ~ ; j ~ dweiw(ti-,2)+l(~~-") +
-
(A,A,ipJomdwe-'w('i-'2)f'(4--~m)
( ~ , tm~ l+11,I(&&) i,i + c,c,) -
(A;A;ipi@(dy-@iz) + c.c)
(A.8) + c,c,) + c.c,j
728
Zhaoping Li
Combining equations A.8 and A.9 gives A^e_+ A^o_+ e^{ i(φ^e_+ - φ^o_+) } = A^e_- A^o_- e^{ -i(φ^e_- - φ^o_-) } for m = n + 2k + 1. Hence we have A^o_± = η A^e_±, for η = ±1, and φ^e_+ + φ^e_- = φ^o_+ + φ^o_-. It then follows that the general efficient spatiotemporal code is of the form in equations 2.10-2.14 and is determined by five parameters [A_+, A_-, (φ_+ + φ_-)/2, (φ^e_+ - φ^e_-)/2, (φ^o_+ - φ^o_-)/2].
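The key cancellation used in this appendix, the vanishing of the band-summed phase factors between every second unit, can be checked numerically. The sketch below is illustrative only: it assumes a band of N^a/2 consecutive frequencies f_j = 2πj/N with centers x^a_n = (N/N^a)n, and the names `band_sum`, `M`, and `j0` are introduced here, not taken from the paper.

```python
import cmath
import math

def band_sum(delta, M, j0):
    """Sum of e^{i[f_j (x_n - x_m) - pi*delta/2]} over one frequency band.

    With f_j = 2*pi*j/N and x_n = (N/N^a)*n, the exponent reduces to
    2*pi*j*delta/N^a - pi*delta/2, where delta = n - m and N^a = 2*M is
    the number of units in the band (an illustrative assumption).
    """
    Na = 2 * M
    return sum(
        cmath.exp(1j * (2 * math.pi * j * delta / Na - math.pi * delta / 2))
        for j in range(j0 + 1, j0 + M + 1)
    )

M, j0 = 8, 8  # hypothetical band: j = 9, ..., 16, so N^a = 16 units

# delta = n - m even and nonzero: the sum vanishes, giving the
# decorrelation between every second unit for any (A, phi)
print(abs(band_sum(2, M, j0)))  # ~0
print(abs(band_sum(4, M, j0)))  # ~0

# delta = 0: the sum equals M (the "1/2 up to normalization" case)
print(band_sum(0, M, j0).real)
```

The even-delta cancellation is a geometric-series identity and holds for any starting index j0, which is why the decorrelation between units n and n + 2k places no constraint on the parameters (A_±, φ_±).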
Acknowledgments
I wish to thank Edward H. Adelson for very helpful discussions and comments on the draft, and Ning Qian, Christopher Kolb, Wyeth Bair, Joachim Braun, and the two referees for carefully reading the manuscript and very useful comments. Work supported by the Research Grant Council of Hong Kong.
References
Adelson, E. H., and Bergen, J. R. 1985. Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A 2(2), 284-299.
Anzai, A., DeAngelis, G. C., Ohzawa, I., and Freeman, R. D. 1994. Private communication.
Atick, J. J., and Redlich, A. N. 1990. Towards a theory of early visual processing. Neural Comp. 2, 308-320. Also: 1992. What does the retina know about natural scenes? Neural Comp. 4, 196-210.
Atick, J. J., Li, Z., and Redlich, A. N. 1992. Understanding retinal color coding from first principles. Neural Comp. 4, 559-572.
Barlow, H. B. 1961. Possible principles underlying the transformation of sensory messages. In Sensory Communication, W. A. Rosenblith, ed. MIT Press, Cambridge, MA.
Barlow, H. B. 1981. The Ferrier lecture, 1980: Critical limiting factors in the design of the eye and visual cortex. Proc. R. Soc. London B 212, 1-34.
Berardi, N., Bisti, S., Cattaneo, A., Fiorentini, A., and Maffei, L. 1982. Correlation between the preferred orientation and spatial frequency of neurones in visual areas 17 and 18 of the cat. J. Physiol. 323, 603-618.
Berman, N. E. J., Wilkes, M. E., and Payne, B. R. 1987. Organization of orientation and direction selectivity in areas 17 and 18 of cat cerebral cortex. J. Neurophysiol. 58(4), 676-699.
Beverley, K. I., and Regan, D. 1973. Evidence for the existence of neural mechanisms selectively sensitive to the direction of movement in space. J. Physiol. 235, 17-29.
Bialek, W., Ruderman, D. L., and Zee, A. 1991. Optimal sampling of natural images: A design principle for the visual system? In Advances in Neural Information Processing 3, R. Lippmann, J. Moody, and D. Touretzky, eds., pp. 363-369. Morgan Kaufmann, San Mateo, CA.
Burr, D. C. 1991. Human sensitivity to flicker and motion. In Vision and Visual Dysfunction, Vol. 5: Limits of Vision, J. J. Kulikowski, V. Walsh, and I. J. Murray, eds. Macmillan, New York.
Cavanagh, P., Tyler, C. W., and Favreau, O. E. 1984. Perceived velocity of moving chromatic gratings. J. Opt. Soc. Am. A 1, 893-899.
DeAngelis, G. C., Ohzawa, I., and Freeman, R. 1994. Neuronal mechanisms underlying stereopsis: How do simple cells in the visual cortex encode binocular disparity? Perception (in press).
Dobkins, K. R., and Albright, T. D. 1993. What happens if it changes color when it moves?: Psychophysical experiments on the nature of chromatic input to motion detectors. Vision Res. 33(8), 1019-1036. See also Dobkins, K. R., and Albright, T. D. 1994. What happens if it changes color when it moves?: Neurophysiological experiments on the nature of chromatic input to macaque area MT. J. Neurosci. 14(8), 4854-4870.
Dong, D.-W., and Atick, J. J. 1994. Temporal decorrelation: A theory of lagged and nonlagged cells in the lateral geniculate nucleus. Submitted.
Douglas, R. J., and Martin, K. A. C. 1992. Exploring cortical microcircuits: A combined anatomical, physiological, and computational approach. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds., pp. 381-412. Academic Press, Orlando, FL.
Field, D. J. 1987. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A 4, 2379-2394.
Foster, K. H., Gaska, J. P., Nagler, M., and Pollen, D. A. 1985. Spatial and temporal frequency selectivity of neurones in visual cortical areas V1 and V2 of the Macaque monkey. J. Physiol. 365, 331-363.
Grzywacz, N. M., and Yuille, A. L. 1990. A model for the estimate of local image velocity by cells in the visual cortex. Proc. R. Soc. Lond. B 239, 129-161.
Hamilton, D. B., Albrecht, D. G., and Geisler, W. S. 1989. Visual cortical receptive fields in monkey and cat: Spatial and temporal phase transfer function. Vision Res. 29, 1285-1308.
Heeger, D. J. 1993. Modeling simple cell direction selectivity with normalized, half-squared, linear operators. J. Neurophysiol. 70, 1885-1898.
Holub, R. A., and Morton-Gibson, M. 1981. Response of visual cortical neurons of the cats to moving sinusoidal gratings: Response-contrast functions and spatiotemporal interactions. J. Neurophysiol. 46, 1244-1259.
Hubel, D. H., and Wiesel, T. N. 1959. Receptive fields of single neurones in the cat's visual cortex. J. Physiol. Lond. 148, 574-591.
Hubel, D. H., and Wiesel, T. N. 1962. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. Lond. 160, 106-154.
Hubel, D. H., and Wiesel, T. N. 1974. Uniformity of monkey striate cortex: A parallel relationship between field size, scatter, and magnification factor. J. Comp. Neurol. 158, 295-305.
Jagadeesh, B., Wheat, H. S., and Ferster, D. 1993. Linearity of summation of synaptic potentials underlying direction selectivity in simple cells of the cat visual cortex. Science 262, 1901-1904.
Kelly, D. H. 1979. Motion and vision II. Stabilized spatio-temporal threshold surface. J. Opt. Soc. Am. 69, 1340-1349.
Laughlin, S. B. 1981. A simple coding procedure enhances a neuron's information capacity. Z. Naturf. 36c, 910-912.
Levinson, E., and Sekuler, R. 1975. The independence of channels in human vision selective for direction of movement. J. Physiol. 250, 347-366.
Li, Z. 1995. Understanding ocular dominance development from binocular input statistics. In The Neurobiology of Computation (Proceedings of Computational Neuroscience Conference 1994), J. Bower, ed., pp. 397-402. Kluwer Academic Publishers, Boston, MA.
Li, Z., and Atick, J. J. 1994a. Towards a theory of the striate cortex. Neural Comp. 6, 127-146.
Li, Z., and Atick, J. J. 1994b. Efficient stereo coding in the multiscale representation. Network 5, 1-18.
Linsker, R. 1989. An application of the principle of maximum information preservation to linear systems. In Advances in Neural Information Processing 1, D. Touretzky, ed., pp. 186-194. Morgan Kaufmann, San Mateo, CA.
Marr, D., and Ullman, S. 1981. Directional selectivity and its use in early visual processing. Proc. R. Soc. Lond. B 211, 151-180.
Nadal, J.-P., and Parga, N. 1993. Information processing by a perceptron in an unsupervised learning task. Network: Comput. Neural Syst. 4(3), 295-312.
Nakayama, K. 1985. Biological image motion processing: A review. Vision Res. 25(5), 625-660.
Poggio, G. F. 1992. Physiological basis of stereoscopic vision. In Vision and Visual Dysfunction, Vol. 9: Binocular Vision, J. R. Cronly-Dillon and D. Regan, eds. CRC Press, Boca Raton, FL.
Reichardt, W. 1961. Autocorrelation, a principle for the evaluation of sensory information by the central nervous system. In Sensory Communication, W. A. Rosenblith, ed. Wiley, New York.
Reid, R. C., Soodak, R. E., and Shapley, R. M. 1991. Directional selectivity and spatiotemporal structure of receptive fields of simple cells in cat striate cortex. J. Neurophysiol. 66(6), 505-529.
Srinivasan, M. V., Laughlin, S. B., and Dubs, A. 1982. Predictive coding: A fresh view of inhibition in the retina. Proc. R. Soc. Lond. B 216, 427-459.
Torre, V., and Poggio, T. 1978. A synaptic mechanism possibly underlying directional selectivity to motion. Proc. R. Soc. Lond. B 202, 409-416.
van Santen, J. P. H., and Sperling, G. 1984. A temporal covariance model of human motion perception. J. Opt. Soc. Am. A 1, 451-473.
Watson, A. B., and Ahumada, A. J. 1985. Model of human visual-motion sensing. J. Opt. Soc. Am. A 2(2), 322-341.
Watson, A. B., Thompson, P. G., Murphy, B. J., and Nachmias, J. 1980. Summation and discrimination of gratings moving in opposite directions. Vision Res. 20, 341-347.
Received February 23, 1995; accepted November 2, 1995.
Communicated by Geoff Hinton
NOTE
Unicycling Helps Your French: Spontaneous Recovery of Associations by Learning Unrelated Tasks Inman Harvey School of Cognitive and Computing Sciences, University of Sussex, Brighton, England James V. Stone School of Biological Sciences, University of Sussex, Brighton, England We demonstrate that imperfect recall of a set of associations can usually be improved by training on a new, unrelated set of associations. This spontaneous recovery of associations is a consequence of the high dimensionality of weight spaces, and is therefore not peculiar to any single type of neural net. Accordingly, this work may have implications for spontaneous recovery of memory in the central nervous system. 1 Introduction A spontaneous recovery effect in connectionist nets was first noted in Hinton and Sejnowski (1986), and analyzed in Hinton and Plaut (1987). A net was first trained on a set of associations, and then its performance on this set was degraded by training on a new set. When retraining was then carried out on a proportion of the original set of associations, performance also improved on the remainder of that set. In this paper a more general effect is demonstrated. A net is first trained on a set of associations, called task A; and then performance on this task is degraded, either by random perturbations of the connection weights, or as a result of learning a new task B. Performance on A is then monitored while the net is trained on another new task C. The main result of this paper is that in most cases performance on the original task A initially improves. The following is a simplistic analogy, which assumes that this effect carries over to human learning of cognitive tasks. If you have a French examination tomorrow, but you have forgotten quite a lot of French, then a short spell of learning some new task, such as unicycling, can be expected to improve your performance in the French examination. 
Students of French should be warned not to take this fanciful analogy too literally; it requires the implausible assumption that French and unicycling make use of a common subset of neuronal connections. Neural Computation
8, 697-704 (1996)
@ 1996 Massachusetts Institute of Technology
698
Inman Harvey and James V. Stone
We will first give an informal argument to explain the underlying geometric reasons for this effect; follow this with an analysis of how it scales with the dimensionality of weight-space; and then demonstrate it with some examples. 2 High Dimensional Geometry
A number of assumptions will be used here; later we will evaluate their validity. Learning in connectionist models typically involves a succession of small changes to the connection weights between units. This can be interpreted as the movement of a point W in weight-space, the dimensionality of which is the number of weights. For the present, we assume that training on a particular task A moves W in a straight line toward a point A, where A represents the weights of a net that performs perfectly on task A; we also assume that distance from A is monotonically related to decrease in performance on task A. Let A be the position of W after task A has been learned (see Fig. 1). Assume that some "forgetting" takes place, either through random weight changes, or through some training on a different task, which shifts W to a new point B. The point B lies on the surface of H, a hypersphere of radius r = |A - B| centered on A. We then initiate training on a task C, which is unrelated to task A; under our assumptions, training moves W from B toward a point C, which is distance d = |A - C| from A. If the line connecting B to C passes through the volume of H, then the distance |W - A| initially decreases as W moves toward C. In such cases, training on task C initially causes improvement in performance on task A. We assume that point A has been chosen from a bounded set of points S, which may have any distribution; that H is centered on A; that B is chosen from a uniform distribution over the surface of H; and that C is chosen from S independently of the positions of A or B. What, then, is the probability that line segment BC passes through H? That is, what is the probability that training on task C generates spontaneous recovery on task A? If C lies within H (i.e., if d < r) then recovery is guaranteed. For any point C outside H there is a probability p ≥ 0.5 of recovery on task A. Figure 1 demonstrates this for a two-dimensional space.
The point B may lie anywhere on the circumference of H. The line segment BC fails to pass through H only if B lies on the smaller arc PQ, where CP and CQ are tangents to the circle, and hence cos(θ) = r/d. Thus p ≥ 0.5, and p → 0.5 as d → ∞. Consider the extension to a third dimension, while retaining the same values r, d, and θ. The probability q = (1 - p) that BC fails to pass through the sphere H is equal to the proportion of the surface of H that lies within
Spontaneous Recovery of Associations
699
Figure 1: The circle is a 2D representation of hypersphere H. Initial movement from a point B on the circumference toward C has two possible consequences: trajectory B1 → C is outside H, whereas B2 → C intersects H.
a cone defined by PCQ with apex C. This proportion is considerably smaller in 3D than it is in 2D. We produce a formula for this proportion in n dimensions in the next section. We demonstrate analytically what can be seen intuitively, namely that for any given θ < π/2, as n increases q tends to zero. 3 Analysis
Let S(n, r, θ) be the surface "area" of the segment of a hypersphere of radius r in n dimensions, subtended by a (hyper-)cone of half-angle θ; this segment is not a surface area, but rather a surface hypervolume of dimensionality (n - 1). The surface "area" of the complete hypersphere is S(n, r, π). For some constant k_n, S(n, r, π) = k_n r^{n-1}. We can use this to calculate S(n, r, θ) by integration:
S(n, r, θ) = ∫_{α=0}^{θ} S(n - 1, r sin(α), π) r dα

We require the ratio R_{n,θ} of S(n, r, θ) to S(n, r, π).
Figure 2: Graph of ratio R_{n,θ} against n. Both axes are logarithmically scaled. Data points are marked by circles on the lines, from θ = π/4 on the left to 31π/64 on the right.
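The decay of R_{n,θ} with n can be reproduced numerically from the integral ratio derived in this section. A minimal sketch using midpoint-rule quadrature (the function name `ratio` and the resolution `steps` are illustrative choices, not from the paper):

```python
import math

def ratio(n, theta, steps=20000):
    """R_{n,theta}: fraction of the hypersphere surface inside a cone of
    half-angle theta, computed as a ratio of integrals of sin^(n-2)."""
    def integral(upper):
        h = upper / steps
        # midpoint rule for the integral of sin(a)^(n-2) from 0 to upper
        return sum(math.sin((k + 0.5) * h) ** (n - 2) for k in range(steps)) * h
    return integral(theta) / integral(math.pi)

# For a fixed half-angle theta = pi/4 (i.e., d/r = sqrt(2)), the miss
# probability q = R_{n,theta} falls rapidly as the dimension n grows:
for n in (2, 3, 5, 10, 20):
    print(n, ratio(n, math.pi / 4))
```

In two dimensions this gives θ/π = 0.25, and in three dimensions (1 - cos θ)/2 ≈ 0.146, consistent with the tangent construction of Figure 1.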
R_{n,θ} = ∫_0^θ sin^{n-2}(α) dα / ∫_0^π sin^{n-2}(α) dα

This ratio R_{n,θ} gives the probability that the line segment BC (in Fig. 1, generalized here to n dimensions) fails to pass through the hypersphere, and is therefore equal to q. In Figure 2 we plot the ratio R_{n,θ} against the dimensionality n, for values of θ from π/4 to 31π/64. These values of θ are associated with corresponding values of d/r (see Fig. 2) between 1.41 and 20.4. For a given value of d/r, as the dimensionality n increases, the ratio R_{n,θ} tends to zero. For large n, it is almost certain that the line segment BC passes through the hypersphere H. This implies that initially the point W moves from B closer to A. Hence performance improves, at least temporarily, on task A. Returning to the assumptions stated earlier, we can now examine their validity. First, an irregular error surface ensures that training does not, in general, move W in a straight line. Second, if perturbation from A to B is achieved by training on a task B, then B is chosen from a distribution over H which may not be uniform. Third, perfect performance on task C may be associated not with one point C, but with many points that
are equivalent in that they each provide a similar mapping from input to output. W may move toward the nearest of many Cs, which is therefore not chosen from S independently of A. This may alter the probability that W passes through H. Fourth, if B lies on a hypersphere of dimensionality m < n, then the probability that spontaneous recovery occurs may be reduced. Despite these considerations, evidence of the effects predicted above can be observed. 4 Experimental Results
In two sets of experiments a net was initially trained using backpropagation on a task A. The net had 10 input, 100 hidden, and 10 output units. The hidden and output units had logistic activation functions; weights were initialized to random values in the range [-0.3, 0.3]. Task A was defined by 100 pairs of binary vectors, which were chosen (without replacement) from the set of 2^10 vectors. The members of each pair were used to train the net using batch update for 1300 training epochs, with a learning rate η = 0.02 and momentum α = 0.9.¹ After initial training on task A, the weights were perturbed from A by one of two different methods. 4.1 Experiment 1: Perturbing Weights by New Training. After training on A, the net was trained for 400 epochs on 5 new² vector pairs to perturb the weights away from A to a point B. Finally, the net was trained on a further 5 new vector pairs (task C) for 50 epochs. During training on C, the performance of the net on task A was monitored. As predicted by the analysis above, performance on task A usually showed a transient improvement as training on task C progressed. This procedure was repeated 380 times using a single A, 20 Bs, and 20 Cs.³ Figure 3 shows how many runs improved or declined in performance on task A at the end of each epoch (together with a few runs which showed no change, given the precision used in calculations). A proportion 241/380 (63.4%) showed incidental relearning on A after the first epoch, but this dropped to less than 50% after the 5th epoch. 4.2 Experiment 2: Perturbing Weights Randomly. After training on task A as above, the weights of the net were perturbed by adding uniform noise. In Experiment 1, the distance |A - B| had a mean of about 7. To make perturbations of comparable magnitude, W was perturbed from A to ¹ This was designed to be comparable to the situation described in Hinton and Plaut (1987). ² Here "new" implies that none of these vectors exists in any previous training set.
³ B and C were chosen without replacement from 20 sets of 5 vector pairs, giving 20 × 19 = 380 different possibilities.
Figure 3: Graphs of 380 runs, showing numbers improving in performance on A during 50 epochs of training on C. On the top, Experiment 1, the weight vector W was perturbed from A to B by training on task B. On the bottom, Experiment 2, W was perturbed by a randomly chosen vector of length 7 from A to B.
B by adding a random vector of length 7. As described in Experiment 1, this was repeated 380 times. The proportion of the runs that showed incidental relearning is given in Figure 3; this was 248/380 (65.3%) after the first epoch, and remained above 50% for the 50 epochs.
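The geometric prediction itself can be checked by direct simulation, independently of any learning algorithm: place B uniformly on a radius-r hypersphere around A, pick C at distance d, and test whether a step from B toward C initially reduces the distance to A, which happens exactly when (C - B) · (B - A) < 0. A minimal sketch; the function name, dimensions, and radii below are illustrative choices, not from the paper:

```python
import math
import random

def recovery_probability(n, r, d, trials=20000, seed=0):
    """Estimate P(a step from B toward C initially approaches A), with A at
    the origin, B uniform on the radius-r sphere around A, and C at distance d."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Gaussian sampling gives a uniform direction; scale it to radius r
        b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in b))
        b = [r * x / norm for x in b]
        c = [d] + [0.0] * (n - 1)  # by symmetry, C can be fixed on one axis
        # the step shrinks |W - A| iff (C - B) . B < 0 (A is at the origin)
        if sum((ci - bi) * bi for ci, bi in zip(c, b)) < 0:
            hits += 1
    return hits / trials

# With r = 1 and d = 2 fixed, recovery becomes near-certain as n grows
for n in (2, 10, 100):
    print(n, recovery_probability(n, 1.0, 2.0))
```

For n = 2 the estimate approaches 1 - arccos(r/d)/π = 2/3, matching the arc construction of Figure 1; by n = 100 it is essentially 1.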
5 Discussion
A new effect has been demonstrated, such that performance on some task A improves initially (from a degraded level of performance) when training is started on an unrelated task C. This has been analyzed in terms of the geometric properties of high-dimensional spaces. In applying this to weight-spaces, we rely on simplistic assumptions about the way training on a task relates to movement through weight-space. The effect can be observed even if these assumptions are violated, as demonstrated by experiment. The graphs show evidence of spontaneous recovery. The effect can be seen to be short-lived in the first case, in which perturbation was achieved by retraining on B, and sustained in the second case, in which perturbation was random. Only in the latter case can we expect B to have been chosen from an unbiased distribution over the surface of H, unrelated to the position of C. The graphs indicate only the probabilities of improvement, without reference to the magnitudes of these effects in individual runs. This recovery effect may be relevant to counterintuitive phenomena described in Parisi et al. (1992) and Nolfi et al. (1994), and may also contribute to the effect described in Hinton and Sejnowski (1986). It has been suggested⁴ that the effect may be related to James-Stein shrinkage (Efron and Morris 1977; James and Stein 1961). That is, reducing the variance of (net) outputs reduces the squared error at the expense of introducing a bias. It may be that training on C incidentally induces shrinkage. The observed effect is weaker than that predicted from the geometric argument given above, presumably due to the simplistic nature of the assumptions used therein. However, the effect is robust, inasmuch as it does not depend on the learning algorithm, nor on the type of net used. For this reason, the effect may have implications for spontaneous recovery of memory in the central nervous system.
Acknowledgments Funding for the authors has been provided by the E.P.S.R.C. and the Joint Council Initiative. We thank the referees for useful comments. ⁴ G. E. Hinton, personal communication.
References
Efron, B., and Morris, C. 1977. Stein's paradox in statistics. Sci. Am. 236(5), 119-127.
Hinton, G., and Plaut, D. 1987. Using fast weights to deblur old memories. Proc. 9th Annu. Conf. Cogn. Sci. Soc., Seattle.
Hinton, G., and Sejnowski, T. 1986. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, D. Rumelhart, J. McClelland, and the PDP Research Group, eds., pp. 282-317. MIT Press/Bradford Books, Cambridge, MA.
James, W., and Stein, C. 1961. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability 1, pp. 361-380. University of California Press, Berkeley, CA.
Nolfi, S., Elman, J., and Parisi, D. 1994. Learning and evolution in neural networks. Adapt. Behav. 3(1), 5-28.
Parisi, D., Nolfi, S., and Cecconi, F. 1992. Learning, behavior and evolution. In Toward a Practice of Autonomous Systems: Proceedings of the First European Conference on Artificial Life, F. J. Varela and P. Bourgine, eds., pp. 207-216. MIT Press/Bradford Books, Cambridge, MA.
Received May 15, 1995; accepted November 7, 1995.
Communicated by Klaus Obermayer
Alignment of Coexisting Cortical Maps in a Motor Control Model Yinong Chen James A. Reggia Department of Computer Science, A. V. Williams Building, University of Maryland, College Park, MD 20742 USA
How do multiple feature maps that coexist in the same region of cerebral cortex align with each other? We hypothesize that such alignment is governed by temporal correlations: features in one map that are temporally correlated with those in another come to occupy the same spatial locations in cortex over time. To examine the feasibility of this hypothesis and to establish some of its detailed implications, we studied a multilayered, closed-loop computational model of primary sensorimotor cortex. A simulated arm moving in three dimensions formed the external environment for the model cortical regions. Coexisting proprioceptive and motor maps formed and generally aligned in a fashion consistent with the temporal correlation hypothesis. For example, in simulated proprioceptive sensory cortex the map of elements responding strongly to stretch of a particular muscle matched the map of tension sensitivity in antagonist muscles. In simulated primary motor cortex the map of elements responding strongly to increased tension in specific muscles matched the map of output elements for the same muscles. These computational results suggest specific experimental measurements that can support or refute the temporal correlation hypothesis for map alignments. 1 Introduction
The function of primary motor cortex (MI) is currently not well understood. Traditionally, research has shown that neurons in MI play an important role in force exertion by individual skeletal muscles (Chang et al. 1947; Asanuma and Rosen 1972), but more recent studies indicate that MI neurons may code for movements of particular directions rather than individual muscles (Georgopoulos et al. 1986). Regardless of the role of neurons in MI, it is generally believed that these neurons make use of feedback information via afferent sensory pathways to carry out motor tasks. In particular, proprioceptive inputs play an important role in the formation of motor cortex outputs. When activation of motor neurons in MI causes contraction of muscles, information on how much each Neural Computation 8, 731-755 (1996)
@ 1996 Massachusetts Institute of Technology
732
Yinong Chen and James A. Reggia
muscle contracts and how much tension it generates is fed back to the motor cortex through primary somatosensory cortex, forming a "closed loop." How this kind of feedback information is processed and used by MI neurons is an important issue in identifying the function of MI. In this paper, we describe a motor control model that simulates map formation both in primary proprioceptive cortex [roughly Brodmann area 3a and some surrounding cortex (Wise and Tanji 1981)] and in MI. The model approximates the closed-loop structure of mammalian motor control systems while remaining computationally tractable. It is based on a simplified arm that moves in 3D space. The arm has three pairs of antagonist muscles or muscle groups receiving motor control information and providing proprioceptive information. The small portion of proprioceptive cortex and motor cortex corresponding to this arm is simulated. Training is done by supplying initial random stimulation to the motor cortex area and allowing the system to reach a stable state in response to each stimulus. There are two motivations for the work described here. First, we wanted to determine whether a closed-loop, multilayer motor control system could self-organize to form cortical feature maps that stably represent the characteristics of the simulated arm. In a model such as that described below, with both local and global sources of feedback, it is not obvious a priori that stable map formation will occur, nor how the resultant maps might relate to experimental data on mammalian sensorimotor maps. Second, we wanted to understand how the multiple maps simultaneously present in a region of sensorimotor cortex relate to each other. For example, in primary sensory cortex one can ask how the maps of muscle length (stretch) and muscle tension overlap or interrelate.
In primary motor cortex, one can ask how both of these sensory maps relate to the motor output maps, and how maps of cortical activation of different muscles interact. In particular, we examined the following hypothesis: when multiple feature maps coexist in the same region of cortex, features in one map that are temporally correlated with those in another will come to occupy the same spatial locations. Such a hypothesis seems plausible if one assumes that Hebbian learning occurs in cortex. As we show below, this hypothesis is strongly supported by the simulation results, and leads to testable predictions (i.e., it is experimentally refutable). Our model differs from most previous research on motor control systems, which has usually focused on motor control based on visual feedback information. These models usually combine visual input and motor output in a single-layer network (Kuperstein 1988; Walter and Schulten 1993; Ritter et al. 1989). This is not biologically plausible, reflecting the fact that these models are often intended for industrial applications. Some other previously proposed models are more biologically oriented (Georgopoulos et al. 1986). However, the model proposed here is different from all of these previous models in that it uses an integrated self-organizing proprioceptive input feature map and motor output map to achieve motor control tasks. To our knowledge, there has been no previous work explicitly using a proprioceptive feature map in motor control models. We previously studied a model of proprioceptive cortex in isolation (Cho and Reggia 1994), but that model did not involve motor cortex or motor neurons, so the validity of those previous results in a closed-loop system was undetermined. More importantly, the self-organization of motor output maps in MI and their alignment with sensory maps could not be examined at all. Since this earlier study did not involve a closed-loop system, the issue of stable map formation was also not such a critical concern. We found that this closed-loop system is capable of self-organizing during unsupervised learning. The maps that arise are consistent with each other and capture the mechanical constraints imposed by the model arm. For example, the sensory cortex map of the tension of a particular muscle group is found to match the sensory cortex map of the length (stretch) of its antagonist muscle. The motor output map that appears possesses some properties seen in mammalian motor cortex, such as a distributed, multifocal representation of individual muscle groups. Thus, although this model is a substantial simplification of the corresponding biological system, it captures some fundamental principles underlying map formation in mammalian motor cortex and makes interesting testable predictions. The rest of this paper is organized as follows. First, the motor control model is described and our experimental methods explained. The main results of simulations with the model are then summarized. Finally, we discuss the validity of the model and the insight it provides into biological motor control systems. 2 Structure and Functionality of the Model
The motor control model described here simulates the closed-loop path of information flow in the nervous system. Activity in MI leads to contraction of muscles that direct arm movements. Proprioceptive information from the muscles is returned to primary sensory cortex, which supplies this information to primary motor cortex, thus influencing the motor output (Fig. 1). Figure 2 shows the model arm in this system. The model arm simulated here is a significant simplification of biological reality. It has an upper arm and a lower arm, connected at the "elbow." There are six generic muscles or muscle groups, with one pair of muscle groups (extensor and flexor) controlling the movement of the lower arm, and two pairs of muscle groups (extensor and flexor, abductor and adductor) controlling the movement of the upper arm, so that the hand is able to move in three-dimensional space. For a particular set of activation values of agonist and antagonist muscles, the corresponding joint is positioned at
Yinong Chen and James A. Reggia
734
Figure 1: Schematic diagram of the closed-loop motor system: the model arm, directed by motor cortex neuron activity, supplies proprioceptive information to proprioceptive cortex (designated PI). This proprioceptive information then influences neuron activities in the primary motor cortex (MI), thereby changing the motor output commands.
a particular angle. Therefore the length of each muscle is determined. The tension of each muscle is determined by both the activation of the muscle and its length.¹ The length and tension of the muscles are then transmitted to "proprioceptive cortex" in primary sensory cortex (Brodmann area 3a and nearby regions); we will use the nonstandard label PI for this area. The transformation from muscle activations to arm proprioceptive inputs in the model is derived from arm mechanics and is summarized in the Appendix. Briefly, the difference between the activation of agonist and antagonist muscles determines the joint angle, and hence the length of the muscles. The tension of each muscle is determined by both the muscle's activation (active tension) and its length (passive tension). Figure 3 shows the structure of the interconnected neural networks in the model. There are four layers of neural elements. Each element represents a group of neurons with the same functionality (in a cortical layer, each element is analogous to a cortical column). The proprioceptive input layer consists of 12 elements representing the length and tension of
¹Biologically, length information is measured by receptors in muscle spindles embedded in parallel with muscle fibers. The tension of muscles is measured by receptors in Golgi tendon organs.
Coexisting Cortical Maps
735
Figure 2: Model arm: three pairs of muscles (indicated by the curves) control the movement of the upper arm and lower arm (indicated by the bold line segments). Two pairs of muscles control the upper arm, while the third pair controls the lower arm.
the 6 muscles. The proprioceptive cortex layer (PI layer) contains 400 elements forming a 20 by 20 two-dimensional, hexagonally tessellated layer, with each element connected to its six neighboring elements. To avoid edge effects, elements on the edges are connected with elements on the opposite edges, forming a torus. The proprioceptive input layer is fully connected to the PI layer. The motor cortex layer (MI layer) has the same size and structure as the PI layer. The PI layer is partially connected to the MI layer, with a coarse topographic ordering. That is, each element in PI is connected to its corresponding element in MI and the surrounding MI elements within a radius of four. This coarse topographic pattern of connectivity is motivated by previous experimental studies that have demonstrated topographic ordering of excitatory connections from primary sensory cortex to MI (Asanuma 1989; Jones et al. 1978; Porter et al. 1990; Yumiya and Ghez 1984).² However, neither the map formation nor the map alignment results described later are critically dependent on this coarse topographic connectivity. The lower motor neuron layer contains six elements representing the activation sent to the six muscle groups from MI. The MI layer is fully connected to the lower motor neuron layer. Weights on all of these interlayer connections are initially random. The transformation of muscle activation into proprioceptive information by the simulated arm effectively connects the lower motor neuron layer and the proprioceptive input layer, and completes the closed-loop system. In such a closed-loop system, the activation of any layer will spread into subsequent layers and in this fashion influence itself. For instance, the activity of elements in the MI layer spreads to lower motor neurons, positions the arm, activates proprioceptive inputs, activates the PI layer, and thus ultimately changes the activation pattern in the MI layer.

²Feedback connections exist from MI to PI in real cortex (Felleman and Van Essen 1991; Miyashita et al. 1994; Jones et al. 1978). In our model such "backward" connections are implicit, as explained in Reggia et al. (1992, pp. 311-312), and function differently from the forward connections. This difference is motivated by the asymmetric nature of forward and feedback connectivity between cortical regions: forward connections terminate predominantly in layer IV, while backward connections preferentially avoid layer IV (Felleman and Van Essen 1991; see Reggia et al. 1992 for further discussion). The same strength/weight governs both a forward connection and its corresponding feedback connection, so the unsupervised learning rule used in the model influences the connections in both directions.

Figure 3: Network architecture of the motor control model: 12 proprioceptive receptor elements form the proprioceptive input layer and are fully connected to the PI layer. The proprioceptive cortex layer PI and primary motor cortex layer MI are two-dimensional arrays of elements with lateral connections. The projection from PI to MI is partial, with a coarse topographic ordering. Each MI element is connected to the six lower motor neuron elements. The transformation of activity in lower motor neurons to proprioceptive input is done by a simulated arm represented by equations A.1 to A.5 in the Appendix.

The activation rule used in this network is intended to produce cortical activation patterns that are similar to those seen biologically. Biological experiments indicate that an excitatory stimulus to cortex produces a
"Mexican Hat" pattern of activity, i.e., a central region of excitation with a surrounding annular region of inhibition, due to lateral interactions in cortex (Hess et al. 1975; Gilbert 1985). We produced this Mexican Hat pattern of lateral interaction using a competitive model of cortical dynamics (Reggia et al. 1992; Sutton et al. 1994; Cho and Reggia 1994). The specific activation rule used determines the activation level a_k(t) of element k with

    d a_k(t)/dt = c_s a_k(t) + [M - a_k(t)] (in_k + ext_k)    (2.1)
Here c_s < 0 is the decay constant indicating how fast activation decays, M is the maximum value of activation, in_k represents the activation received by element k from other network elements, and ext_k represents the external input applied to element k during simulation. Constant c_p > 0 is an output gain constant, determining the fraction of activation to be output; both parameters p and q influence the degree of peristimulus inhibition (Reggia et al. 1992). The value a_k(t) represents the mean firing rate of neurons in element k at time t. During the simulation, the above activation rule (equations 2.1 and 2.2) is applied to all of the interlayer connections as well as the cortical intralayer connections. Update of activation levels occurs in all of the layers simultaneously. Synaptic weights w_ki are altered according to an unsupervised Hebb-like learning rule (competitive learning; equation 2.3), where

    a*_i = a_i, if a_i > a^θ_i; a*_i = 0, otherwise    (2.4)

and where η is a small learning constant. The value a^θ_i is the learning threshold and remains fixed throughout training. It ensures that only substantially activated elements can learn. This learning rule is similar to that used in several previous models of cerebral cortex (Kohonen 1989; von der Malsburg 1973; Ritter et al. 1989; Grajski and Merzenich 1990; Sutton et al. 1994; Cho and Reggia 1994). In this model, only the weights of the three sets of interlayer connections are changed by equation 2.3; the corticocortical connections remain constant. Since the learning rule is applied only after the arm position stabilizes, weight changes are driven by the stable points of the dynamics. This simplification was necessary for computational tractability: continuous updating of weights with each iteration of the activation dynamics is computationally impractical with the computing resources used in this work (scientific workstations with code written in C). It is only in this limited sense that this model can be called a motor "control" system.
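As a concrete illustration of this scheme, the sketch below implements a simplified leaky, saturating activation update in the spirit of equations 2.1 and 2.2, together with a thresholded Hebb-like weight update in the spirit of equations 2.3 and 2.4. It is a minimal sketch under our own assumptions: the function names, the exact update forms, and all default values are ours, not the paper's.

```python
import numpy as np

def update_activation(a, in_k, ext_k, c_s=-2.0, M=3.0, dt=0.1):
    """One Euler step of a leaky, saturating activation update
    (simplified stand-in for equations 2.1-2.2): activation decays at
    rate c_s < 0 and is driven toward the ceiling M by its input."""
    da = c_s * a + (M - a) * (in_k + ext_k)
    return a + dt * da

def hebb_update(w, pre, post, theta=0.0, eta=0.2):
    """Thresholded competitive Hebbian step (stand-in for eqs. 2.3-2.4):
    only elements whose activation exceeds the learning threshold theta
    adapt, and their incoming weights shift toward the presynaptic
    activity pattern, so weights come to approximate the inputs that
    activate an element."""
    gate = np.where(post > theta, post, 0.0)   # a*_i of equation 2.4
    return w + eta * gate[:, None] * (pre[None, :] - w)
```

In this form, an element whose activation stays below theta leaves its row of w untouched, while a strongly active element pulls its incoming weights toward the current input vector, which is the property the consistency argument in Section 3 relies on.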
3 Experimental Methods
All of the weights in the interlayer connections were initialized randomly, so the initial maps were poorly organized. Training was done by stimulating the MI layer, i.e., by providing activation patches at randomly selected positions in MI. The system was driven by this initial stimulation, and the subsequent activation was determined by the activation rule and feedback information via the closed-loop system. Without clamping any element's activation value, the system is able to get sufficient feedback information, and no external influence (except the initial stimulation) is exerted on any layer in the system. Learning was conducted after the system achieved stabilized activation levels in all of the layers. All weights were trained at the same time. This training method is motivated by the presumed experiences of an infant exploring space without visual guidance. An infant may initiate random activation patterns in motor cortex that result in arm movement. By associating the feedback information received from the proprioceptive pathway with the motor commands issued, the cortex is able to self-organize. In this model, the initial stimulation to the MI layer represents input to MI from other, nonmodeled brain areas. More specifically, the training procedure was as follows:

Step 1: Establish the network, forming a four-layer, closed-loop system.

Step 2: Randomly initialize connection weights for all interlayer connections between 0.1 and 1.0.

Step 3: Apply a patch of activation (radius 1, level 0.03) at a randomly selected position in the MI layer. This patch of activation is retained throughout the learning cycle as external input ext_k(t), as indicated in equation 2.1. The supplied input activation is combined with feedback activation from PI to jointly determine the activation in the MI layer.

Step 4: Propagate the activation in MI to the lower motor neuron layer.

Step 5: Compute the resultant joint angles, muscle length, and muscle tension values of the model arm according to the transformation mechanisms described in the Appendix; then use muscle length and tension values as activation values for elements in the proprioceptive input layer.

Step 6: Propagate the proprioceptive input layer activation to and within the PI layer.

Step 7: Propagate the activation in PI to the MI layer and within the MI layer.

Step 8: Repeat Steps 4 through 7 for multiple iterations until the activation levels in each layer stabilize. Stabilization is determined to be a preset number of iterations (120), which was decided empirically by tracing the activation values over more iterations.

Step 9: Use unsupervised learning to train the interlayer connections (equation 2.3).

Step 10: Repeat Steps 3 through 9, applying the initial patch of activation stimulation at different positions in MI, for a preset number of stimuli (2000 patterns).
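The ten steps above can be collected into a single training loop. The following sketch is structural only: the propagation rule, the arm model, and the learning step are simplified stand-ins for equations 2.1-2.3 and A.1-A.5, and every name and default below is our own, not the authors' code.

```python
import numpy as np

def propagate(w, pre):
    # Simplified stand-in for the activation rule (eqs. 2.1-2.2):
    # here just a squashed weighted sum.
    return np.tanh(w @ pre)

def arm_model(muscle_act):
    # Toy stand-in for the arm mechanics of the Appendix (eqs. A.1-A.5):
    # maps 6 muscle activations to 12 proprioceptive values.
    return np.concatenate([1.0 - muscle_act, muscle_act])

def learn(w, pre, post, eta=0.1, theta=0.0):
    # Thresholded Hebb-like rule in the spirit of eqs. 2.3-2.4.
    gate = np.where(post > theta, post, 0.0)
    return w + eta * gate[:, None] * (pre[None, :] - w)

def train(n_stimuli=2000, n_settle=120, n_cortex=400, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: random initial interlayer weights in [0.1, 1.0].
    w_prop_pi = rng.uniform(0.1, 1.0, (n_cortex, 12))
    w_pi_mi = rng.uniform(0.1, 1.0, (n_cortex, n_cortex))
    w_mi_lmn = rng.uniform(0.1, 1.0, (6, n_cortex))
    for _ in range(n_stimuli):                 # Step 10: many stimuli
        ext = np.zeros(n_cortex)
        ext[rng.integers(n_cortex)] = 0.03     # Step 3: random MI patch
        mi = ext.copy()
        for _ in range(n_settle):              # Step 8: settle the loop
            lmn = propagate(w_mi_lmn, mi)      # Step 4: MI -> motor neurons
            prop = arm_model(lmn)              # Step 5: arm -> proprioception
            pi = propagate(w_prop_pi, prop)    # Step 6: input -> PI
            mi = propagate(w_pi_mi, pi) + ext  # Step 7: PI -> MI
        # Step 9: unsupervised learning on all interlayer connections.
        w_prop_pi = learn(w_prop_pi, prop, pi)
        w_pi_mi = learn(w_pi_mi, pi, mi)
        w_mi_lmn = learn(w_mi_lmn, mi, lmn)
    return w_prop_pi, w_pi_mi, w_mi_lmn
```

The defaults mirror the counts in Steps 8 and 10 (120 settling iterations, 2000 stimuli); smaller values can be passed in for quick experiments.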
After training is complete, the trained network is examined to see whether maps have formed and, if so, to characterize them. The parameter values used in equations 2.1-2.4 of the model in producing the results described in the next section are summarized in Tables 1 and 2. The learning threshold a^θ is 0, except a^θ = 0.32 from MI to the lower motor neuron layer. Selection of some parameters was motivated by our previous experience with cortical modeling (Sutton et al. 1994; Cho and Reggia 1994). Other parameters were obtained empirically in preliminary simulations so that three things were true: (1) the activation values of elements in each layer fell within reasonable ranges; (2) intracortical inhibition was sufficient for distinct features to emerge when maps were formed; and (3) a reasonable learning speed was achieved. For example, the relatively large value of q between the proprioceptive inputs and PI, and the large value of p between MI and the lower motor neurons, allowed the input stimuli to MI to more quickly influence neurons immediately "downstream" in the closed loop and more slowly influence more distant neurons in PI. This was found empirically to lead to much better map formation. Other parameters, such as c_s, M, and c_p, were set so that the activation level of elements was mostly between 0 and 1. Although the simulation results reported here are based on only one set of parameters, qualitatively similar results may be obtained from a variety of parameter values. In general, a small variation of any of the parameters will produce qualitatively similar results. For example, we found that using all-zero learning thresholds gave similar results. More extreme variations of parameters may yield different maps, but maps generated by these variations may still preserve the general properties presented in this paper. For example, the lateral connectivity radius in cortical layers was increased from 1 to 2 or more.
While this resulted in larger activation clusters in the cortical layers, the qualitative results presented in the next section still hold. Parameters used here are among those giving the best results we observed, but there is no guarantee that they are optimal. In general, the simulation results reported here are robust.

Table 1: Parameters Used in Activation Update Rule

Parameter   PI layer   MI layer   Motor neurons
c_s         -4.0       -2.0       -2.0
M            5.0        3.0        1.0

Table 2: Parameters Used in Activation Dispersal Rule and Learning Rule

Parameter   Arm to PI   PI to PI   PI to MI   MI to MI   MI to Motor
q           0.1         0.0001     0.0001     0.0001     0.0001
p           1           1          1          1          2
c_p         0.8         0.8        0.6        0.4        0.05
η           0.2         NA         0.2        NA         0.1

After training, the maps in different cortical layers were examined. These maps included the MI input and output maps, and the PI input and output maps. The measurement of maps was analogous to methods used in biological experiments. Generally, the input maps are measured by supplying different input stimuli and recording the cortical activations; the output maps are measured by stimulating cortical elements and recording the activations in the lower motor neuron layer. For each kind of map, there are two slightly different ways of showing it. One way is to represent an element by the kind of stimulus to which its response is strongest. The second way is to show an element with all features to which it responds strongly (above a certain threshold). The first way emphasizes the most prominent feature, while the second emphasizes multiple prominent features. The nature of a map is more clearly illustrated by using both kinds of map.

The convergence of training is determined in two ways. One way is to continue to train the networks for more learning cycles, to see whether further training causes qualitatively different maps. In this fashion it was found that the feature maps no longer reorganized after 2000 learning cycles. Another way is to examine the input-output consistency of the system. This is done by stimulating (and clamping) the elements in the lower motor neuron layer one by one, and recording the corresponding activation pattern in MI. These activation patterns are then compared to the MI outgoing weights and to the corresponding elements in the lower motor neuron layer. If they match well, then the system is self-consistent. From a computational point of view, this kind of consistency indicates the convergence of training. A property of the unsupervised learning rule we used is that incoming weights will always shift to approximate the input activation patterns that activate a cortical element. Since the MI outgoing weights, which are the incoming weights of the lower motor neuron layer, matched well with the MI activations, the weights should no longer shift as long as the nature of the input patterns does not change, and training has essentially been completed.
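This consistency check amounts to comparing, for each lower motor neuron, the MI activation pattern it evokes with the MI outgoing weights onto it. A minimal sketch follows; cosine similarity is our choice of match score, and the input format is an assumption of this sketch.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity as a simple measure of pattern match.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def self_consistency(mi_patterns, w_mi_lmn):
    """mi_patterns[j]: MI activation pattern recorded while lower motor
    neuron j is stimulated (and clamped); w_mi_lmn[j]: MI outgoing
    weights onto neuron j. High similarity for every j indicates a
    self-consistent closed loop, i.e., further training would leave the
    weights essentially unchanged."""
    return [cosine(p, w) for p, w in zip(mi_patterns, w_mi_lmn)]
```

A perfectly matching pattern/weight pair scores 1.0; unrelated (orthogonal) patterns score 0.0.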
If the system is not self-consistent, the weights would keep following the activation patterns. It has been found that after 2000 learning cycles, the system achieved self-consistency.

Table 3: Labeling of Muscle Length and Tension in Illustrations

Length (tension)   Muscle
E (e)              Upper arm Extensor
F (f)              Upper arm Flexor
B (b)              Upper arm aBductor
D (d)              Upper arm aDductor
O (o)              Lower arm extensor or Opener
C (c)              Lower arm flexor or Closer

4 Results
To summarize the simulation results, the resultant input and output maps are illustrated, and then a comparison between input and output maps is given. The cortical elements that are tuned to multiple sensory features or control multiple muscles are also studied. The symbols used to represent map features are given in Table 3. For input maps, capital letters indicate cortical elements active when the corresponding muscle is stretched (increased length), while lowercase letters indicate elements activated by increased tension in the corresponding muscle. For the output map, only capital letters are used, to represent muscle contraction/activation.

4.1 Input Map Formation and Characteristics. The measurement of input maps was done by applying 12 different activation patterns to the proprioceptive input layer, in each of which the activation of one of the 12 proprioceptive elements was nonzero, while for all the other elements it was zero. This is analogous to stimulating proprioceptive receptors from a single muscle group and measuring the resulting cortical activities. In this experiment, the input maps in both the PI and MI layers were characterized. Figure 4 shows the PI input map, before (left) and after (right) training, using the symbols in Table 3. Each symbol in the map represents the feature to which the element in the corresponding location is most sensitive (i.e., the largest activation value above threshold). Those elements that responded below threshold to all inputs are represented as "-". For example, the element in the lower left corner of the PI layer was not tuned above threshold to any specific muscle length or tension before training, but was tuned to upper arm adductor length (D) after training.
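The input map measurement just described can be sketched as follows. Here respond() stands in for running the full network to stability on one input pattern; it, along with every name and default, is an assumption of this sketch.

```python
import numpy as np

def measure_input_map(respond, n_inputs=12, threshold=0.4):
    """Apply one-hot stimuli to the proprioceptive input layer and label
    each cortical element by the feature evoking its largest response;
    elements below threshold for every input are marked -1 (shown as
    '-' in the figures)."""
    responses = []
    for i in range(n_inputs):
        stim = np.zeros(n_inputs)
        stim[i] = 1.0                    # stimulate one receptor group
        responses.append(respond(stim))
    r = np.array(responses)              # (n_inputs, n_elements)
    best = r.argmax(axis=0)              # preferred feature per element
    tuned = r.max(axis=0) > threshold
    return np.where(tuned, best, -1)
```

Raising the threshold shrinks the set of labeled elements, which is why the display threshold (0.4 in the figures) affects map details but not the qualitative picture.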
Figure 4: Sensitivity of PI elements to muscle length before (left) and after (right) training (threshold = 0.4). The numbers at the bottom represent the number of elements tuned to each of the six muscle length features, in the same order as they are listed in Table 3.
Figure 5: Tuning of PI elements to the length of the upper arm extensor before (left) and after (right) training (threshold = 0.4).
The map shown in Figure 4 is difficult to understand. Figure 5 shows the same PI input map in another way, illustrating only those elements ("E" in Fig. 4) tuned sufficiently strongly to the length of the upper arm extensor. Because this kind of figure gives a better indication of the distribution of the responding elements, it is used in the following, as long as there is no qualitative difference between different muscles. From Figure 5, it is clear that after training, elements tuned to the same proprioceptive feature formed clusters that are generally uniform in size and shape, and had centers arranged in a regular distribution. The maps corresponding to other proprioceptive features showed similar qualities. This kind of regularity indicates that a map has organized in the model PI layer with respect to proprioceptive features. The details of this map vary somewhat depending on the exact display threshold used (0.4 here), but the basic results remain the same. In addition, detailed study of the PI map in isolation shows that although variation in intracortical lateral connection radius, intensity of lateral inhibition, and overall network size affects map details, the same qualitative results still hold (Cho, Jang, and Reggia 1994).

Figure 6 shows both the length and tension maps in the PI layer after training. By comparing these maps in the proprioceptive cortex layer, one can see that the length map of a particular muscle matches well with the tension map of its antagonist muscle. For example, the length map of the upper arm extensor matches the tension map of the upper arm flexor (Fig. 6a and d), and the length map of the upper arm flexor matches the tension map of the upper arm extensor (Fig. 6b and c). This type of relationship between length and tension maps is a result of training, i.e., it is not present prior to training.
Since the activation of one muscle (increased tension) causes it to contract, thus stretching its antagonist muscle (increased length), there is a correlation between one muscle's tension and its antagonist's length in each input pattern. The maps capture the correlated features of the input patterns, reflecting the mechanical constraints imposed by the model arm.

Figure 7 shows the length and tension maps in the MI layer after training, measured in the same manner as the PI input maps. The input maps in this layer undergo a transformation when compared with the input maps in the PI layer. While clusters formed in this layer with a certain degree of regularity during training, it is apparent that the clusters in these posttraining maps are less uniform in size, shape, and regularity, compared to the corresponding posttraining PI input maps (Fig. 6). However, the same internal relationships still hold for the MI input map as for the PI input map: the length map of a particular muscle matches well with the tension map of its antagonist muscle. This indicates that the MI layer, although a step further away from the model arm where the mechanical constraints exist, still captures this feature of the input patterns.
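This mechanical correlation is easy to verify numerically. The toy arm below is our own construction (one agonist/antagonist pair with the qualitative length and tension relations described above, not the Appendix equations); it shows that a muscle's tension correlates positively with its antagonist's length and negatively with its own length across random activations.

```python
import numpy as np

rng = np.random.default_rng(1)
act = rng.uniform(0.0, 1.0, (1000, 2))   # columns: flexor, extensor

# The activation difference sets the joint angle, hence the lengths...
angle = act[:, 0] - act[:, 1]
len_flexor = 0.5 - 0.5 * angle
len_extensor = 0.5 + 0.5 * angle
# ...and tension combines active (activation) and passive (stretch) parts.
ten_flexor = act[:, 0] + 0.5 * len_flexor

# Flexor tension vs. antagonist (extensor) length: positive correlation,
# so Hebbian learning can co-locate these two features in the map.
r_antagonist = np.corrcoef(ten_flexor, len_extensor)[0, 1]
# Flexor tension vs. its own length: negative correlation, so the
# "implausible" length/tension pairing is not reinforced.
r_own = np.corrcoef(ten_flexor, len_flexor)[0, 1]
```

With these toy relations the two correlations come out clearly positive and clearly negative, respectively, which is the structure the learned maps capture.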
Figure 6: PI elements that are tuned above threshold to selected proprioceptive stimuli after training (threshold = 0.4): (a) elements tuned to length of upper arm extensor, (b) elements tuned to length of upper arm flexor, (c) elements tuned to tension of upper arm extensor, (d) elements tuned to tension of upper arm flexor.
Figure 7: MI elements tuned above threshold to selected proprioceptive stimuli after training (threshold = 0.4): (a) elements tuned to length of upper arm extensor, (b) elements tuned to length of upper arm flexor, (c) elements tuned to tension of upper arm extensor, (d) elements tuned to tension of upper arm flexor.

4.2 Motor Output Map Formation and Characteristics. The MI output map was examined after training by stimulating each MI element one by one and seeing which muscle(s) became activated. In determining this map, it proved to be sufficient to examine the weights from MI to the lower motor neuron layer. Figure 8 shows the MI output weight map for the upper arm extensor muscle (E). Each "E" means that the weight from MI to the lower motor neuron controlling the upper arm extensor is above a given threshold. Maps for other muscles show similar features. Comparing Figure 8a and b, it is apparent that clusters formed during training. These clusters are larger and more irregular than those in the PI input map. Some clusters suggest a tendency to form stripes. Although these clusters are not uniform in size and shape, they are similar to the
activation patterns actually seen in mammalian MI cortex (Donoghue et al. 1992).

Figure 8: MI output map before (left) and after (right) training for the upper arm extensor (threshold = 0.4).
4.3 Consistency Between Input and Output Maps. In the previous paragraphs it was shown that the appearance of the MI input map is quite different from the PI input map (compare Fig. 6 and Fig. 7), and the projection from the PI layer to the MI layer is not in the previously expected topographic order. This raises the question of the nature of the relationship of the MI input map to the MI output map. Figure 9a and b shows the MI output maps of the upper arm extensor and flexor, marked by E and F, respectively, based on the MI output weights. Figure 9c and d shows the MI proprioceptive maps with regard to the length and tension of the upper arm extensor, respectively, based on the activation of MI elements (above threshold) when the length or tension feature is present in the proprioceptive input layer. By comparing the MI output and input maps, it is seen that the MI input map of a particular muscle's length matches well with the MI output map of its antagonist muscle, while the MI input map of a particular muscle's tension matches well with the MI output map of the corresponding muscle. For example, the MI proprioceptive length map of the upper arm extensor matches well with the MI output map of the upper arm flexor (compare Fig. 9c and b); the MI proprioceptive tension map of the upper arm extensor matches well with the MI output map of the same muscle (compare Fig. 9d with a). The reason for this is that when a muscle is activated in producing a movement, it contracts, and its length typically decreases, while its antagonist muscle's length is increased accordingly. At the same time, the activated muscle is under increased tension. Therefore activation of a muscle typically generates proprioceptive feedback indicating increased stretch of its antagonist muscles and increased tension of itself. This kind of correlated activation of muscle length and tension feedback is captured by the model and reflected in the maps, such as those in Figure 9.

Figure 9: Comparing posttraining MI output maps (threshold = 0.7) of the upper arm extensor (a) and flexor (b) with MI input maps (threshold = 0.4) of length (c) and tension (d) of the upper arm extensor.
Table 4: Numbers of Implausibly Tuned PI and MI Layer Elements (Threshold = 0.5)
4.4 Elements Tuned to Multiple Features. In both the PI and MI layers, there are elements that became tuned to multiple proprioceptive input features. Some of these tunings are potentially incompatible with the constraints imposed by the mechanics of the model arm. For instance, it seems unlikely that a PI element would be tuned to both a muscle's length and its tension together, since a muscle does not usually contract (high tension) and lengthen simultaneously in the model arm. Another implausible case would be a PI element tuned to the stretch of two antagonist muscles, since they cannot be stretched at the same time. Table 4 shows the number of PI and MI elements tuned to implausible pairs of inputs before and after training. On the first line of the table, implausible tuning pairs are given using the symbols in Table 3. For example, the label (E,F) indicates that a cortical element is tuned to the length of the upper arm extensor and flexor simultaneously, i.e., that it is activated above threshold when either of these muscles is stretched. Following each label in the same column are the numbers of cortical elements that are tuned to the indicated pair of features. The number of PI and MI elements tuned to implausible pairs decreased to 0 during training. This is clear evidence that the model learned the correlations between proprioceptive features arising due to the constraints of the model arm. It should be noted that the plausibility of map features here is predicated on the specific details of the model arm used. Thus, maps in our model would be unable to capture some correlations between muscle tension and length occurring with real movements. The key point is that the feature maps do not represent implausible relationships for the given arm model. MI elements that control multiple muscles were also examined. Figure 10 shows the MI elements that have strong connections to multiple muscles after training.
At a threshold of 0.4, there were, among 400 elements, 90 elements having strong connections to multiple muscles, and 16 of them controlled 3 muscles. With a higher threshold, the number of multiple control elements decreases. A careful examination of these
Coexisting Cortical Maps
Figure 10: Map of MI elements strongly activating multiple muscles (threshold = 0.4).
multiple controlling elements shows that most of them (85 out of 90) control muscles acting along different coordinates. For example, as shown in Figure 10, the element in the top row and leftmost column can activate the upper arm extensor (E), upper arm abductor (B), and lower arm extensor (O), each belonging to one of the three pairs of antagonist muscles. This type of element is capable of producing coordinated movement of the arm toward a particular direction, in this specific case toward the upper back part of space, relative to a body-centered coordinate system. This result is consistent with the observation that some neurons in motor cortex code for movement direction. It also provides a testable prediction about the control of muscles by motor cortex neurons that could be verified by biological experiments. It should be pointed out that in the mammalian motor system, the control of multiple muscles by individual MI neurons can be implemented via lower brain and spinal circuitry, so our model is by no means analogous to biological systems in terms of actual neural circuitry. This result indicates that, via training, it is possible to produce this type of multiple muscle control in a more general sense from initially random connections.
4.5 Sensitivity to Simultaneous Weight Adaptation. Our model makes the assumption that all three sets of connection weights (sensory neurons to PI, PI to MI, and MI to lower motor neurons) mature simultaneously. While relatively little is known about the precise development of these connections, there is some evidence that the PI to MI connections develop later than the others (Bruce and Tatton 1980), and thus the
Yinong Chen and James A. Reggia
developmental assumption in our model should be viewed as only a first approximation to reality. To examine this issue, we undertook a single simulation with the same parameter values used in the simulation described above, where training was done sequentially. Specifically, we allowed sensory connections to learn first (2000 iterations), then those connections plus MI to lower motor neuron connections to learn (2000 iterations), then all connections to learn (2000 iterations), motivated by data in Bruce and Tatton (1980). We used a smaller learning rate of 0.05 on sensory connections to PI to compensate for their longer total training time. The maps obtained and their alignments were qualitatively similar to those described above, although for one of the six muscle length inputs the alignment with motor outputs was not precise. We believe that a substantial joint learning phase is necessary for complete map alignment to occur. This result, plus the fact that qualitatively similar maps appear in isolated PI when it is trained by randomly positioning the arm (Cho and Reggia 1994), suggests that the results obtained here are not sensitive to the exact developmental order of connection maturation.

5 Discussion
Self-organizing feature maps have become important neural modeling methods over the last several years. They have not only shown great potential in application fields such as motor control, pattern recognition, and optimization (Ritter et al. 1989; Kohonen 1989; Angeniol et al. 1988), but have also provided insight into how the mammalian brain becomes organized (von der Malsburg 1973; Linsker 1986; Grajski and Merzenich 1990; Burnod et al. 1992; Weinrich et al. 1994; Sutton et al. 1994). The computational motor control model described here exhibited properties that are consistent with experimental findings involving biological motor control systems. It also provides us with knowledge about the organizing and processing of sensory and motor information along the input-output pathway. Some properties of the model are summarized as follows. First, this model has shown spontaneous emergence of multiple feature maps during unsupervised learning. The model self-organized from initially random connections. These results indicate that, although this model is a significant simplification of reality, it has captured the basic structure and some principles of biological motor control systems. The fact that the model self-organizes into multiple feature maps that are stable in spite of its closed-loop nature suggests that the underlying assumptions (network connectivity, activation dynamics, unsupervised learning, etc.) can account for some important aspects of proprioceptive and motor map formation in mammalian cortex. We believe that these map formation results do not depend significantly on the specific form of
the activation rule used in the model (equations 2.1 and 2.2), as long as a clear-cut Mexican Hat pattern of lateral interactions occurs in the cortex (Reggia et al. 1992). For example, we have obtained qualitatively similar results when previous cortical map formation experiments using activation rules similar to those used here (Cho and Reggia 1994; Sutton et al. 1994) were reimplemented using more standard activation functions. Second, the maps formed capture the mechanical constraints of the simulated arm. Analysis of the proprioceptive input maps showed that the same elements were tuned to the length/stretch of a particular muscle and to the tension of its antagonist muscle. This is true for both cortical layers. It indicates that cortical elements are capable of recognizing the correlations in the input patterns. This model also showed a consistent relationship between the proprioceptive input maps and motor output maps. It was found that the set of MI elements that controls a particular muscle usually responds to the tension of this muscle and the length of the antagonist muscle. These results are biologically plausible and consistent with the temporal correlation hypothesis presented in the Introduction. Third, the motor output map generated in this model qualitatively resembles the map in mammalian motor cortex. Many experiments have been conducted on mammalian motor cortex, one of which is a systematic mapping of primate forelimb motor cortex (Donoghue et al. 1992). In that experiment, several major findings indicated that the organization of motor cortex is more complicated than previously thought:

• Property 1. Neurons representing the same muscle form separated, widely distributed clusters.

• Property 2. The size and shape of clusters representing the same muscle differ significantly from muscle to muscle and from subject to subject.

• Property 3. Many neurons in motor cortex can activate multiple muscles.

• Property 4.
No apparent topographic relationships were found in the forelimb area of motor cortex.

Properties 1 and 4 are apparent in our computational model. Property 2 is also apparent when comparing the regularity of the proprioceptive input maps in PI with the irregular motor output maps in MI (compare Fig. 6 and Fig. 9a and b). Property 3 emerges in our model via unsupervised learning. As indicated by Figure 10, many MI elements control multiple muscles. Also, by increasing the measurement threshold, the number of MI elements that control multiple muscles decreased. This is also consistent with experimental data showing that stronger stimulation tends to recover more multiple-muscle neurons (Donoghue et al. 1992). Our computational motor control model also provides testable predictions that can be verified or refuted by future biological experiments, as follows. First, to our knowledge, there has been no systematic mapping
conducted on mammalian proprioceptive cortex. Thus, the characteristics shown in the cortical proprioceptive input maps, such as regular clusters of elements tuned to the same muscle tension/length, represent testable predictions, although we would not expect as precise regularity as occurs in our simplified model. The relationship between muscle length and tension features is also yet to be verified in biological experiments. Our motor control model also shows that proprioceptive sensory maps formed in the MI layer after training. These latter maps exhibit the same properties as seen in the PI layer. On the other hand, the proprioceptive maps in the MI layer, although they capture the same constraints, differ from the maps in the PI layer in terms of cluster size and shape. Analysis of the model reveals that this is due to the weights on connections from the PI layer to the MI layer. Even though the connections from the PI to the MI layer were initially coarsely topographic, training did not refine this topographic projection, and the resultant weights became complicated and could not be characterized by any simple property. The view that neurons in MI code for the force of exertion of individual muscles is controversial. Some of the neurons in MI activate multiple muscles (Donoghue et al. 1992), suggesting that these neurons might code for movement direction rather than individual muscles (Georgopoulos et al. 1986). With our computational model, it was found that most multiply tuned MI elements controlled muscles in different muscle group pairs and thus their activation tends to move the hand toward a particular direction. This finding is consistent with biological experiments showing that motor neurons tend to project to synergistic muscles (Cheney and Fetz 1985) and with demonstrations that neurons in motor cortex code for movement directions (Georgopoulos et al. 1986).
Experimentally, stimulation of motor cortex neurons tends to excite one muscle and inhibit its antagonist muscle, thus causing synergistic movements (Cheney and Fetz 1985). Whether the activation of motor cortex neurons activates muscles in different joints (or in the same joint but along different movement dimensions) is an interesting issue for future biological experiments. It should be noted that our computational model is built without a priori discrimination between muscles with respect to being antagonists or operating at differing joints; it is the training that distinguishes the muscles in different pairs.
Appendix: Deriving Proprioceptive Input from Muscle Activations

The transformation converting activation received by muscles to proprioceptive input is summarized here. First, activation of agonist and antagonist muscles determines the joint angle
where the joint angle θ ranges over [-π/2, π/2]. The values in_ag and in_ant are the input activations to the agonist and antagonist muscles, respectively, representing any of the three pairs of muscles. Once the joint angle is determined, the lengths of a pair of antagonist muscles can be derived as
l_ag = l cos(π/4 + θ/2),    l_ant = l cos(π/4 - θ/2)    (A.2)
where l_ag and l_ant are the lengths of the agonist and antagonist muscles, and l is a constant representing the length of an appendage.³ The tension of each muscle is determined by two factors. The contraction caused by activation of a muscle produces tension, while the stretch of a muscle also produces tension due to the spring-like elasticity of muscle:
where T_ag and T_ant represent the tension of the agonist and antagonist muscles, respectively, and T is a passive tension constant. For a more detailed description of this transformation, see Cho and Reggia (1994).
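In code, the transformation for one antagonist pair might look as follows. The length relation follows the cosine form of equation A.2; the linear mapping from activation difference to joint angle and the additive active-plus-passive tension terms are assumptions made for illustration, as are the function name and default constants:

```python
import math

def proprioceptive_input(in_ag, in_ant, l=1.0, T=0.1):
    """Map agonist/antagonist activations (each in [0, 1]) to joint angle,
    muscle lengths, and muscle tensions for one antagonist muscle pair."""
    # Assumed: the activation difference sets the joint angle, spanning
    # the stated range [-pi/2, pi/2].
    theta = (math.pi / 2) * (in_ag - in_ant)

    # Equation A.2: muscle length as a function of joint angle; the agonist
    # shortens and the antagonist stretches as theta grows.
    l_ag = l * math.cos(math.pi / 4 + theta / 2)
    l_ant = l * math.cos(math.pi / 4 - theta / 2)

    # Assumed tension: active contraction plus spring-like passive stretch,
    # with passive tension constant T.
    T_ag = in_ag + T * l_ag
    T_ant = in_ant + T * l_ant
    return theta, (l_ag, l_ant), (T_ag, T_ant)
```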
Acknowledgments

This work was supported by NIH awards NS29414 and NS16332. J. Reggia is also with the Department of Neurology and the Institute for Advanced Computer Studies. We thank John Donoghue and two reviewers for helpful comments.
References

Angeniol, B., de La Croix, V. G., and Le Texier, J. 1988. Self-organizing feature maps and the traveling salesman problem. Neural Networks 1, 289-294.

Asanuma, H. 1989. The Motor Cortex. Raven Press, New York.

Asanuma, H., and Rosen, I. 1972. Topographical organization of cortical efferent zones projecting to distal forelimb muscles in the monkey. Exp. Brain Res. 14, 243-256.

Bruce, I. C., and Tatton, W. G. 1980. Sequential output-input maturation of kitten motor cortex. Exp. Brain Res. 39, 411-419.

Burnod, Y., Grandguillaume, P., Otto, I., Ferraina, S., Johnson, P. B., and Caminiti, R. 1992. Visuomotor transformations underlying arm movements toward
³A simplified muscle geometry is used. Each muscle is connected to the middle of each arm appendage, and the muscle is assumed to lie in a straight line.
visual targets: A neural network model of cerebral cortical operations. J. Neurosci. 12(4), 1435-1453.

Chang, H., Ruch, C., and Ward, A. A., Jr. 1947. Topographical representation of muscles in motor cortex of monkeys. J. Neurophysiol. 10, 39-56.

Cheney, P. D., and Fetz, E. E. 1985. Comparable patterns of muscle facilitation evoked by individual corticomotoneuronal (CM) cells and by single intracortical microstimuli in primates: Evidence for functional groups of CM cells. J. Neurophysiol. 53, 805-820.

Cho, S., and Reggia, J. A. 1994. Map formation in proprioceptive cortex. Int. J. Neural Syst. 5, 87-101.

Cho, S., Jang, M., and Reggia, J. 1994. Effects of parameter variations on feature maps. Intl. Conf. on Neural Information Processing, Seoul, Korea, pp. 1301-1306.

Donoghue, J. P., Leibovic, S., and Sanes, J. N. 1992. Organization of the forelimb area in squirrel monkey motor cortex: Representation of individual digit, wrist, and elbow muscles. Exp. Brain Res. 89(1), 1-19.

Felleman, D., and Van Essen, D. 1991. Distributed hierarchical processing in primate cerebral cortex. Cereb. Cortex 1, 1-47.

Georgopoulos, A. P., Schwartz, A. B., and Kettner, R. E. 1986. Neuronal population coding of movement direction. Science 233, 1416-1419.

Gilbert, C. 1985. Horizontal integration in the neocortex. Trends Neurosci. 8, 160-163.

Grajski, K., and Merzenich, M. 1990. Hebb-type dynamics is sufficient to account for the inverse magnification rule in cortical somatotopy. Neural Comp. 2, 71-84.

Hess, R., Negishi, K., and Creutzfeldt, O. 1975. The horizontal spread of intracortical inhibition in the visual cortex. Exp. Brain Res. 22, 415-419.

Jones, E. G., Coulter, J. D., and Hendry, S. H. 1978. Intracortical connectivity of architectonic fields in the somatic sensory, motor and parietal cortex of monkeys. J. Comp. Neurol. 181, 291-374.

Kohonen, T. 1989. Self-Organization and Associative Memory, chaps. 3, 5. Springer-Verlag, Berlin.
Kuperstein, M. 1988. Neural model of adaptive hand-eye coordination for single postures. Science 239, 1308-1311.

Linsker, R. 1986. From basic network principles to neural architectures. Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512, 8390-8394, and 8779-8783.

Miyashita, E., Keller, A., and Asanuma, H. 1994. Input-output organization of the rat vibrissal motor cortex. Exp. Brain Res. 99, 223-232.

Porter, L. L., Sakamoto, T., and Asanuma, H. 1990. Morphological and physiological identification of neurons in the cat motor cortex which receive direct input from the somatic sensory cortex. Exp. Brain Res. 80, 209-212.

Reggia, J. A., D'Autrechy, C. L., Sutton, G. G., and Weinrich, M. 1992. A competitive distribution theory of neocortical dynamics. Neural Comp. 4, 287-317.

Ritter, H., Martinetz, T., and Schulten, K. 1989. Topology-conserving maps for learning visuomotor-coordination. Neural Networks 2, 159-168.

Sutton, G. G., Reggia, J. A., D'Autrechy, C. L., and Armentrout, S. L. 1994. Cortical map reorganization as a competitive process. Neural Comp. 6, 1-13.
von der Malsburg, C. 1973. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14, 85-100.

Walter, J. A., and Schulten, K. J. 1993. Implementation of self-organizing neural networks for visuo-motor control of an industrial robot. IEEE Trans. Neural Networks 4, 86-95.

Weinrich, M., Sutton, G. G., Reggia, J. A., and D'Autrechy, C. L. 1994. Adaptation of noncompetitive and competitive neural networks to focal lesions. J. Artificial Neural Networks 1, 51-60.

Wise, S., and Tanji, J. 1981. Neuronal responses in sensorimotor cortex to ramp displacement and maintained positions imposed on hindlimb of the unanesthetized monkey. J. Neurophysiol. 45, 482-500.

Yumiya, H., and Ghez, C. 1984. Specialized subregions in the cat motor cortex. Exp. Brain Res. 53, 259-276.
Received April 3, 1995; accepted July 21, 1995.
Communicated by Helge Ritter
Controlling the Magnification Factor of Self-organizing Feature Maps

H.-U. Bauer
Institut für Theoretische Physik and SFB Nichtlineare Dynamik, Universität Frankfurt, Robert-Mayer-Str. 8-10, 60054 Frankfurt/Main 11, Germany
R. Der
Institut für Informatik, Universität Leipzig, Augustusplatz 10/11, 04009 Leipzig, Germany
M. Herrmann*
Laboratory of Neural Modeling, RIKEN, 2-1 Hirosawa, Wako-shi, 351-02 Saitama, Japan
The magnification exponents μ occurring in adaptive map formation algorithms like Kohonen's self-organizing feature map deviate from the information-theoretically optimal value μ = 1 as well as from the values that optimize, e.g., the mean square distortion error (μ = 1/3 for one-dimensional maps). At the same time, models for categorical perception such as the "perceptual magnet" effect, which are based on topographic maps, require negative magnification exponents μ < 0. We present an extension of the self-organizing feature map algorithm that utilizes adaptive local learning step sizes to actually control the magnification properties of the map. By changing a single parameter, maps with optimal information transfer, with various minimal reconstruction errors, or with an inverted magnification can be generated. Analytic results on this new algorithm are complemented by numerical simulations.

1 Introduction

The representation of information in topographic maps is a common property of many regions in the brain, including the visual, auditory, and somatosensory areas of the cortex. Many of these maps are known to be generated or refined by adaptive self-organization processes. The first theoretical description of the self-organization of orientation columns in the primary visual cortex was presented by von der Malsburg (1973); later many more map formation models were introduced. A particularly

*Former address: NORDITA, DK-2200 Copenhagen, Denmark.
Neural Computation 8, 757-771 (1996)
© 1996 Massachusetts Institute of Technology
widespread algorithm is Kohonen's self-organizing feature map (Kohonen 1995). It not only has been used to model the formation of maps in different sensory domains (see, e.g., Martinetz et al. 1988; Obermayer et al. 1990; Wolf et al. 1994), but has also found wide distribution in the technically oriented communities. Here, the self-organizing feature map is often utilized as a neighborhood-preserving vector quantizer. A general characteristic of neural maps in brains is the selective magnification of regions of interest. Regions of interest are usually those that are excited most often. Examples include the enlarged representation of the central visual field in visual areas, the enlarged representation of frequencies close to the echolocation frequency in the auditory cortex of the bat (Suga 1991), and the enlarged representation of the hand in areas 1 and 3a in somatosensory cortex (Kaas et al. 1981). It has been hypothesized that the selective magnification of maps is adjusted such that each region in the map is excited equally often. The map then transfers the maximum amount of information about the stimulus ensemble. This property is not only approximately observed in biological maps (for retinotopic maps, see, e.g., Wässle et al. 1989), but is also often regarded as a desirable design objective in technical contexts. Maps that result from self-organization processes by and large also show an increased magnification in regions that are often stimulated. The detailed magnification properties of map formation algorithms, however, deserve further investigation. This claim is substantiated by the following three arguments. First, an analysis by Ritter and Schulten (1986) clarified that for Kohonen's self-organizing feature map in the one-dimensional case, and in higher-dimensional cases that separate, the relation between the stimulus density P(w) and the magnification factor M(w) is governed by an exponent μ = 2/3,
M(w) ∝ P(w)^μ.    (1.1)
So the self-organizing feature map does not yield a maximum entropy map, which would correspond to μ = 1 (at least not in those cases that were analytically accessible so far). A similar result holds for the elastic net algorithm (Durbin and Willshaw 1987), where Gruel and Schuster (1994) found an exponent μ_elast = 0.4 for one-dimensional maps in the limit of soft string tension. The question now arises how neural maximum entropy maps could be self-organized. A second, related, argument concerns the mean distortion error properties of self-organizing feature maps. It has been shown by Zador (1982) that the mean distortion error of a neural map is optimized if the map has a particular magnification exponent. The value of the exponent depends on the order of the error as well as on the dimension of the map input space. As an example, for one-dimensional maps the mean square error is minimized by an exponent μ = 1/3, which deviates from the value 2/3 inherent to the SOFM algorithm in this case. The worst-case distortion error is optimized for maps (or vector quantizers) that have a
flat magnification, μ = 0. In conclusion, an optimization of neural maps with regard to various distortion error measures would require control of the map magnification exponents. In a third line of argument, we are concerned with modifications of maps that could allow for better categorization. Specifically, it might be appropriate to spend receptive fields at (rarely excited) class boundaries instead of at (often excited) class centers, if a classification (categorization) task must subsequently be performed. Such an inverted magnification scheme was employed in a recent model (Herrmann et al. 1994) for a phenomenon related to categorical perception (Repp 1984), namely the "perceptual magnet" effect. More details about this effect, and its relation to maps with negative magnification exponents, will be discussed in a subsequent section. For the purpose of this introduction it suffices to say that negative magnification exponents might also occur in topographic maps and that existing map self-organization algorithms do not allow for such exponents. In the present article we put forward a simple extension of Kohonen's self-organizing feature map algorithm that addresses the problems raised by the above arguments. Solely by adaptively adjusting the local learning rate, while keeping all other parts of the algorithm unchanged, we can control the magnification properties of the resulting maps. The detailed form of the learning rate control, preceded by a brief description of the unmodified version of Kohonen's self-organizing feature map algorithm (SOFM), is given in the next section. There, we also discuss how the previous analytic results on magnification exponents in SOFMs have to be modified to include the effects of the learning rate control. The third section is devoted to results of simulations of our algorithm, in particular with regard to the map distortion errors.
In the fourth section we show results for two-dimensional maps and argue how these could provide a neurobiological basis for categorical perception, as exemplified by the “perceptual magnet“ effect. A discussion, which addresses the comparison of our algorithm to other optimization efforts in the context of competitive learning, as well as the relevance of our model to auditory recognition experiments, concludes the paper.
2 Self-organizing Feature Maps with Node-Dependent Adaptability
As a basis for our magnification control algorithm, we use Kohonen's SOFM. Not only can it be regarded as a standard algorithm due to its wide distribution, but in addition there are analytical results already known on its magnification properties (Ritter and Schulten 1986). An SOFM consists of neurons that are located at positions r in an output space grid A, and that have receptive fields with centers w_r in an input space V associated with them. A stimulus v ∈ V is mapped onto that neuron
s ∈ A whose receptive field center w_s lies closest to v,

s = argmin_r |w_r - v|.    (2.1)
During an adaptation phase, the receptive field center positions w_r are adjusted such that the resulting map spans the input space in a topographic fashion. A sequence of random stimuli is presented to the map and the respective best-matching neuron s is determined. Then, the receptive field center of s plus those of its output space neighbors are shifted toward the stimulus,

Δw_r = ε h_rs (v - w_r),    (2.2)

where the property of being an output space neighbor is imposed by the neighborhood function h_rs, usually of gaussian shape,

h_rs = exp(-|r - s|² / 2σ²).    (2.3)

In this way, the topography of the map is ensured, i.e., neighboring neurons in the output space are made to have neighboring receptive fields (the inverse relation, that neighboring positions in stimulus space project onto neighboring neurons, does not necessarily hold). A comprehensive treatment of many theoretical and application-related aspects of the SOFM can be found in Ritter et al. (1992) and Kohonen (1995). A characteristic of neural maps is their areal magnification factor, first introduced by Daniel and Whitteridge (1961). The magnification factor is given by the density M(w) of receptive field centers in the map input space. (In the continuum approximation, the position of receptive field centers w varies continuously in the input space, as does the position of stimuli v.) Often M(w) is related to the stimulus density P(w) via a magnification exponent μ,
M(w) ∝ P(w)^μ.    (2.4)
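A single adaptation step of the standard SOFM (equations 2.1-2.3) is compact to state in code. The map size, learning rate ε = 0.1, neighborhood width σ = 1, and training schedule below are illustrative choices, not values from the text:

```python
import numpy as np

def sofm_step(w, v, eps=0.1, sigma=1.0):
    """One SOFM update for a 1-D chain of neurons: pick the best-matching
    neuron (equation 2.1), then shift it and its output-space neighbors
    toward the stimulus (equations 2.2 and 2.3). w has shape (N, d)."""
    r = np.arange(len(w))
    s = int(np.argmin(np.linalg.norm(w - v, axis=1)))   # equation 2.1
    h = np.exp(-(r - s) ** 2 / (2 * sigma ** 2))        # equation 2.3
    w += eps * h[:, None] * (v - w)                     # equation 2.2
    return s

# Train a small map on uniform stimuli in [0, 1].
rng = np.random.default_rng(1)
w = rng.random((50, 1))
for _ in range(5000):
    sofm_step(w, rng.random(1))
```

With a fixed σ the resulting receptive field centers span the input space topographically; annealing ε and σ, as is common in practice, sharpens the final map.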
An analysis by Ritter and Schulten (1986) showed that such a relation with an exponent μ = 2/3 also holds for one-dimensional SOFMs in the continuum limit [and, under certain conditions, for maps of higher output dimension (see Section 3.4)]. So one-dimensional SOFMs do magnify regions of high stimulation, but not sufficiently to be information-theoretically optimal (see the introduction). Therefore, and for the other reasons listed in the introduction, we now proceed to modify the standard SOFM to allow for a control of the magnification behavior. To this purpose, we introduce adaptive node-dependent adaptabilities ε_r (Der and Herrmann 1992) and replace equation 2.2 by the modified learning rule

Δw_r = ε_s h_rs (v - w_r).    (2.5)
Further, we require the local adaptabilities ε_r to depend on the stimulus density P at the position of the receptive field center w_r associated with r,

⟨ε_r⟩ = ε_0 P(w_r)^m.    (2.6)
Here m is a new free parameter of the learning rule that will allow us to actually control the magnification exponents. The question arises how we can enforce relation 2.6 during learning, when the stimulus density P(w_r) is not known. Here, we can rely on information already acquired by the network, and can exploit the relation

P(w_r) ∝ P(r) M(w_r)    (2.7)

between the stimulus density P(w_r), the receptive field center density M(w_r), and the probability P(r) of the neuron at r being the best-matching node. Assuming independence between successive stimuli, we approximate the mean values M(w_r) and P(r) by quantities that can be computed at each individual learning step. The receptive field center density M(w_r) is inversely proportional to the volume of the respective Voronoi polygon, which in turn, in a d-dimensional input space, is proportional to the dth power of the mean distance |v - w_r| between receptive field center and stimulus. The probability P(r) is, on average, inversely related to the time interval Δ_r between successive such events. So we realize equation 2.6 by choosing as a learning step size in one learning step

ε_s(t) = ε_0 (|v - w_s|^d Δ_s)^(-m),    (2.8)

with s being the best-matching neuron for this stimulus. Should the data be given in a d-dimensional space, but span only a d_eff-dimensional submanifold, then the effective dimension d_eff will have to be used in equation 2.8. To avoid exceedingly large values of ε_s(t) that might destabilize the learning process, we also bound the learning step size according to ε_s(t) ≤ ε_max = 0.9. It should be noted at this point that the whole modification rests on applying the same learning step size ε_s associated with the winning neuron s to all weight changes Δw_r in the learning step. Had we used the individual ε_r for the change Δw_r, information about the individualized learning steps would not be transferred to the neighboring neurons. No change of the magnification would result; instead each w_r would fluctuate on an individual scale about its equilibrium value. How does the changed learning rule, equations 2.5 and 2.6, affect the previously derived magnification exponent μ = 2/3? By a calculation analogous to the original derivation by Ritter and Schulten, the relation
M(w) ∝ P(w)^μ = P(w)^((2/3)(1+m))    (2.9)
for the modified exponent could be established.¹ In the following sections we will see, in detail, how this relation can be exploited to induce the optimization of various map performance measures by a suitable choice of the control parameter m.

3 Results of Simulations

3.1 Magnification Exponents in One-Dimensional SOFMs. How does the analytical relation 2.9, based on equations 2.5 and 2.6, compare to numerically obtained maps, which have to rely on equation 2.8 approximating equation 2.6? Results for simulations of one-dimensional maps with a linearly increasing stimulus density are depicted in Figure 1; they coincide very well with relation 2.9. Also for other stimulus distributions [P(v) ∝ √v and P(v) ∝ v²], the numerically obtained exponents were found to coincide very nicely with those given by equation 2.9.

3.2 Information Transmission in SOFMs. As mentioned above, a map with optimal information transmission is characterized by μ = 1, such that the resulting probability P(r) for output nodes r to be excited is constant across the whole map. To demonstrate that our algorithm can deliver such maps we investigated our numerically obtained maps with regard to their information content

I = - Σ_{s=1}^{N} P(s) log P(s).    (3.1)
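A minimal simulation sketch of the magnification-controlled map together with the information measure of equation 3.1. All parameter values, the unit proportionality constants assumed in the step-size rule of equation 2.8, and the helper names are our illustrative choices:

```python
import numpy as np

def train_controlled_sofm(m, N=50, steps=20000, eps0=0.1,
                          sigma=1.0, eps_max=0.9, seed=0):
    """Train a 1-D map with node-dependent step sizes (equations 2.5-2.8)
    on the linearly increasing stimulus density P(v) = 2v (d = 1)."""
    rng = np.random.default_rng(seed)
    w = np.sort(rng.random(N))     # ordered start, to focus on magnification
    last_win = np.zeros(N)         # time of each node's last win
    r = np.arange(N)
    for t in range(1, steps + 1):
        v = np.sqrt(rng.random())                       # samples P(v) = 2v
        s = int(np.argmin(np.abs(w - v)))               # equation 2.1
        dist = max(abs(v - w[s]), 1e-12)                # |v - w_s|
        delta = t - last_win[s]                         # interval Delta_s
        eps = min(eps0 * (dist * delta) ** (-m), eps_max)   # equation 2.8
        last_win[s] = t
        h = np.exp(-(r - s) ** 2 / (2 * sigma ** 2))    # equation 2.3
        w += eps * h * (v - w)                          # equation 2.5
    return w

def information(counts):
    """Information content I = -sum_s P(s) log P(s) (equation 3.1)."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Estimate P(s) for a trained map by drawing fresh stimuli; for m = 0.5
# (mu = 1) the win probabilities should flatten, pushing I toward its
# maximal value log N.
w = train_controlled_sofm(m=0.5)
v = np.sqrt(np.random.default_rng(1).random(20000))
wins = np.bincount(np.argmin(np.abs(w[None, :] - v[:, None]), axis=1),
                   minlength=len(w))
print(information(wins))
```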
As shown in Figure 2a, I becomes maximal for m = 0.5, as should be expected for a map with magnification exponent μ = 1.
3.3 Distortion Errors in SOFMs. Apart from the maximum transfer of information, an important performance measure for maps used as neighborhood preserving vector quantizers is their mean distortion error
E_p = ∫ |w_s - v|^p P(v) dv.    (3.2)
As was proven by Zador (1982), for a vector quantizer (or neural map in the present context) operating on d-dimensional data points, E_p is minimized if the map obeys equation 2.4 with an exponent

μ = d/(d + p)    (3.3)

¹Corrections to this exponent due to finite size neighborhood widths, or general neighborhood functions, were studied in recent contributions by Ritter (1990) and Dersch and Tavan (1995). Their results have no direct impact on our present arguments. We note, however, that their more general results could be combined with ours, amounting to a multiplication of their magnification exponents with a factor of (1 + m).

Magnification Factor of Feature Maps

Figure 1: Magnification exponent μ as a function of magnification parameter m. At each value of m, three one-dimensional maps were simulated [0 < v < 1, P(v) = 2v, N = 50 neurons]. The resulting exponents μ_num are indicated by crosses; the line shows the analytic relation (2.9).
First we note that the unmodified one-dimensional SOFM optimizes the E_{1/2} error, a rather exotic error measure. Our magnification control mechanism now opens the possibility of optimizing more standard distortion errors, like the mean square error, or the mean linear error. In a one-dimensional input space, these should be minimal for μ = 1/3 (m = −0.5) and μ = 1/2 (m = −0.25), respectively. Figure 2b and c shows that these distortion errors are indeed minimized for these values of m (with a slight deviation of m = −0.2 instead of m = −0.25 for the linear error). In addition, the worst case error E_max can also be optimized. Minimization of E_max requires all receptive fields to be of identical size, i.e., μ = 0. Figure 2d shows, as can be expected from equation 2.9, that the choice of m = −1 indeed achieves a minimization of E_max. Analogous simulations showed that the above-mentioned distortion errors of two-dimensional SOFMs are also minimized for the respective values of m resulting from equations 3.3 and 2.9.
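Both performance measures can be estimated for any given codebook by Monte Carlo sampling. The helper below is our own sketch, not code from the paper: it evaluates equation 3.1 from empirical win frequencies and equation 3.2 from sampled quantization errors, here for the density P(v) = 2v used in Figure 1; the function name and parameters are assumptions.

```python
import numpy as np

def evaluate_map(w, p=2.0, samples=100000, seed=1):
    """Monte-Carlo estimates of the information content
    I = -sum_s P(s) log P(s) (eq. 3.1) and of the distortion
    E_p = integral |w_s - v|^p P(v) dv (eq. 3.2) for a 1-D codebook w,
    with stimuli drawn from P(v) = 2v on (0, 1)."""
    rng = np.random.default_rng(seed)
    v = np.sqrt(rng.random(samples))                 # inverse-CDF sampling
    dist_all = np.abs(w[None, :] - v[:, None])       # distances to all nodes
    s = np.argmin(dist_all, axis=1)                  # best-matching nodes
    prob = np.bincount(s, minlength=len(w)) / samples
    prob = prob[prob > 0.0]
    info = -np.sum(prob * np.log(prob))              # entropy in nats
    e_p = np.mean(dist_all[np.arange(samples), s] ** p)
    return info, e_p
```

A codebook placed at the quantiles of P(v) makes all neurons (roughly) equiprobable, so its information content should approach the upper bound I_max = log N, while a coarser codebook must show a larger distortion.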
Figure 2: Information content I, mean linear distortion E_1, mean square distortion E_2, and worst case distortion E_max as a function of the magnification parameter m (one-dimensional SOFMs, N = 50 nodes, ε annealed from 0.5 to 0.001, σ annealed from 10.0 to 0.1, each cross denotes an average of three maps). As expected from equation 2.9, I is maximized by m = 0.5 [the maximally possible value of I_max = log(50) = 3.912 for an ideal map is indicated by the dotted line]. E_1 is minimized for m = −0.2, slightly off the theoretical value m = −0.25. The minima for E_2 and E_max are attained at the theoretical values m = −0.5 and m = −1, respectively.
3.4 Inverted Magnification and the "Perceptual Magnet Effect" in Two-Dimensional SOFMs. Finally we investigated the regime of negative magnification exponents μ < 0, which we would like to suggest as a possible neurobiological basis for categorical perception, and in particular for the "perceptual magnet" effect observed by Kuhl et al. (1991, 1992). In several psychophysical experiments these authors established (1) that some versions of synthetically generated vowels are perceived to be more typical than others (prototypicality), (2) that the discrimination capability for vowel prototypes is smaller than for nonprototypes ("perceptual magnet" effect), and (3) that the position of the prototypes in vowel space depends on the language surroundings in which children grow up (adaptivity). In the latter experiment, the positions of the
prototypes were noticeably different at an age of 6 months, well before language comprehension. Let us make several assumptions on how this effect could be implemented in a neural system. First, guided by the abundance of topographic organization in all sensory modalities, we assume that a low-level representation of the sounds perceived in these experiments (vowels) is also based on a topographic map. Second, as a consequence of the above-mentioned adaptivity, we assume that the map is self-organized by external stimulation. In other words, we suggest that the "perceptual magnet" effect ought to be discussed in the framework of map self-organization algorithms. Next, we assume that versions of vowels that are perceived as near-prototypical occur more often in a language environment than versions perceived as nonprototypical. Finally, we assume that different, but similar vowels are easier to distinguish if their representations in the map are further apart. The latter two assumptions are quite reasonable, yet they contain the challenge for the map framework. "Perceptual magnet" in a map then means that regions of frequent stimulation have to be magnified to a smaller degree than regions of rare stimulation. This corresponds to a negative magnification exponent. In conclusion, the "perceptual magnet" effect could nicely be interpreted as a map formation phenomenon, provided one can generate maps with inverted magnification ratios. We performed numerical simulations to find this regime in two-dimensional maps. The choice of two-dimensional maps for the representation of vowels is plausible considering that vowels occur as clusters in the two-dimensional space spanned by the two formants (see, e.g., Morgan and Scofield 1991). Yet, our magnification control scheme could operate in output spaces of other dimensionality as well. As stimuli we chose points v = (v_x, v_y) in the unit square, i.e., with 0 < v_x, v_y < 1, which were drawn according to the probability distribution
with v_{x,0} = v_{y,0} = 1/3, σ_v = 8. So the stimulus density was a gaussian, located in the left lower center of the unit square, in front of a constant background (see Fig. 3a). In three simulations this input space was mapped onto quadratic output spaces (see Fig. 3b-g). Depending on the exponent m of the local adaptability, the resulting maps provided a higher resolution of the gaussian peak (Fig. 3b and c), they equilibrated the resolution over the whole input space (Fig. 3d and e), or they decreased the resolution in the region of the peak (Fig. 3f and g). At this point it should be noted that the results on magnification for one-dimensional maps can, under certain conditions, be transferred to two-dimensional maps. The maps have to be organized on a rectangular lattice [which is not to be elongated too much, see Van Velzen (1994)], and
Figure 3: Maps from the two-dimensional input space 0 ≤ x, y ≤ 1 with a stimulus distribution P(v) exhibiting a peak in front of a background (a) onto a grid of 25 × 25 neurons. (b,d,f) The receptive field center distribution of the maps in input space for m = 0, −1, −2, respectively. (c,e,g) The corresponding local generalization capabilities, as given by the size of the input space regions that map onto the respective neuron [i.e., by the inverse of the local receptive field center density M(w)].
the stimulus density has to separate [P(v_x, v_y) = P_x(v_x) P_y(v_y)]. Then the magnification properties along the two directions also separate and can be treated as two one-dimensional problems (Ritter and Schulten 1986).
4 Discussion
Some of the ideas presented in this paper relate to the problem of left-out codebook vectors, which can occur in adaptive vector quantization algorithms. Here the most common design objective is the minimization of the mean square distortion error E_2. It is a plausible (though not exact, see below) assumption that rarely used codebook vectors do not sufficiently contribute to the minimization of E_2 and, therefore, should be brought into play by an equilibration of excitation probabilities. To implement this heuristic idea, several strategies have been developed, which can be categorized according to two criteria. The first criterion is the observable on which the localized adaptation is based. Some algorithms rely on an evaluation of the individual excitation probabilities. These include DeSieno's conscience mechanism (1988) as well as Ahalt et al.'s frequency-sensitive competitive learning (FSCL, 1990). Others exploit measures for the local reconstruction errors. The latter include Kim and Ra (1995) and Chinrungrueng and Sequin (1995). Our present algorithm relies on measures for both, the excitation probability and the local deviations. A second criterion is the way the equilibration is achieved in the algorithm. In many cases, including DeSieno's, Ahalt et al.'s, and Chinrungrueng and Sequin's algorithms, a weighted distance measure is used, which depends on either the excitation probability (such that often excited nodes get a punishing factor for their distance measure) or the local deviations. Whereas the FSCL mechanism achieves magnification exponents relatively close to μ = 1, the approach based on local deviations is considerably worsened by inhomogeneity of the input distribution. Moreover, since a distorted distance measure is not compatible with topology preservation, an implementation of such an equilibration for topographic maps should rest on an adaptation of the local learning step sizes.
This track is followed in the present paper, as well as (in a different context) in Kim and Ra's algorithm. Depending on the details of the different implementations, analytic criteria on what is optimized by these different algorithms are often lost. Even though a numerical improvement of performance with regard to E_2 is generally observed, no mathematical reason can be given for this improvement [with the exception of Chinrungrueng and Sequin (1995)]. Also, an equilibration of excitation probabilities corresponds to a minimization of E_2 only in the limit of large input dimensions. In contrast, the magnification exponents μ used in the present approach to parameterize the map behavior can rigorously be related to the maximum of information transfer at μ = 1 as well as to the minima of distortion errors E_p at μ = d/(d + p). A second point we would like to comment on is the possible transfer of our results to other map formation algorithms, like, e.g., the elastic net. For the elastic net, Gruel and Schuster (1994) calculated a magnification exponent of μ = 0.4 in one limiting case (soft string tension). The two
quantities entering our scheme, namely the probability of each neuron to be best-matching, and the local density of RFs as indicated by the degree of match between RFs and stimulus, can also be evaluated in the elastic net algorithm. Then the learning rate could be adjusted in an analogous fashion, as in the present paper, and the magnification properties of the elastic net could also be made subject of a control. The connection suggested in this paper between feature map self-organization and categorical perception and the "perceptual magnet" effect is also not specific for the SOFM. Essential for the "magnet" effect was that the representations of two sounds can be distinguished in the map. In the SOFM, with hard competition, two sounds can be distinguished whenever they are mapped onto two different neurons in the map. This seems to be a quite strict model assumption. In a map where the resulting excitation pattern is based on soft competition, the representations of two sounds could possibly overlap, making a distinction less straightforward. However, it is reasonable to assume that also in such maps the discrimination is improved with increasing distance between the centers of the excitation patterns. So our argument about an inverted magnification as the basis for the "perceptual magnets" is not influenced by the nature of the lateral competition in the map. Finally, we want to discuss the relevance of our model to the auditory recognition experiments mentioned in Section 3.4 and to categorical perception. The proposed model owes several deficiencies to its simplicity. Since by the feature map model only one level of auditory perception was picked out, the model cannot account for, e.g., preprocessing of stimuli or context effects.
Besides, the collective action of those auditory models will have implications for the formulation of a categorization part, e.g., a generalization to maps with more than one center of activation per stimulus or with specific lateral connectivity. Whereas these problems are beyond the scope of the present investigation, we should briefly discuss an alternative setup used for classification in artificial neural systems. If a single gaussian unit is provided per prototype vowel, its activation will change only slightly close to the prototype, but steeply at the flanks corresponding to nonprototypical stimuli (R. Baddeley, personal communication 1994). Another layer could perform the discrimination relying on the difference in activation of the vowel unit. Generally, the distinction between the SOFM model and a gaussian unit or an effectively similar group of cells is a relative one. The neighborhood function in the Kohonen algorithm also has gaussian shape and can be interpreted as the probability of activation of a neuron. On the other hand, the gaussian units have to become spatially arranged. The latter aspect of the problem, which relates to various experimental evidence (Kuhl 1991; Repp 1984), seemed to us the more interesting one since it can explain artificially produced "perceptual magnets" using synthetic stimuli, the occurrence of categorical perception in several types of nonspeech stimuli, the possibility to unlearn categorical perception in single auditory modalities, and
the degradation from the continuous innate discrimination abilities. Further, the "grandmother cell" idea behind the single unit model renders it useful in technical systems with a well-defined class of prototypes and nonnoisy processing elements. A similar lack of incorporating the virtues of low-dimensional topological features is also present in models using attractor neural networks as a neuronal basis for "perceptual magnets" (Gupta 1993). To summarize, we have considered the virtues of a modified SOFM that allows control of the magnification factor of the resulting map. In this way the range of applicability of such self-organization models with regard to map formation in sensory systems is extended. Our results are also relevant for categorization in neural network models.
Acknowledgments

It is a pleasure to acknowledge interesting discussions with Zhao-Ping Li, which brought P. Kuhl's experiments and the perceptual magnet effect to our attention. The reported results are partially based on work done in the LADY project sponsored by the German Federal Ministry of Research and Technology under Grant 01 IN 106B/3. H.-U.B. gratefully acknowledges support from the DFG through Sonderforschungsbereich 185 Nichtlineare Dynamik, TP E3. M.H. received support from the EC HCM network Principles of Optical Computation.
References

Ahalt, S. C., Krishnamurthy, A. K., Chen, P., and Melton, D. E. 1990. Competitive learning algorithms for vector quantization. Neural Networks 3, 277-290.
Chinrungrueng, C., and Sequin, C. H. 1995. Optimal adaptive k-means algorithm with dynamic adjustment of learning rate. IEEE Trans. Neural Networks 6, 157-169.
Daniel, P. M., and Whitteridge, D. 1961. The representation of the visual field on the cerebral cortex in monkeys. J. Physiol. 159, 203-221.
Der, R., and Herrmann, M. 1992. Attention based partitioning. In Status Seminar des BMFT Neuroinformatik, M. van der Meer, ed., pp. 441-446. DLR, Berlin.
Dersch, D. R., and Tavan, P. 1995. Asymptotic level density in topological feature maps. IEEE Trans. Neural Networks 6, 230-236.
DeSieno, D. 1988. Adding a conscience to competitive learning. IEEE Int. Conf. Neural Networks, Washington, I-118-124.
Durbin, R., and Willshaw, D. 1987. An analogue approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689-691.
Gruel, J. C., and Schuster, H. G. 1994. Self organizing feature maps: Magnification exponents of elastic net and winner relaxing Kohonen algorithms. Neural Comp. (submitted).
Gupta, P. 1993. Investigating phonological representations: A modeling agenda. In Proceedings of the 1993 Connectionist Models Summer School, M. C. Mozer, ed. Lawrence Erlbaum, Hillsdale, NJ.
Herrmann, M., Bauer, H.-U., and Der, R. 1994. The "perceptual magnet" effect: A model based on self-organizing feature maps. In Neural Computation and Psychology, Stirling, 1994, L. S. Smith and P. J. B. Hancock, eds., pp. 107-116. Springer, London.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Kaas, J. H., Nelson, R. J., Sur, M., and Merzenich, M. M. 1981. Organization of somatosensory cortex in primates. In The Organization of the Cerebral Cortex, F. O. Schmitt, F. G. Worden, G. Adelman, and S. G. Dennis, eds., pp. 237-261. MIT Press, Cambridge, MA.
Kim, Y. K., and Ra, J. B. 1995. Adaptive learning method in self-organizing map for edge preserving vector quantization. IEEE Trans. Neural Networks 6, 278-280.
Kohonen, T. 1995. The Self-Organizing Map. Springer, Berlin.
Kuhl, P. K. 1991. Human adults and human infants show a "perceptual magnet" effect for the prototypes of speech categories, monkeys do not. Percept. Psychophys. 50(2), 93-107.
Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., and Lindblom, B. 1992. Linguistic experience alters phonetic perception in infants by 6 months of age. Science 255, 606-608.
von der Malsburg, C. 1973. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14, 81-100.
Martinetz, T. M., Ritter, H., and Schulten, K. 1988. Kohonen's self-organizing map for modeling the formation of the auditory cortex of a bat. SGAICO Proc. "Connectionism in Perspective," Zurich, 403-412.
Morgan, D. P., and Scofield, C. L. 1991. Neural Networks and Speech Processing. Kluwer Academic, Boston.
Obermayer, K., Ritter, H., and Schulten, K. 1990. A principle for the formation of the spatial structure of cortical feature maps. Proc. Natl. Acad. Sci. U.S.A. 87, 8345-8349.
Repp, B. 1984. In Speech and Language: Advances in Basic Research and Practice, N. Lass, ed., Vol. 10, pp. 224-335. Academic Press, New York.
Ritter, H. 1990. Asymptotic level density of a class of vector quantization processes. IEEE Trans. Neural Networks 1, 173-175.
Ritter, H., and Schulten, K. 1986. On the stationary state of Kohonen's self-organizing sensory mapping. Biol. Cybern. 54, 99-106.
Ritter, H., Martinetz, T., and Schulten, K. 1992. Neural Computation and Self-Organizing Maps. Addison-Wesley, Reading, MA.
Suga, N. 1991. Cortical computational maps for auditory imaging. Neural Networks 3, 3-22.
Van Velzen, G. A. 1994. Instabilities in Kohonen's self-organizing feature map. J. Phys. A 27, 1665-1681.
Wässle, H., Grünert, U., Röhrenbeck, J., and Boycott, B. B. 1989. Cortical magnification factor and the ganglion cell density of the primate retina. Nature (London) 341, 643-646.
Wolf, F., Bauer, H.-U., and Geisel, T. 1994. Formation of field discontinuities and islands in visual cortical maps. Biol. Cybern. 70, 525-531.
Zador, P. L. 1982. Asymptotic quantization error of continuous signals and the quantization dimension. IEEE Trans. Inf. Theory 28, 149-159.
Received March 6, 1995; accepted October 17, 1995.
Communicated by Geoffrey Goodhill
Semilinear Predictability Minimization Produces Well-Known Feature Detectors

Jürgen Schmidhuber
Martin Eldracher
IDSIA, Corso Elvezia 36, 6900 Lugano, Switzerland

Bernhard Foltin
Fakultät für Informatik, TUM, 80290 München, Germany

Predictability minimization (PM; Schmidhuber 1992) exhibits various intuitive and theoretical advantages over many other methods for unsupervised redundancy reduction. So far, however, there have not been any serious practical applications of PM. In this paper, we apply semilinear PM to static real world images and find that without a teacher and without any significant preprocessing, the system automatically learns to generate distributed representations based on well-known feature detectors, such as orientation-sensitive edge detectors and off-center-on-surround detectors, thus extracting simple features related to those considered useful for image preprocessing and compression.

1 Introduction
Redundancy reduction is widely regarded as an important goal of unsupervised learning (see, e.g., Barlow et al. 1989; Atick et al. 1992; Schmidhuber 1994; compare also Linsker 1988; Foldiak 1990; Deco and Obradovic 1996; Miller 1994; Field 1994). But how is this goal to be achieved in a massively parallel, local, efficient, and perhaps even biologically plausible way?

1.1 Predictability Minimization (PM). The simple approach in this paper is based on the recent principle of predictability minimization (PM) (Schmidhuber 1992). A feedforward network with n output units (or code units) sees input patterns with redundant components. Its goal is to respond with informative but less redundant output patterns, ideally by creating a factorial (statistically nonredundant) code of the input ensemble (Barlow et al. 1992). The central idea of PM is that for each code unit, there is a predictor network that tries to predict the code unit from the remaining n − 1 code units. But each code unit tries to become as unpredictable as possible, by representing environmental properties that are independent from those represented by other code units. Predictors and code units coevolve by fighting each other (see details in Section 2). Neural Computation 8, 773-786 (1996)
@ 1996 Massachusetts Institute of Technology
774
J. Schmidhuber, M. Eldracher, and B. Foltin
PM has the following potential advantages over other methods: (1) Unlike certain inherently sequential methods (e.g., Rubner and Schulten 1990), PM can be implemented in a parallel way. (2) Unlike, e.g., with Barrow's model (1987), there may be many simultaneously active code units (multiple "winners" instead of single "winners"), as long as they represent different aspects of the environment (distributed coding instead of local coding). (3) Unlike, e.g., with Linsker's INFOMAX (1988), there is no need to compute the derivatives of determinants of covariance matrices. (4) Unlike, e.g., Deco and Obradovic's system (1996), Foldiak's system (1990), Rubner and Tavan's system (1989), and anti-Hebbian systems in general, PM requires neither time-consuming settling phases (due to recurrent connections) nor analytic computation of weight vectors. (5) Unlike almost all other methods, PM has a potential to discover nonlinear redundancy in the input data, and to generate appropriate redundancy-free codes. (6) Unlike most other "neural" methods (see references above), existing variants of PM create binary codes as opposed to continuous codes. This allows for easier posttraining analysis and facilitates the creation of statistically independent code components as opposed to merely decorrelated code components. Note that statistical independence implies decorrelation. But decorrelation does not imply statistical independence. Why are statistically independent code components of interest? An important reason is that for efficiency reasons, most statistical classifiers (e.g., Bayesian pattern classifiers) assume statistical independence of their input variables (corresponding to the pattern components). If we had a method that takes an arbitrary pattern ensemble and generates an equivalent factorial code, the latter could be fed into an efficient conventional classifier, which in turn could achieve its theoretically optimal performance.

1.2 Purpose of Paper.
Despite its potential advantages, PM has been tested on artificial data only (Lindstadt 1993; Schmidhuber 1993, 1994). For a more thorough experimental analysis, in this paper we study the question: what happens if we apply a computationally simple, entirely local, highly parallel, and even biologically plausible variant of PM to real world images? An intuitively reasonable first step toward representing images in a less redundant way (and one adopted by standard image processing techniques, but apparently also by early visual processing stages of biological systems) is to build compact representations based on information about boundaries (edges) between areas with nonvarying, redundant pixel activations. Since PM aims at generating codes with reduced redundancy, we may expect it to discover related ways of coding visual scenes, by creating feature detectors responsive to edges or similar informative features in the input scenes. Moreover, since edge detectors (as well as other, related useful feature detectors such as on-center-off-surround detectors) can be implemented with a single layer of neuronal units, we already may expect a single layer system to come up with
Predictability Minimization
775
such detectors. This paper reports a confirmation of this expectation, thus demonstrating that PM makes sense not only intuitively and in theory, but also in practical applications. The results encourage us to expect that the method also will be beneficial for large-scale applications, by extracting more sophisticated, nonlinear, useful features in deeper layers. Due to our current hardware limitations, however, a test of this hypothesis is left for future research. Section 2 briefly reviews the principles of PM in more detail. Section 3 applies the technique to real world images and presents results.

2 Predictability Minimization: Details

In its simplest form, PM is based on a feedforward network with n sigmoid output units (or code units) (see Fig. 1). The ith code unit produces a real-valued output value y_i^p ∈ [0, 1] (the unit interval) in response to the pth external input vector x^p (later we will see that training tends to make the output values near-binary). There are n additional feedforward nets called predictors, each having one output unit and n − 1 input units. The predictor for code unit i is called P_i. Its real-valued output in response to the {y_k^p : k ≠ i} is called P_i^p. P_i is trained (in our experiments by conventional online backprop) to minimize

Σ_p (P_i^p − y_i^p)²    (2.1)
thus learning to approximate the conditional expectation E(y_i | {y_k : k ≠ i}) of y_i, given the activations of the remaining code units. Of course, this conditional expectation typically will be very different from the actual activations of the code unit. For instance, assume that a certain code unit will be switched on in one-third of all cases within a given context (defined by the activations of the remaining code units), while it will be switched off in two-thirds of all such cases. Then, given this context, the predictor will predict a value of 0.3333. The clue is: the code units are trained (in our experiments by online backprop) to maximize essentially the same objective function (Schmidhuber 1992) the predictors try to minimize:

V_C = Σ_i Σ_p (P_i^p − y_i^p)²    (2.2)
Predictors and code units coevolve by fighting each other.

2.1 Justification. Let us assume that the predictors never get trapped in local minima and always learn the conditional expectations. It then turns out that the objective function V_C is essentially equivalent to the
Figure 1: Predictability minimization (PM): input patterns with redundant components are coded across n code units (gray). Code units are also input units of n predictor networks. Each predictor (output units black) attempts to predict its code unit (which it cannot see). But each code unit tries to escape the predictions, by representing environmental properties that are independent from those represented by other code units. This encourages high information throughput and redundancy reduction. Predictors and code generating net may have hidden units. In this paper, however, they do not. See text for details.
following one (also given in Schmidhuber 1992):

V_C ≡ Σ_i { VAR(y_i) − E[(E(y_i | {y_k : k ≠ i}) − ȳ_i)²] }    (2.3)

where ȳ_i denotes the mean activation of unit i, and VAR denotes the variance operator (expectations taken over the pattern ensemble). The equivalence of 2.2 and 2.3 was observed by Peter Dayan, Richard Zemel, and Alex Pouget (personal communication, SALK Institute, 1992; see Schmidhuber 1993 for details). Equation 2.3 gives some intuition about what is going on while 2.2 is maximized. Maximizing the first term of 2.3 tends to enforce binary units, and also local maximization of information throughput (given the binary constraint). Maximizing the second (negative) term (or minimizing the corresponding unsigned term) tends to make the conditional expectations equal to the unconditional expectations, thus encouraging mutual statistical independence (zero mutual information) and global maximization of information throughput.
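The equivalence rests on the law of total variance, E[(y − E(y|z))²] = VAR(y) − VAR(E(y|z)). The small numeric check below is our own illustration, not from the paper; it verifies the identity on an empirical ensemble, with the conditional expectation estimated by per-context group means.

```python
import numpy as np

# Empirical check of the decomposition behind eq. 2.3: the mean squared
# prediction error of the conditional expectation equals the variance of y
# minus the variance of the conditional expectation itself.
rng = np.random.default_rng(0)
z = rng.integers(0, 3, 10000)                       # "context" variable
y = ((z == 1) & (rng.random(10000) < 0.7)).astype(float)
cond = np.array([y[z == k].mean() for k in range(3)])[z]  # E[y | z]
lhs = np.mean((y - cond) ** 2)
rhs = y.var() - cond.var()
assert abs(lhs - rhs) < 1e-9
```

A perfect predictor therefore leaves V_C = Σ_i [VAR(y_i) − VAR(E(y_i | rest))]: the code units can only raise it by increasing their variances or by making their conditional expectations collapse onto the unconditional means.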
3 Application: Image Processing
We apply PM to static black and white images of driving cars (see Fig. 2). Each image is divided into 566 x 702 square pixels. Each pixel can take on 16 different gray levels represented as integers scaled around zero. 3.1 Input Generation. There is a circular "input area." Its diameter is 64 pixel widths. There are 32 code units. For each code unit, there is a "bias input unit" with constant activation 1.0, and a circular receptive field of 81 evenly distributed additional input units. The diameter of each receptive field is 20 pixel widths. Receptive fields partly overlap. The positions of code units and receptive fields relative to the input area are fixed (see Fig. 3). The rotation of the input area is chosen randomly. Its position is chosen randomly within the boundaries of the image. The activation of an input unit is the average gray level value of the closest pixel and the four adjacent pixels (see Fig. 4).
3.2 Learning: Heuristic Simplifications. To achieve extreme computational simplicity (and also biological plausibility), we simplify the general method from Section 2. Heuristic simplifications are as follows: (1) No error signals are propagated through the predictor input units down into the code network. (2) We focus on semilinear networks as opposed to general nonlinear ones (no hidden units within predictors and code generating net; see Fig. 1). (3) Predictors and code units learn simultaneously (also, each code unit sees only part of the total input). These simplifications make the method local in both space and time: to change the weight of any predictor connection, we can use the simple delta rule, which needs to know only the current activations of the two connected units, and the current activation of the unit to be predicted: in response to input pattern x^p, each weight w of predictor P_i changes according to

Δw = η_P (y_i^p − P_i^p) a_w^p

where η_P is a positive constant and a_w^p is the activation of the code unit feeding w. Likewise, to change the weight of any connection to a code unit, we need to know only the current activations of the two connected units, and the current activation of the corresponding predictor output unit: in response to input pattern x^p, each weight v leading to code unit i changes according to

Δv = η_C (y_i^p − P_i^p) y_i^p (1 − y_i^p) a_v^p

where η_C is a positive constant and a_v^p is the activation of the input unit feeding v.
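A minimal numpy sketch of these two local rules, under our reading of the text (a logistic code unit, a linear predictor, and variable names of our own), is:

```python
# Sketch of one PM update step for a single code unit i and its predictor P_i.
# All names and the exact update forms are our reconstruction, not the paper's.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = rng.normal(size=82)          # receptive-field inputs incl. the bias unit
v = rng.normal(size=82) * 0.1    # weights into code unit i
w = rng.normal(size=31) * 0.1    # predictor weights from the other code units
others = rng.uniform(size=31)    # activations of the other code units
eta_p, eta_c = 0.1, 0.001        # predictor / code-unit learning rates

y = sigmoid(v @ x)               # code unit activation
p = w @ others                   # linear predictor output (cannot see y)

# Predictor: plain delta rule, minimizing (y - p)^2.
w += eta_p * (y - p) * others
# Code unit: gradient ascent on (y - p)^2, treating p as a constant
# (no error signals flow back through the predictor).
v += eta_c * (y - p) * y * (1.0 - y) * x
```

Note that both updates need only locally available quantities: the activations of the two connected units plus the prediction error (y - p), which is what makes the simplified method local in space and time.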
J. Schmidhuber, M. Eldracher, and B. Foltin
Figure 2: A typical image from the image data base.
3.3 Measuring Information Throughput. Unsupervised learning is occasionally switched off. Then the number N of pairwise different output patterns in response to 5000 randomly generated input patterns is determined (the activation of each output unit is taken to be 0 if below 0.05, 1 if above 0.95, and 0.5 otherwise). The success rate is defined by N/5000. Clearly, a success rate close to 1.0 implies high information throughput.
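The measurement procedure can be sketched directly (our names; the random responses stand in for the trained code units' actual outputs):

```python
# Sketch of the success-rate measurement: quantize each output pattern to
# {0, 0.5, 1} and count pairwise-different patterns among the responses.
import numpy as np

rng = np.random.default_rng(2)
responses = rng.uniform(size=(5000, 32))   # stand-in code-unit activations

def quantize(a):
    """0 below 0.05, 1 above 0.95, 0.5 otherwise."""
    return np.where(a < 0.05, 0.0, np.where(a > 0.95, 1.0, 0.5))

patterns = {tuple(row) for row in quantize(responses)}
success_rate = len(patterns) / 5000
```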
3.4 Results. Figure 5 plots success rate against number of training pattern presentations. Results are shown for various pairs of predictor learning rates η_P and code unit learning rates η_C. For instance, with η_P close to 1.0 and η_C being one or two orders of magnitude smaller, high success rates are obtained. Although the learning rates do have an influence on learning speed, the basic shapes of the learning curves are similar.
Figure 3: Small circles represent partly overlapping receptive fields of code units. Their positions are shown relative to the input area (gray). See Figure 4 for details.

3.5 Edge Detectors. With the set-up described above, to maximize information throughput, the system tends to create orientation-sensitive edge detectors in an unsupervised manner. Weights corresponding to a typical receptive field (after 5000 pattern presentations) are shown in Figure 6. The connections are divided into two groups, one with inhibitory connections and the other with excitatory connections. Both groups are separated by a "fuzzy" axis through the center of the receptive field. Its rotation angle determines the alignment of the edge provoking maximal response. In general, receptive fields of different code units exhibit different rotation angles (see Fig. 7). Obviously, to represent the inputs in an informative but compact, efficient, redundancy-poor way, the creation of feature detectors specializing on certain rotated edges proves useful.
Figure 4: Details of a receptive field inside the input area. Small circles indicate pixel positions. Crosses indicate positions of rotated input units. The activation of each input unit is the average gray level value of the closest pixel and the four adjacent pixels (indicated by dotted lines).
It is noticeable that in Figure 6 there appears to be a smooth gradient of weight strengths from strongly positive to strongly negative as one moves perpendicular to the positive/negative border. This is different from, e.g., MacKay and Miller (1990), where all weights tend to zero at the edge of the receptive field. At the moment, we do not have a good explanation for this effect.

3.6 On-Center-Off-Surround/Off-Center-On-Surround. The nature of the receptive fields partly depends on receptive field size and degree of overlap. For instance, with nearly 200 input units per field and a more symmetric arrangement of receptive field centers (essentially, on
Figure 5: Success rate (measuring information throughput) plotted against number of training pattern presentations (logarithmic scale). Results are shown for various pairs of predictor learning rates η_P and code unit learning rates η_C. (a) η_P = 0.001, η_C = 0.00004; (b) η_P = 0.01, η_C = 0.00011; (c) η_P = 0.1, η_C = 0.005; (d) η_P = 1.0, η_C = 0.0042.
the circular boundary of each field there are 6 other field centers), the system tends to generate another well-known kind of feature detector: the weight patterns become either on-center-off-surround-like or off-center-on-surround-like structures (see Fig. 8 for an example).

4 Discussion
We do not claim that PM is the only parallel method (as opposed to sequential methods, e.g., Rubner and Schulten 1990) that can lead to well-known feature detectors. For instance, in case of gaussian input distributions, Linsker's linear approach (Linsker 1986a,b) for single output units also generates certain kinds of orientation-sensitive fields (see also MacKay and Miller 1990). This holds for more structured input
Figure 6: Weights corresponding to a typical receptive field (after 5000 pattern presentations). Bright (dark) circles represent positive (negative) weights. The connections are divided into two groups, one with inhibitory connections and the other with excitatory connections. Both groups are separated by a "fuzzy axis" (black division line) through the center of the receptive field. Its rotation angle determines the alignment of the edge provoking maximal response.
data as well (Linsker, personal communication, 1994). In case of multiple code units, however, to prevent different code units from representing the same information, Linsker's INFOMAX approach (1988) requires computing the derivatives of determinants of covariance matrices, which is computationally expensive and also biologically implausible [the multiple cell approach presented in Linsker (1986b) does not have certain problems of his later infomax approach, but it also does not have that nice theoretical foundation]. Finally, it is conceivable that Foldiak's system (1990), Rubner and Tavan's system (1989), and Deco and Obradovic's system (1996) might come up with similar edge detectors when applied to real world images. Unlike these approaches (and unlike other similar systems), however, our feedforward net requires neither time-consuming settling phases (due to recurrent "anti-Hebbian" connections for lateral inhibition) nor analytic computation of the weight vectors.

Figure 7: For all receptive fields, typical posttraining boundaries between inhibitory and excitatory weights are shown. Distances between field centers are "blown up" to avoid confusion caused by overlaps.

4.1 Future Research. We implemented a hierarchy of processing stages, each consisting of code modules and predictors as above. Each stage computes the input to the next stage. Preliminary tests again led to feature detectors causing high information throughput. However, unlike receptive fields of feature detectors observed in the first layer, receptive
Figure 8: Bright (dark) circles represent positive (negative) weights. With nearly 200 input units per field and a symmetric arrangement of receptive field centers (essentially, on the circular boundary of each field there are 6 other field centers), the weight patterns generated by the system tend to be either off-center-on-surround-like (see figure) or on-center-off-surround-like.
fields in higher layers appeared rather complex and did not exhibit any obvious structure. We would like to test the system on large data sets of real world scenes. We expect that this will lead to successively more complex and more specialized feature detectors, hopefully not only potentially useful for technical applications but also qualitatively related to those observed in biological systems. Another interesting experiment will be to add neighborhood relationships between the code units, to see whether this leads to automatic development of a smoothly varying map of orientation preference. Unfortunately, however, our current hardware equipment does not permit large-scale applications.
Acknowledgments

Thanks to Ralf Linsker, Gustavo Deco, and Peter Dayan for valuable comments.
References

Atick, J. J., Li, Z., and Redlich, A. N. 1992. Understanding retinal color coding from first principles. Neural Comp. 4, 559-572.
Barlow, H. B., Kaushal, T. P., and Mitchison, G. J. 1989. Finding minimum entropy codes. Neural Comp. 1(3), 412-423.
Barrow, H. G. 1987. Learning receptive fields. Proc. IEEE 1st Annu. Conf. Neural Networks IV, 115-121.
Deco, G., and Obradovic, D. 1996. An Information-Theoretic Approach to Neural Computing. Springer, NY.
Field, D. J. 1994. What is the goal of sensory coding? Neural Comp. 6, 559-601.
Foldiak, P. 1990. Forming sparse representations by local anti-Hebbian learning. Biol. Cybern. 64, 165-170.
Lindstadt, S. 1993. Comparison of two unsupervised neural network models for redundancy reduction. In Proceedings of the 1993 Connectionist Models Summer School, M. C. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman, and A. S. Weigend, eds., pp. 308-315. Erlbaum Associates, Hillsdale, NJ.
Linsker, R. 1986a. From basic network principles to neural architecture: Emergence of orientation-selective cells. Proc. Natl. Acad. Sci. U.S.A. 83, 8779-8783.
Linsker, R. 1986b. From basic network principles to neural architecture: Emergence of spatial-opponent cells. Proc. Natl. Acad. Sci. U.S.A. 83, 8390-8394.
Linsker, R. 1988. Self-organization in a perceptual network. IEEE Comput. 21, 105-117.
MacKay, D. J. C., and Miller, K. D. 1990. Analysis of Linsker's simulation of Hebbian rules. Neural Comp. 2, 173-187.
Miller, K. D. 1994. A model for the development of simple cell receptive fields and the ordered arrangement of orientation columns through activity-dependent competition between on- and off-center inputs. J. Neurosci. 14(1), 409-441.
Rubner, J., and Schulten, K. 1990. Development of feature detectors by self-organization: A network model. Biol. Cybern. 62, 193-199.
Rubner, J., and Tavan, P. 1989. A self-organization network for principal-component analysis. Europhys. Lett. 10, 693-698.
Schmidhuber, J. H. 1992. Learning factorial codes by predictability minimization. Neural Comp. 4(6), 863-879.
Schmidhuber, J. H. 1993. Netzwerkarchitekturen, Zielfunktionen und Kettenregel. Habilitationsschrift, Institut für Informatik, Technische Universität München.
Schmidhuber, J. H. 1994. Neural predictors for detecting and removing redundant information. In Adaptive Behavior and Learning, H. Cruse, J. Dean, and H. Ritter, eds., number 9, pp. 135-145. Center for Interdisciplinary Research, Universität Bielefeld.

Received January 3, 1995; accepted August 29, 1995.
Communicated by Alan Yuille
Learning with Preknowledge: Clustering with Point and Graph Matching Distance Measures

Steven Gold
Department of Computer Science, Yale University, New Haven, CT 06520-8285 USA

Anand Rangarajan
Department of Diagnostic Radiology, Yale University, New Haven, CT 06520-8042 USA

Eric Mjolsness
Department of Computer Science and Engineering, University of California at San Diego (UCSD), La Jolla, CA 92093-0114 USA
Prior knowledge constraints are imposed upon a learning problem in the form of distance measures. Prototypical 2D point sets and graphs are learned by clustering with point-matching and graph-matching distance measures. The point-matching distance measure is approximately invariant under affine transformations (translation, rotation, scale, and shear) and permutations. It operates between noisy images with missing and spurious points. The graph-matching distance measure operates on weighted graphs and is invariant under permutations. Learning is formulated as an optimization problem. Large objectives so formulated (about one million variables) are efficiently minimized using a combination of optimization techniques: softassign, algebraic transformations, clocked objectives, and deterministic annealing.

1 Introduction
While few biologists today would subscribe to Locke's description of the nascent mind as a tabula rasa, the nature of the inherent constraints (Kant's preknowledge) that helps organize our perceptions remains much in doubt. Recently, the importance of such preknowledge for learning has been convincingly argued from a statistical framework (Geman et al. 1992). Several researchers have proposed that our minds may incorporate preknowledge in the form of distance measures (Shepard 1989; Bienenstock and Doursat 1991). The neural network community has begun to

Neural Computation 8, 787-804 (1996)
@ 1996 Massachusetts Institute of Technology
explore this idea via tangent distance (Simard et al. 1993) and model learning (Williams et al. 1993). However, neither of these distance measures has been invariant under permutation of the labeling of the feature points or nodes. Permutation-invariant distance measures must solve the correspondence problem, a computationally intractable problem fundamental to object recognition systems (Grimson 1990). Such distance measures may be better suited for the learning of the higher level, more complex representations needed for cognition. In this work, we introduce the use of more powerful, permutation-invariant distance measures in learning. The unsupervised learning of object prototypes from collections of noisy 2D point-sets or noisy weighted graphs is achieved by clustering with point-matching and graph-matching distance measures. The point-matching measure is approximately invariant under permutations and affine transformations (separately decomposed into translation, rotation, scale, and shear) and operates on point-sets with missing or spurious points. The graph-matching measure is invariant under permutations. These distance measures and others like them may be constructed using Bayesian inference on a probabilistic model of the visual domain. Such models introduce a carefully designed bias into our learning, which reduces its generality outside the problem domain but increases its ability to generalize within the problem domain. From a statistical viewpoint, outside the problem domain it increases bias while within the problem domain it decreases variance. The resulting distance measures are similar to some of those hypothesized for cognition. The distance measures and learning problem (clustering) are formulated as objective functions. Fast minimization of these objectives is achieved by a combination of optimization techniques: softassign, algebraic transformations, clocked objectives, and deterministic annealing.
Combining these techniques significantly increases the size of problems that may be solved with recurrent network architectures (Rangarajan et al. 1996). Even on single-processor workstations, nonlinear objectives with a million variables can be minimized relatively quickly (a few hours). With these methods we learn prototypical examples of 2D point-sets and graphs from randomly generated experimental data.

2 Relationship to Previous Clustering Methods
Clustering algorithms may be classified as central or pairwise (Buhmann and Hofmann 1994). Central clustering algorithms generally use a distance measure, like Euclidean or Mahalanobis, that operates on feature vectors within a pattern matrix (Jain and Dubes 1988; Duda and Hart 1973). These algorithms calculate cluster centers (pattern prototypes) and compute the distances between patterns within a cluster and the cluster center (i.e., pattern-cluster center distances). Pairwise clustering algorithms, in contrast, may use only the distances between patterns and
may operate on a proximity matrix (a precomputed matrix containing all the distances between every pair of patterns). Pairwise clustering algorithms need not produce a cluster center and do not have to recalculate distance measures during the algorithm. We introduce central clustering algorithms that employ higher-level distance measures. In the few cases where higher-level distance measures have been used in clustering (Kurita et al. 1994) they have all, to our knowledge, been employed in pairwise clustering algorithms, which used precomputed proximity matrices and did not calculate prototypes. Consequently, while classification was learned the exemplars were not. As is the case for central clustering algorithms, the algorithm employed here tries to minimize the cluster center-cluster member distances. However, because it uses complex distance measures it has an outer and inner loop. The outer loop uses the current values of the cluster center-cluster member distances to recompute assignments (reclassify). After reclassification, the inner loop recomputes the distance measures. The outer loop is similar to several other algorithms employing mean field approximations for clustering (Rose et al. 1990; Buhmann and Kuhnel 1993). It is also similar to fuzzy ISODATA clustering (Duda and Hart 1973), with annealing on the fuzziness parameter. The clustering algorithm used here is formulated as a combinatorial optimization problem; however, it may also be related to parameter estimation of mixture models using the maximum likelihood method (Duda and Hart 1973) and the expectation-maximization (EM) algorithm (Dempster et al. 1977; Hathaway 1986). The inner loop uses the newly discovered distance measures for point (Gold et al. 1995) and graph matching. In the following we will first describe these new distance measures and then show how they are incorporated in the rest of the algorithm.
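The outer/inner loop organization just described can be sketched in miniature. In this toy version (all names ours), plain squared Euclidean distance stands in for the complex point- and graph-matching measures, so the "inner loop" collapses to a single distance computation; the outer loop still alternates soft reclassification with center updates under deterministic annealing:

```python
# Toy sketch of the central clustering outer loop: soft memberships via
# softmax over clusters, coordinate-descent center updates, and annealing.
import numpy as np

def fit(X, B, iters=20, beta=1.0, anneal=1.1, rng=np.random.default_rng(3)):
    Y = rng.normal(size=(B, X.shape[1]))           # random initial centers
    for _ in range(iters):
        # Inner-loop stand-in: pattern-to-center squared distances.
        D = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        # Reclassify: soft memberships, normalized across clusters b.
        M = np.exp(-beta * D)
        M /= M.sum(axis=1, keepdims=True)
        # Update the cluster centers Y by coordinate descent.
        Y = (M.T @ X) / M.sum(axis=0)[:, None]
        beta *= anneal                              # deterministic annealing
    return Y, M

X = np.vstack([np.zeros((10, 2)), np.ones((10, 2)) * 5])  # two tight clusters
Y, M = fit(X, B=2)
```

Raising beta over the iterations hardens the initially fuzzy memberships, which is the annealing-on-fuzziness behavior the text compares to fuzzy ISODATA.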
3 Formulation of the Objective Functions
3.1 An Affine Invariant Point-Matching Distance Measure. The first distance measure quantifies the degree of dissimilarity between two unlabeled 2D point images, irrespective of bounded affine transformations, i.e., differences in position, orientation, scale, and shear. The two images may have a different number of points. The measure is calculated with an objective that can be used to find correspondence and pose for unlabeled feature matching in vision. Given two sets of points {X_j} and {Y_k}, one can minimize the following objective to find the affine transformation and permutation that best maps some points of X onto some points of Y:

E_pm(m, t, A) = Σ_{j=1}^{J} Σ_{k=1}^{K} m_jk ||X_j − t − A Y_k||² + g(A) − α Σ_{j=1}^{J} Σ_{k=1}^{K} m_jk   (3.1)
A is the affine transformation, which is decomposed into scale, rotation, and two components of shear as follows:

where

Sh(c) = ( cosh(c)  sinh(c) ; sinh(c)  cosh(c) )

and R(θ) is the standard 2 x 2 rotation matrix. g(A) serves to regularize the affine transformation by bounding the scale and shear components. m is a possibly fuzzy correspondence matrix that matches points in one image with corresponding points in the other image. The constraints on m ensure that each point in each image corresponds to at most one point in the other image. However, partial matches are allowed, in which case the sum of these partial matches may add up to no more than one. The inequality constraint on m permits a null match or multiple partial matches. [Note: simplex constraints on m, and its linear appearance in E_pm, imply that any local minimum of E_pm(m, A, t) occurs at a vertex of the m simplex. But m's trajectory can use the interior of the m simplex to avoid local minima in the optimization of A and t.] The α term biases the objective toward matches. The decomposition of A in the above is not required, since A could be left as a 2 x 2 matrix and solved for directly in the algorithm that follows. The decomposition just provides for more precise regularization, i.e., specification of the likely kinds of transformations. Also Sh(c) could be replaced by another rotation matrix, using the singular value decomposition of A. Then given two sets of points {X_j} and {Y_k} the distance between them may be defined as
D({X_j}, {Y_k}) = min_{m,t,A} { E_pm(m, t, A) | constraints on m }

This measure is an example of a more general image distance measure derived from mean field theory assumptions in Mjolsness (1993), in which

d(x, y) = −log max_T Pr(x | y, T)

where T is a set of transformation parameters introduced by a visual grammar (Mjolsness 1994) and Pr is the probability that x arises from y without transformations T. We transform our inequality constraints into equality constraints by introducing slack variables, a standard technique from linear programming:

∀j  Σ_{k=1}^{K} m_jk ≤ 1   →   ∀j  Σ_{k=1}^{K+1} m_jk = 1
and likewise for our column constraints. An extra row and column is added to the permutation matrix m to hold our slack variables. These constraints are enforced by applying the Potts glass mean field theory approximations (Peterson and Soderberg 1989) and a Lagrange multiplier and then using an equivalent form of the resulting objective, which employs Lagrange multipliers and an x log x barrier function (Yuille and Kosowsky 1994; Rangarajan et al. 1996; Mjolsness and Garrett 1990):

E_pm(m, t, A) = Σ_{j=1}^{J} Σ_{k=1}^{K} m_jk ||X_j − t − A Y_k||² + g(A) − α Σ_{j=1}^{J} Σ_{k=1}^{K} m_jk
    + (1/β) Σ_{j=1}^{J+1} Σ_{k=1}^{K+1} m_jk (log m_jk − 1)
    + Σ_{j=1}^{J} μ_j ( Σ_{k=1}^{K+1} m_jk − 1 ) + Σ_{k=1}^{K} ν_k ( Σ_{j=1}^{J+1} m_jk − 1 )   (3.2)
In this objective, we are looking for a saddle point. Equation 3.2 is minimized with respect to m, t, and A, which are the correspondence matrix, translation, and affine transform, and is maximized with respect to μ and ν, the Lagrange multipliers that enforce the row and column constraints for m. m is fuzzy, with the degree of fuzziness dependent on β. The above defines a series of distance measures, since given the decomposition of A it is trivial to construct measures that are approximately invariant only under some subset of the transformations (such as rotation and translation). The regularization, g(A), and α terms may also be individually adjusted in an appropriate fashion for a specific problem domain. For example, replacing A with R(θ) in equation 3.1 and removing g(A) would define a new distance measure, which is invariant only under rotation and translation.
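For fixed m, t, and A, the point-matching objective is a direct sum over correspondences. The following sketch (our names; the regularizer g and the α bonus are simplified placeholders) evaluates its basic structure on a trivially matched pair of point sets:

```python
# Sketch evaluating the point-matching objective for given correspondences m,
# translation t, and affine map A: sum of m-weighted squared residuals, plus
# a regularizer g(A), minus a match bonus alpha * sum(m).
import numpy as np

def e_pm(X, Y, m, t, A, alpha=1.0, g=lambda A: 0.0):
    # X: (J,2) points, Y: (K,2) points, m: (J,K) fuzzy match matrix.
    diff = X[:, None, :] - t[None, None, :] - Y[None, :, :] @ A.T
    sq = (diff ** 2).sum(axis=2)          # ||X_j - t - A Y_k||^2 for all j,k
    return (m * sq).sum() + g(A) - alpha * m.sum()

X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[0.0, 0.0], [1.0, 0.0]])
m = np.eye(2)                             # hard, correct correspondence
val = e_pm(X, Y, m, t=np.zeros(2), A=np.eye(2))
print(val)  # -2.0: perfect match, only the -alpha * sum(m) bonus remains
```

The negative α term makes a perfect match score below zero, which is what biases the objective toward matching points rather than leaving them unmatched in the slack row and column.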
3.2 Weighted Graph-Matching Distance Measures. The following distance measure quantifies the degree of dissimilarity between two unlabeled weighted graphs. Given two graphs, represented by adjacency matrices G_jl and g_kp, one can minimize the objective below to find the permutation that best maps G onto g (Rangarajan and Mjolsness 1994; von der Malsburg 1988; Hopfield and Tank 1986):

E_gm(m) = Σ_j Σ_k ( Σ_l G_jl m_lk − Σ_p m_jp g_pk )²

with constraints: ∀j Σ_k m_jk = 1, ∀k Σ_j m_jk = 1, ∀jk m_jk ≥ 0. These constraints are enforced in the same fashion as in equation 3.2, with an x log x barrier function and Lagrange multipliers. The objective is simplified with a fixed point preserving transformation of the form X² → 2σX − σ². The additional variable σ introduced in such a transformation, described as a reversed neuron in Mjolsness and Garrett (1990), is similar to a Lagrange parameter. A self-amplification term is also added to push the match variables toward zero or one. This term (with the γ parameter below) is similarly transformed with a reversed neuron. The resulting objective, equation 3.3, augments E_gm with these barrier, Lagrange, self-amplification, and reversed-neuron terms.
As in Section 3.1, we look for a saddle point. Equation 3.3 is minimized with respect to m and σ, which are the correspondence matrix and the reversed neuron of the transform, and is maximized with respect to μ, ν, and λ, the Lagrange multipliers that enforce the row and column constraints for m and the reversed neuron parameter enforcing the first fixed point transformation. m may be fuzzy, so a given vertex in one graph may partially match several vertices in the other graph, with the degree of fuzziness dependent upon β; however, the self-amplification term dramatically reduces the fuzziness at high β. A second, functionally equivalent, graph-matching objective is also used in the clustering problem (as explained in Section 3.3):

E'_gm(m) = − Σ_{j=1}^{I} Σ_{l=1}^{L} Σ_{k=1}^{K} Σ_{p=1}^{P} G_jl g_kp m_jk m_lp   (3.4)
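In matrix form, the basic graph-matching mismatch these objectives build on is the squared Frobenius norm ||Gm − mg||². A small sketch (ours, with hypothetical toy graphs) shows that a correct permutation between isomorphic graphs leaves zero mismatch:

```python
# Sketch of the rectangular graph-matching mismatch || G m - m g ||_F^2
# for a candidate permutation matrix m between adjacency matrices G and g.
import numpy as np

def graph_mismatch(G, g, m):
    return float(((G @ m - m @ g) ** 2).sum())

G = np.array([[0.0, 1.0], [1.0, 0.0]])       # 2-node graph with one edge
gsm = np.array([[0.0, 1.0], [1.0, 0.0]])     # isomorphic copy
perm = np.array([[0.0, 1.0], [1.0, 0.0]])    # swap the two nodes
print(graph_mismatch(G, gsm, perm))  # 0.0: an isomorphism leaves no mismatch
```

Relaxing the 0/1 entries of m to a fuzzy doubly stochastic matrix, as the text describes, is what allows this combinatorial objective to be optimized by continuous (mean field) methods.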
3.3 The Clustering Objective. The object learning problem is formulated as follows: Given a set of I noisy observations {X_i} of some unknown objects, find a set of B cluster centers {Y_b} and match variables {M_ib} defined as

M_ib = 1 if X_i is in Y_b's cluster, and 0 otherwise,

such that each observation is in only one cluster, and the total distance of all the observations from their respective cluster centers is minimized. The cluster centers are the learned approximations of the original objects. To find {Y_b} and {M_ib}, minimize the cost function

E_cluster(Y, M) = Σ_{i=1}^{I} Σ_{b=1}^{B} M_ib D(X_i, Y_b)

with constraints: ∀i Σ_b M_ib = 1, ∀ib M_ib ≥ 0. D(X_i, Y_b), the distance function, is a measure of dissimilarity between two objects. This problem formulation may be derived from Bayesian inference of a set of object models {Y} from the data {X} they explain (Mjolsness 1993). It is also a clustering objective with a domain-specific distance measure (Gold et al. 1994). The constraints on M are enforced in a manner similar to that described for the distance measure, except that now only the rows of the matrix M need to add to one, instead of both the rows and the columns. The Potts glass mean field theory method is applied and an equivalent form of the resulting objective is used:
E_cluster(Y, M) = Σ_{i=1}^{I} Σ_{b=1}^{B} M_ib D(X_i, Y_b) + (1/β_M) Σ_{i=1}^{I} Σ_{b=1}^{B} M_ib (log M_ib − 1) + Σ_{i=1}^{I} λ_i ( Σ_{b=1}^{B} M_ib − 1 )   (3.5)

Here, the objects are point-sets or weighted graphs. If point-sets are used, the distance measure D(X_i, Y_b) is replaced by equation 3.1; if graphs are used it is replaced by equation 3.3, without the terms that enforce the constraints, or equation 3.4. For example, after replacing the distance measure by equation 3.1, we obtain
E_cluster(Y, M, m, t, A) = Σ_{i=1}^{I} Σ_{b=1}^{B} M_ib [ Σ_j Σ_k m^{ib}_{jk} ||X_ij − t_ib − A_ib Y_bk||² + g(A_ib) − α Σ_j Σ_k m^{ib}_{jk} ]
    + (1/β_m) Σ_{i,b} Σ_{j,k} m^{ib}_{jk} (log m^{ib}_{jk} − 1) + Σ_{i,b} [ Σ_j μ^{ib}_j ( Σ_k m^{ib}_{jk} − 1 ) + Σ_k ν^{ib}_k ( Σ_j m^{ib}_{jk} − 1 ) ]
    + (1/β_M) Σ_{i,b} M_ib (log M_ib − 1) + Σ_i λ_i ( Σ_b M_ib − 1 )   (3.6)

A saddle point is required. The objective is minimized with respect to Y, M, m, t, and A, which are, respectively, the cluster centers, the cluster membership matrix, the correspondence matrices, the translations, and other affine transformations. It is maximized with respect to λ, which enforces the row constraints for M, and μ and ν, which enforce the column and row constraints for m. M is a cluster membership matrix, indicating for each object i which cluster b it falls in, and m_ib is a permutation matrix that assigns to each point in cluster center Y_b a corresponding point in observation X_i. (A_ib, t_ib) gives the affine transform between object i and cluster center b. Both M and m are fuzzy, so a given object may partially fall in several clusters, with the degree of fuzziness depending on β_M and β_m. Therefore, given a set of observations, X, we construct E_cluster and upon finding the appropriate saddle point of that objective, we will have Y, their cluster centers, and M, their cluster memberships. An objective similar to equation 3.6 may be constructed using the graph-matching distance measure in equations 3.3 or 3.4 instead.

4 The Algorithm
4.1 Overview: Clocked Objective Functions. The algorithm to minimize the clustering objectives consists of two loops: an inner loop to minimize the distance measure objective (either equation 3.2 or 3.3) and an outer loop to minimize the clustering objective (equation 3.5). Using coordinate descent in the outer loop results in dynamics similar to the EM algorithm for clustering (Hathaway 1986). The EM algorithm has been similarly used in supervised learning (Jordan and Jacobs 1994). All variables occurring in the distance measure objective are held fixed during this phase. The inner loop uses coordinate ascent/descent, which results in repeated row and column normalizations for m. This is described as a softassign (Gold et al. 1995; Gold and Rangarajan 1996; Rangarajan et al. 1996) (see Section 4.2). The minimization of m and the distance measure variables (either t, A of equation 3.2 or λ, σ of equation 3.3) occurs in an incremental fashion; that is, their values are saved after each inner
loop call from within the outer loop and are then used as initial values for the next call to the inner loop. This tracking of the values of the distance measure variables in the inner loop is essential to the efficiency of the algorithm since it greatly speeds up each inner loop optimization. Most coordinate ascent/descent phases are computed analytically, further speeding up the algorithm. Some poor local minima are avoided by deterministic annealing in both the outer and inner loops. The resulting dynamics can be concisely expressed by formulating the objective as a clocked objective function (Mjolsness and Miranker 1993), which is optimized over distinct sets of variables in phases, as [letting D be the set of distance measure variables (e.g., {A, t} for equation 3.2) excluding the match matrix],

with this special notation employed recursively: E_(x,y): coordinate descent on x, then y, iterated (if necessary); x^A: use analytic solution for x phase. The algorithm can be expressed less concisely in English, as follows:

Initialize D to the equivalent of an identity transform, Y to random values
Begin Outer Loop
    Begin Inner Loop
        Initialize D with previous values
        Find m, D for each ib pair:
            Find m by softassign
            Find D by coordinate descent
    End Inner Loop
    If first time through outer loop, increase β_m and repeat inner loop
    Find M, Y using fixed values of m, D determined in inner loop:
        Find M by softmax, across i
        Find Y by coordinate descent
    Increase β_M, β_m
End Outer Loop

When the distances are calculated for all the X-Y pairs the first time through the outer loop, annealing is needed to minimize the objectives accurately. However, on each succeeding iteration, since good initial estimates are available for D (namely the values from the previous iteration of the outer loop), annealing is unnecessary and the minimization is much faster. The speed of the above algorithm is increased by not recalculating the X-Y distance for a given ib pair when its M_ib membership variable drops below a threshold.

4.2 Inner Loop. The inner loop proceeds in two phases. In phase one, while D are held fixed, m is initialized with a coordinate descent
Steven Gold, Anand Rangarajan, and Eric Mjolsness
796
step, described below, and then iteratively normalized across its rows and columns until the procedure converges (Kosowsky and Yuille 1994). This phase is analogous to a softmax update, except that instead of enforcing a one-way, winner-take-all (maximum) constraint, a two-way, assignment constraint is being enforced. Therefore, we describe this phase as a softassign (Gold et al. 1995; Gold and Rangarajan 1996; Rangarajan et al. 1996). In phase two, m is held fixed and D are updated using coordinate descent. Then β_m is increased and the loop repeats. Let E_dm be the distance measure objective (equation 3.2 or 3.3) without the terms that enforce the constraints (i.e., the x log x barrier function and the Lagrange parameters). In phase one, m is updated with a softassign, which consists of a coordinate descent update:

m^0_jk = exp[-β_m ∂E_dm(X_i, Y_b)/∂m_jk]

And then (also as part of the softassign) m is iteratively normalized across j and k until Σ_j Σ_k |Δm_jk| < ε.
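This two-way normalization can be sketched in a few lines (a hedged illustration; the variable names, convergence tolerance, and iteration cap are assumptions, not the paper's code):

```python
import numpy as np

def softassign(grad, beta_m, eps=1e-6, max_iter=200):
    """Softassign sketch: exponentiate the (negated, scaled) gradient of
    the distance objective, then alternately normalize rows and columns
    until the total change in the match matrix m falls below eps."""
    m = np.exp(-beta_m * grad)
    for _ in range(max_iter):
        prev = m.copy()
        m = m / m.sum(axis=1, keepdims=True)  # normalize across k (rows)
        m = m / m.sum(axis=0, keepdims=True)  # normalize across j (columns)
        if np.abs(m - prev).sum() < eps:
            break
    return m
```

For a positive matrix this alternation (Sinkhorn balancing) converges to a doubly stochastic matrix, which is the two-way assignment constraint described above.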
Using coordinate descent, the D are updated in phase two. If a member of D cannot be computed analytically (such as the terms of A that are regularized), Newton's method is used to compute the root of the function. So if d_n is the nth member of D, then in phase two we update d_n such that
Finally, β_m is increased and the loop repeats. By setting the partial derivatives of E_dm to zero and initializing the Lagrange parameters to zero, the algorithm for phase one may be derived (Rangarajan et al. 1996). Beginning with a small β_m allows minimization over a fuzzy correspondence matrix m, for which a global minimum is easier to find. Raising β_m drives the m's closer to 0 or 1 as the algorithm approaches a saddle point.

4.3 Outer Loop. The outer loop proceeds in three phases: (1) distances are calculated by calling the inner loop, (2) M is projected across b using the softmax function, (3) coordinate descent is used to update Y. Therefore, using softmax, M is updated in phase two:
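This softmax projection can be sketched as follows (a minimal illustration; the exact form of the paper's update is assumed, with D[i, b] standing for the inner-loop distance between object i and cluster b):

```python
import numpy as np

def softmax_memberships(D, beta_M):
    """For each object i, a softmax over clusters b of the negated,
    scaled inner-loop distances D[i, b].  This enforces the one-way
    constraint: the memberships of each object sum to 1 across b."""
    a = -beta_M * D
    a = a - a.max(axis=1, keepdims=True)   # stabilize the exponentials
    M = np.exp(a)
    return M / M.sum(axis=1, keepdims=True)
```

As β_M grows, the memberships sharpen toward a winner-take-all assignment of each object to its nearest cluster.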
Learning with Preknowledge
797
Y, in phase three, is calculated using coordinate descent. Let y_n be the nth member of {Y}. y_n is updated such that (4.2). Then β_M is increased and the loop repeats. When learning prototypical point sets, Y_bn in equation 4.1 will be either the x or y coordinate of a point in the prototype (cluster center). If weighted graphs are being learned, then Y_bn will be a link in the cluster center graph. When clustering graphs, equation 3.3 is used for the distance in equation 4.1, while equation 3.4 is used to calculate Y_bn in equation 4.2. This results in a faster calculation of equation 4.1 but, for equation 4.2, in an easy analytic solution.

5 Methods and Experimental Results
Five series of experiments were run to evaluate the learning algorithms. Point sets were clustered in four experiments and weighted graphs were clustered in the fifth. In each experiment, a set of object models was used. In one experiment handwritten character data were used for the object models and in the other experiments the object models were randomly generated. From each object model, a set of object observations was created by transforming the object model according to the problem domain assumed for that experiment. For example, an object represented by points in two-dimensional space was translated, rotated, scaled, sheared, and permuted to form a new point set. An object represented by a weighted graph was permuted. Independent noise of known variance was added to all real-valued parameters to further distort the object. Parts of the object were deleted and spurious features (points) were added. In this manner, from a set of object models, a larger number of object instances were created. Then, with no knowledge of the original object models or cluster memberships, we clustered the object instances using the algorithms described above. The bulk of our experimental trials were on randomly generated patterns. However, to clearly demonstrate our methods and visually display our results, we will first report the results of the experiment in which we used handwritten character models. 5.1 Handwritten Character Models. An X-windows tool was used to draw handwritten characters with a mouse on a writing pad. The contours of the images were discretized and expressed as a set of points in the plane. Twenty-five points for each character were used. The four characters used as models are displayed in row 1 of Figure 1. Each character model was transformed in the manner described above to create 32
Figure 1: Row (1): Handwritten character models used to generate character instances. These models were not part of the input to the clustering algorithm. Rows (2-5): 16 character instances that (with 112 other characters) were clustered.

character instances (128 characters for all four). Specifically (in units normalized approximately to the height of b in Fig. 1): N(0, 0.02) gaussian noise was added to each point. Each point had a 10% probability of being deleted and a 5% probability of generating a spurious point. The components of the affine transformation were selected from a uniform distribution within the following bounds; translation: ±0.5, rotation: ±27°, log(scale): ±log(0.7), log(vertical shear): ±log(0.7), and log(oblique shear): ±log(0.7). Note in equation 3.1, a = log(scale), b = log(vertical shear), and c = log(oblique shear). In rows 2-5 of Figure 1, 16 of the 128 characters generated are displayed. The clustering algorithm using the affine distance measure of Section 2.1 was run with the 128 characters as input and no knowledge of the cluster memberships. Figure 2 shows the results after 0, 4, 16, 64, 128, and 256 iterations of the algorithm. Note that the initial cluster center configurations (row 1 of Fig. 2) were selected at random from a uniform distribution over a unit square. The original models were reconstructed to high accuracy from the data, up to affine transformations within the allowed ranges.
5.2 Randomly Generated Models. In the next four experiments, the object models (corresponding to the models in row 1 of Fig. 1) were generated at random. The results were evaluated by comparing the object prototypes (cluster centers) formed by each experimental run to the object models used to generate the object instances for that experiment. The distance measures used in the clustering were used for this comparison, i.e., to calculate the distance between the learned prototype and the original object. This distance measure also incorporates the transformations used to create the object instances. The means and standard deviations of these distances were plotted (Fig. 3) over hundreds of trials, varying the object instance generation noise. The straight line appearing on each graph displays the effect of the gaussian noise only. It is the expected object model-object prototype distance if no transformations were applied, no features were deleted or added, and the cluster memberships of the object instances were known. It serves as an absolute lower bound on the accuracy of our learning algorithm. The variance of the real-valued parameter noise was increased in each series of trials until the curve flattened; that is, the object instances became so distorted by noise that no information about the original objects could be recovered by the algorithm. In the first experiment (Fig. 3a), point-set objects were translated, rotated, scaled, and permuted. Initial object models were created by selecting points with a uniform distribution within a unit square. The transformations to create the object instances were selected with a uniform distribution within the following bounds: translation: ±0.5, rotation: ±27°, log(scale): ±log(0.5). For example, within these bounds the largest object instances that are generated may be four times the size of the smallest. One hundred object instances were generated from 10 object models.
All objects contained 20 points. The standard deviation of the gaussian noise was varied from 0.02 to 0.16 in steps of 0.02. At each noise level, there were 15 trials. The data point at each error bar represents 150 distances (15 trials times 10 model-prototype distances for each trial).
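The instance-generation protocol of this experiment can be sketched as follows (an illustrative reconstruction; the function and parameter names are hypothetical, with the bounds taken from the text above):

```python
import numpy as np

def make_instance(model, rng, sigma=0.04, p_del=0.10, p_spur=0.05):
    """Distort a model point set (N x 2 array) as in experiment 1:
    random similarity transform, gaussian parameter noise, deletions,
    spurious points, and a random permutation.  Bounds mirror the text:
    translation +/-0.5, rotation +/-27 degrees, log(scale) +/-log(0.5)."""
    theta = np.deg2rad(rng.uniform(-27, 27))
    s = np.exp(rng.uniform(np.log(0.5), -np.log(0.5)))   # scale in [0.5, 2]
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    t = rng.uniform(-0.5, 0.5, size=2)
    pts = model @ (s * R).T + t                          # transform the model
    pts = pts + rng.normal(0, sigma, size=pts.shape)     # parameter noise
    pts = pts[rng.random(len(pts)) > p_del]              # delete ~10% of points
    n_spur = rng.binomial(len(model), p_spur)            # add ~5% spurious points
    pts = np.vstack([pts, rng.uniform(0, 1, size=(n_spur, 2))])
    return pts[rng.permutation(len(pts))]                # unknown correspondence
```

Repeating this for each model yields the object instances that are handed to the clustering algorithm with no knowledge of memberships.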
Figure 2: Row (1): initial cluster centers (randomly generated). Rows (2-6): character prototypes (cluster centers) after 4, 16, 64, 128, and 256 iterations.
[Figure 3: four panels (a-d), each plotting mean model-prototype distance against the standard deviation of the noise.]
Figure 3: (a) Ten clusters, 100 point sets, 20 points each, scale, rotation, translation, 120 trials; (b) 4 clusters, 64 point sets, 15 points each, affine, 10% deleted, 5% spurious, 140 trials; (c) 8 clusters, 256 point sets, 20 points each, affine, 10% deleted, 5% spurious, 70 trials; (d) 4 clusters, 64 graphs, 10 nodes each, 360 trials.

In the second and third experiments (Fig. 3b and c), point-set objects were translated, rotated, scaled, sheared (both components), and permuted. Each object point had a 10% probability of being deleted and a 5% probability of generating a spurious point. Object points and transformations were randomly generated as in the first experiment, except for these bounds: log(scale): ±log(0.7), log(vertical shear): ±log(0.7), and log(oblique shear): ±log(0.7). In experiment 2, 64 object instances and 4 object models of 15 points each were used. In experiment 3, 256 object instances and 8 object models of 20 points each were used. Noise levels as in experiment 1 were used. Twenty trials were run at each noise level in experiment 2 and 10 trials at each noise level in experiment 3. In the fourth experiment (Fig. 3d), object models were represented by fully connected weighted graphs. The link weights in the initial object models were selected with a uniform distribution between 0 and 1. The objects were then randomly permuted to form the object instances, and uniform noise was added to the link weights. Sixty-four object instances
were generated from 4 object models consisting of 10-node graphs with 100 links. The standard deviation of the noise was varied from 0.01 to 0.13 in steps of 0.01. There were 30 trials at each noise level. In most experiments, at low noise levels (≤ 0.06 for point sets, ≤ 0.03 for graphs), the object prototypes learned were very similar to the object models. As an example of what the plotted distances mean in terms of visual similarity, the average model-prototype distance in the handwritten character example (row 1 of Fig. 1 and row 6 of Fig. 2) was 0.5. Even at higher noise levels, object prototypes similar to the object models are formed, though less consistently. Results from about 700 experiments are plotted, which took several thousand hours of SGI R4400 workstation processor time. The objective for experiment 3 contained close to one million variables and converged in about 1 hr. The convergence times of the objectives of experiments 1, 2, and 4 were 120, 10, and 10 min, respectively. In these experiments the temperature parameter of the inner loop equaled the temperature parameter of the outer loop (β_m = β_M), and both were increased by a factor of 1.03 on each iteration of the outer loop. In the point-set experiments, each trial was a best-of-four series. The object models and object instances were the same for each of the four executions within the trial, but the initial randomly selected starting cluster centers (row 1 of Fig. 2) were varied for each execution, and only the result from the execution with the lowest ending energy was reported. The time for recognition, which simply involved running the distance measures alone, was at most a few seconds for the largest point sets, which contained 25 points.

6 Conclusions
It has long been argued by many that learning in complex domains typically associated with human intelligence requires some type of prior structure or knowledge. We have begun to develop a set of tools that will allow the incorporation of prior structure within learning. Our models incorporate many features needed in complex domains like vision: parameter noise, missing and spurious features, nonrigid transformations. They can learn objects with inherent structure, like graphs. Many experiments have been run on experimentally generated data sets. Several directions for future research hold promise. One might be the learning of OCR data. Second, a supervised learning stage could be added to our algorithms, i.e., we may include some prior knowledge regarding the classification or labeling of our objects. While the experiments in this paper incorporated only a few missing points within the object sets, the point-matching distance measures are capable of matching objects arising from real image data with large amounts of occlusion and with feature points that do not necessarily lie in one-to-one correspondence with each other, as did the artificially generated point sets of this paper (Gold et al.
1995). Supervised learning algorithms may be better able to exploit the power of these distance measures. Finally, more powerful, recently developed graph-matching distance measures (Gold and Rangarajan 1996) may be used that are able to operate on graphs with attributed nodes, multiple link types, and deleted or spurious nodes and links.
Acknowledgments

This work has been supported by AFOSR Grant F49620-92-J-0465 and ONR/DARPA Grant N00014-92-J-4048 and the Yale Neuroengineering and Neuroscience Center.
References

Bienenstock, E., and Doursat, R. 1991. Issues of representation in neural networks. In Representations of Vision: Trends and Tacit Assumptions in Vision Research, A. Gorea, ed. Cambridge University Press, Cambridge, England.

Buhmann, J., and Hofmann, T. 1994. Central and pairwise data clustering by competitive neural networks. In Advances in Neural Information Processing Systems 6, J. Cowan, G. Tesauro, and J. Alspector, eds., pp. 104-111. Morgan Kaufmann, San Mateo, CA.

Buhmann, J., and Kuhnel, H. 1993. Complexity optimized data clustering by competitive neural networks. Neural Comp. 5(1), 75-88.

Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. Ser. B 39, 1-38.

Duda, R., and Hart, P. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.

Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Comp. 4(1), 1-58.

Gold, S., and Rangarajan, A. 1996. A graduated assignment algorithm for graph matching. IEEE Trans. Patt. Anal. Mach. Intell. (In press.)

Gold, S., Mjolsness, E., and Rangarajan, A. 1994. Clustering with a domain specific distance measure. In Advances in Neural Information Processing Systems 6, J. Cowan, G. Tesauro, and J. Alspector, eds., pp. 96-103. Morgan Kaufmann, San Mateo, CA.

Gold, S., Lu, C. P., Rangarajan, A., Pappu, S., and Mjolsness, E. 1995. New algorithms for 2-D and 3-D point matching: Pose estimation and correspondence. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and T. K. Leen, eds., pp. 957-964. MIT Press, Cambridge, MA.

Grimson, E. 1990. Object Recognition by Computer: The Role of Geometric Constraints. MIT Press, Cambridge, MA.

Hathaway, R. 1986. Another interpretation of the EM algorithm for mixture distributions. Statist. Probability Lett. 4, 53-56.

Hopfield, J. J., and Tank, D. W. 1986. Collective computation with continuous variables. In Disordered Systems and Biological Organization, pp. 155-170. Springer-Verlag, Berlin.

Jain, A. K., and Dubes, R. C. 1988. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ.

Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Comp. 6(2), 181-214.

Kosowsky, J. J., and Yuille, A. L. 1994. The invisible hand algorithm: Solving the assignment problem with statistical physics. Neural Networks 7(3), 477-490.

Kurita, T., Sekita, I., and Otsu, N. 1994. Invariant distance measures for planar shapes based on complex autoregressive model. Patt. Recogn. 27(7), 903-911.

Mjolsness, E. 1993. Bayesian inference on visual grammars by relaxation nets. Unpublished manuscript.

Mjolsness, E. 1994. Connectionist grammars for high-level vision. In Artificial Intelligence and Neural Networks: Steps Toward Principled Integration, V. Honavar and L. Uhr, eds., pp. 423-451. Academic Press, San Diego, CA.

Mjolsness, E., and Garrett, C. 1990. Algebraic transformations of objective functions. Neural Networks 3, 651-669.

Mjolsness, E., and Miranker, W. 1993. Greedy Lagrangians for Neural Networks: Three Levels of Optimization in Relaxation Dynamics. Tech. Rep. YALEU/DCS/TR-945, Department of Computer Science, Yale University.

Peterson, C., and Soderberg, B. 1989. A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst. 1(1), 3-22.

Rangarajan, A., and Mjolsness, E. 1994. A Lagrangian relaxation network for graph matching. In IEEE International Conference on Neural Networks (ICNN), Vol. 7, pp. 4629-4634. IEEE Press, New York.

Rangarajan, A., Gold, S., and Mjolsness, E. 1996. A novel optimizing network architecture with applications. Neural Comp. (in press).

Rose, K., Gurewitz, E., and Fox, G. 1990. Statistical mechanics and phase transitions in clustering. Phys. Rev. Lett. 65(8), 945-948.

Shepard, R. 1989. Internal representation of universal algorithms: A challenge for connectionism. In Neural Connections, Mental Computation, L. Nadel, L. Cooper, P. Culicover, and R. Harnish, eds., pp. 104-134. Bradford/MIT Press, Cambridge, MA.

Simard, P., le Cun, Y., and Denker, J. 1993. Efficient pattern recognition using a new transformation distance. In Advances in Neural Information Processing Systems 5, S. Hanson, J. Cowan, and C. Giles, eds., pp. 50-58. Morgan Kaufmann, San Mateo, CA.

von der Malsburg, C. 1988. Pattern recognition by labeled graph matching. Neural Networks 1, 141-148.

Williams, C., Zemel, R., and Mozer, M. 1993. Unsupervised Learning of Object Models. Tech. Rep. AAAI FSS-93-04, Department of Computer Science, University of Toronto.

Yuille, A. L., and Kosowsky, J. J. 1994. Statistical physics algorithms that converge. Neural Comp. 6(3), 341-356.

Received July 18, 1994; accepted October 16, 1995.
Communicated by Eric Baum
Analog versus Discrete Neural Networks Bhaskar DasGupta Department of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
Georg Schnitger
Fachbereich 20, Informatik, Universität Frankfurt, 60054 Frankfurt, Germany
We show that neural networks with three-times continuously differentiable activation functions are capable of computing a certain family of n-bit Boolean functions with two gates, whereas networks composed of binary threshold functions require at least Ω(log n) gates. Thus, for a large class of activation functions, analog neural networks can be more powerful than discrete neural networks, even when computing Boolean functions.

1 Introduction
Artificial neural networks have become a popular model for machine learning and many results have been obtained regarding their application to practical problems. Typically, the network is trained to encode complex associations between inputs and outputs during supervised training cycles, where the associations are encoded by the weights of the network. Once trained, the network will compute an input/output mapping that (hopefully) is a good approximation of the original mapping. In this paper we are mostly interested in feedforward neural networks, i.e., neural networks whose underlying graph is acyclic. We concentrate on computing Boolean functions to allow a comparison of the computing power of analog and discrete neural networks. We start by formally introducing feedforward neural networks (with binary inputs and a single output neuron).
Definition 1.1. Let γ : R → R be given.

(a) The architecture of a γ-net C is given by a directed graph G with a single sink (i.e., a single vertex with no outgoing edge). C is obtained if we additionally specify a labeling of the edges and vertices of G by real numbers. The real number assigned to an edge (respectively vertex) is called its weight (respectively its threshold).

Neural Computation 8, 805-818 (1996)
© 1996 Massachusetts Institute of Technology
(b) C computes a function f_C : {0,1}^n → R as follows. The components of the input vector x = (x_1, ..., x_n) are assigned to the sources of G (i.e., to the vertices of G with no incoming edge). Let u_1, ..., u_r be the immediate predecessors of vertex v. The input for v is then

s_v(x) = Σ_i w_i y_i − t_v,

where w_i is the weight of the edge (u_i, v), t_v is the threshold of v, and y_i is the value assigned to u_i. If v is not the sink, then we assign the value γ[s_v(x)] to v. Otherwise, we assign s_v(x) to v.

(c) The size of C is the number of vertices of its architecture G (excluding the sources). The depth of C is the number of edges on a longest directed path from the sources to the sink.

Since the output of a γ-net is a real value, an appropriate convention has to be adopted when computing a Boolean function. We employ the same convention as in Maass et al. (1991).

Definition 1.2. Let ε be a positive real number and let C be a γ-net. Then C computes the Boolean function F : {0,1}^n → {0,1} with separation ε provided there exists a real number t_C such that for all (x_1, ..., x_n) ∈ {0,1}^n

F(x_1, ..., x_n) = 0 ⟺ f_C(x_1, ..., x_n) ≤ t_C − ε,
F(x_1, ..., x_n) = 1 ⟺ f_C(x_1, ..., x_n) ≥ t_C + ε.
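Definitions 1.1 and 1.2 can be illustrated with a tiny sigmoid net computing AND with separation (the gates, weights, and thresholds below are hand-picked for illustration and are not from the paper):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(gamma, weights, threshold, inputs):
    """One vertex of a gamma-net: the weighted sum of predecessor
    values minus the vertex threshold, passed through gamma."""
    return gamma(sum(w * v for w, v in zip(weights, inputs)) - threshold)

def and_net(x1, x2):
    """A sigmoid net for AND: one hidden sigmoid gate; the sink
    outputs its raw weighted sum s_v(x), as in Definition 1.1(b)."""
    h = gate(sigmoid, [10, 10], 15, [x1, x2])  # hidden gate
    return gate(lambda s: s, [1], 0, [h])      # sink: identity on s_v

# Definition 1.2 holds with t_C = 0.5 and separation eps = 0.4:
for x1 in (0, 1):
    for x2 in (0, 1):
        f = and_net(x1, x2)
        assert (f >= 0.5 + 0.4) == bool(x1 and x2)
        assert (f <= 0.5 - 0.4) == (not (x1 and x2))
```

The point of the separation convention is that the real-valued output must stay a fixed distance away from the threshold t_C on every Boolean input.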
γ-nets (for various activation functions γ and real-valued domains) have been investigated for their approximation power and other related complexity-theoretic properties (DasGupta and Schnitger 1993; Hoffgen 1993; MacIntyre and Sontag 1993; Maass 1993; Zhang 1992). In this paper, however, we consider the computation of Boolean functions only. Our goal is a comparison of the computational power of a large class of γ-nets and of binary threshold networks, i.e., H-nets with the binary threshold function H defined by

H(x) = 0 if x < 0, and H(x) = 1 if x ≥ 0.

(Since we will consider H-nets for the computation of Boolean functions, the binary threshold function will also be used for the output gates.) Threshold networks have been extensively studied in the literature. In Reif (1987) it is shown that binary threshold networks of bounded depth (and size polynomial in n) can compute the sum and the product of n n-bit numbers. In Hajnal et al. (1987) and Goldmann and Hastad (1987), lower bounds on the size of depth-two (respectively depth-three) binary threshold networks are given. Thus binary threshold networks
are known to be powerful, and lower bounds (when computing Boolean functions) are correspondingly rather weak. Moreover, the binary threshold function also plays an important role when considering the approximation power of neural networks with real inputs and outputs. In DasGupta and Schnitger (1993), activation functions are considered that are capable of tightly approximating polynomials and the binary threshold function H with neural nets of bounded depth and small size. These activation functions have, therefore, at least the approximation power of spline networks and thus have considerable approximation power. This fact is used in DasGupta and Schnitger (1993) to show the "equivalence" of various activation functions including the standard sigmoid σ(x) = 1/(1 + e^(-x)), rational functions, and roots (which are not polynomials). The following question, originally posed in Sontag (1990), is the main topic of this paper:
Does there exist a family of Boolean functions f_n : {0,1}^n → {0,1} that can be computed by σ-nets with a constant number of gates, but that requires binary threshold networks with more than a constant number of gates?
Thus, our goal is a comparison of the computational power of analog and discrete neural networks with binary inputs. If the number of inputs is counted when determining network size, then binary threshold networks are equivalent to σ-nets for the case of bounded depth and small weights, even if we do not allow depth to increase (but allow a polynomial increase in size; see Maass et al. 1991). The simulation results in DasGupta and Schnitger (1993) imply that binary threshold networks and σ-nets are equivalent when computing Boolean functions even for unbounded depth and even with large weights, provided the depth of the simulating binary threshold network is allowed to increase by a constant factor and its size is allowed to increase polynomially. But the above equivalence does not hold if we do not count the number of inputs when determining network size, and analog computation may indeed turn out to be more powerful. This was first demonstrated in Maass et al. (1991), who construct a family f_n of Boolean functions that can be computed by γ-nets (for a large class of functions γ) of constant size in depth two. It is then shown that binary threshold networks of depth two require nonconstant size. On the other hand, each function f_n can be computed by binary threshold networks in depth three and constant size. This separation result is therefore depth-dependent. In this paper, we give a separation that holds for arbitrary depth. In particular, we consider the problem of "unary squaring," i.e., the family of languages SQ_n with
SQ_n = {(x, y) : x ∈ {0,1}^n, y ∈ {0,1}^(n²), and [x]² ≥ [y]}
([z] denotes the bit sum of the binary string z; i.e., [z] = Σ_i z_i.) We obtain the following result.

Theorem 1.1. A binary threshold network accepting SQ_n must have size at least Ω(log n). But SQ_n can be computed by a σ-net with two gates.
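In executable form, membership in SQ_n is simply a comparison of bit sums (the helper names below are hypothetical, chosen for illustration):

```python
def bit_sum(z):
    """[z]: the number of 1-bits in the binary string z."""
    return sum(z)

def in_SQ(x, y):
    """(x, y) is in SQ_n iff [x]^2 >= [y], where |x| = n and |y| = n^2."""
    assert len(y) == len(x) ** 2
    return bit_sum(x) ** 2 >= bit_sum(y)

assert in_SQ([1, 1], [1, 1, 1, 1])        # [x]^2 = 4 >= 4 = [y]
assert not in_SQ([1, 0], [1, 1, 0, 0])    # [x]^2 = 1 <  2 = [y]
```

The difficulty for threshold networks lies not in this comparison itself but in forming the square [x]² from the unary representation.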
In fact, we give a generalized upper bound in Theorem 2.1, where γ-nets with two gates are constructed for a large class of functions γ. The lower bound of Theorem 1.1 is "almost" tight, since it is possible to design a binary threshold net of size O(log n · log log n · log log log n) that accepts SQ_n. The proof of Theorem 1.1 uses techniques of circuit theory. We refer the reader to Wegener (1987) and Boppana and Sipser (1990) for a detailed account of circuit theory and restrict ourselves to a few comments. A circuit corresponds (using the notation of this paper) to a Γ-net, where Γ is a class of Boolean functions and where functions in Γ are assigned to the vertices of the net architecture. {AND, OR, NOT}-circuits are perhaps the most prominent circuit class. One of the main tasks of circuit theory is to derive lower bounds for the size and/or depth of circuits computing specific Boolean functions. Little progress has been made in deriving lower bounds for {AND, OR, NOT}-circuits of bounded fan-in. (The fan-in of a circuit is the maximum, over all vertices, of the number of immediate predecessors of a vertex.) For instance, no specific function is known that requires superlinear size. The situation improves considerably if {AND, OR, NOT}-circuits of unbounded fan-in (and small depth) are considered: Razborov (1987) gives exponential lower bounds for the size of circuits computing the parity of n bits in bounded depth. However, as already mentioned above, threshold circuits (or threshold networks in our notation) of bounded depth have an impressive computing power and, perhaps not surprisingly, not even superlinear lower bounds on their size are known. In Wegener (1991) sublinear lower bounds on the size of threshold circuits are given.
There the notion of sensitivity is introduced: a Boolean function f of n variables is called k-sensitive if no setting of n − k variables to arbitrary (zero or one) values transforms f into a constant function of the remaining k free variables. (For example, the parity function of n variables is k-sensitive for any k with 1 ≤ k ≤ n.) We face the problem that SQ_n is not k-sensitive even for large values of k; for instance, if we set all x-bits to 1 and one y-bit to 0, then SQ_n becomes a constant function of the remaining free variables. Also, intermediate forms of sensitivity (i.e., k-sensitivity in which at least a constant fraction of both the x-bits and the y-bits are set) have to be ruled out: setting half of the x-bits to 0 and setting half of the y-bits to 1 again reduces SQ_n to a constant function of the remaining free variables. Therefore, Wegener's lower bound for sensitive functions (Wegener 1991) does not apply. Our lower bound proof does proceed by trying to examine the given circuit gate by gate. But we were not successful in trivializing each gate (i.e.,
by setting input bits to appropriate zero/one values, we were unable to guarantee that a considered gate computes a constant function of the remaining free input bits). Instead we construct a subdomain of the input space that allows us to trivialize threshold gates while not trivializing SQ_n. The rest of the paper is organized as follows. In Section 2 we show that SQ_n can be computed by γ-nets with two gates, where γ is any real-valued activation function at least three times continuously differentiable in some small neighborhood. In Section 3 we prove that any binary threshold network accepting SQ_n must have size at least Ω(log n). A preliminary version of this result appeared in DasGupta and Schnitger (1993).

2 Computing SQ_n by γ-nets
We say that a function γ : R → R has the Q-property if and only if there exist real numbers a and δ > 0 such that (a) γ(x) is at least 3 times continuously differentiable in the interval [a − δ, a + δ] and (b) γ″(a) ≠ 0. Notice that the standard sigmoid σ(x) = 1/(1 + e^(-x)) has the Q-property. Next we show that SQ_n can be computed with relatively large separation by small γ-nets with small weights, provided γ has the Q-property.

Theorem 2.1. Assume that γ has the Q-property. Then there is a γ-net with two gates that accepts SQ_n with separation Ω(1). Moreover, all weights are bounded in absolute value by a polynomial in n.

The proof of Theorem 2.1 utilizes the Q-property to extract square polynomials from γ. In particular, we approximate the quadratic polynomial [x]² − [y] with small error. Finally, the function SQ_n is computed with Ω(1)-separation by comparing the approximated polynomial with a suitable threshold value.
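The two-gate construction behind Theorem 2.1 can be checked numerically for the standard sigmoid (a sketch under the assumptions a = 1 and a generously large L; these constants are illustrative, not the paper's):

```python
import math

def sig(x):
    return 1.0 / (1.0 + math.exp(-x))

def sq_net(x, y, a=1.0, L=50_000.0):
    """Two-gate sigmoid net for SQ_n, following the proof sketch:
    gate 1 computes u = sig(a + [x]/L); the output gate forms a
    weighted sum of u, the x-bits, and the y-bits that approximates
    [x]^2 - [y] (error well below 1/4 for large L), thresholded at -1/2."""
    s1 = sig(a) * (1 - sig(a))        # sigma'(a)
    s2 = s1 * (1 - 2 * sig(a))        # sigma''(a), nonzero at a = 1
    bx, by = sum(x), sum(y)
    u = sig(a + bx / L)               # gate 1
    # invert the Taylor expansion of sig around a to recover bx^2:
    f = (2 * L**2 / s2) * (u - sig(a)) - (2 * L * s1 / s2) * bx - by
    return f >= -0.5                  # accept iff f exceeds t_C = -1/2
```

For small n (here n = 6) the Taylor remainder and the floating-point error together stay far below the 1/4 margin, so the net's decisions should agree with the predicate [x]² ≥ [y], even on the boundary [x]² = [y].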
Proof of Theorem 2.1. Since γ is at least 3-times continuously differentiable in I = [a − δ, a + δ], we obtain, by Taylor's theorem,

γ(a + z) = γ(a) + γ′(a)·z + γ″(a)·z²/2 + r(z),

where r(z) = γ‴(θ_z)/6 · z³ (for z ∈ [−δ, δ] and some θ_z ∈ I). Moreover, by continuity, there is a constant Max with |γ‴(u)| ≤ Max for all u ∈ I. We set

L = max{ 4·Max·n³ / (3·|γ″(a)|), n/δ }.
Bhaskar DasGupta and Georg Schnitger
810
Since 0 ≤ [x] ≤ n, we obtain |[x]/L| ≤ |n/L| ≤ δ and thus

γ(a + [x]/L) = γ(a) + γ′(a)·[x]/L + γ″(a)·[x]²/(2L²) + r([x]/L)

or, equivalently,

(2L²/γ″(a)) · [ γ(a + [x]/L) − γ(a) − γ′(a)·[x]/L ] = [x]² + (2L²/γ″(a)) · r([x]/L).   (2.1)

Also, since |γ‴(θ_{[x]/L})| ≤ Max, we obtain the bound

| (2L²/γ″(a)) · r([x]/L) | ≤ (2L²/|γ″(a)|) · Max·[x]³/(6L³) ≤ Max·n³/(3·|γ″(a)|·L) ≤ 1/4.

The γ-net accepting SQ_n consists of a first neuron computing u(x) = γ(a + [x]/L). The second neuron, the output neuron, computes the weighted sum

(2L²/γ″(a)) · u(x) − (2L²/γ″(a)) · γ′(a)·[x]/L − (2L²/γ″(a)) · γ(a) − [y].

As a consequence of equation 2.1, the output neuron approximates [x]² − [y] with error at most 1/4. Thus, setting t = −1/2 in Definition 1.2, it follows that our γ-net C accepts SQ_n with separation at least 1/4. The weight bound follows, since L = O(n³). □
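The construction is easy to check numerically. The sketch below (ours, not part of the paper) instantiates it for the standard sigmoid with a = 1 and the illustrative choice L = 10n³; a single sigmoid evaluation recovers [x]² within the error 1/4 used above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

a = 1.0                        # point with sigma''(a) != 0
s = sigmoid(a)
d1 = s * (1.0 - s)             # sigma'(a)
d2 = d1 * (1.0 - 2.0 * s)      # sigma''(a); nonzero at a = 1

n = 20
L = 10 * n**3                  # scaling factor; L = O(n^3) as in the proof

def approx_square(c):
    """Approximate c**2 from one sigmoid evaluation, as in equation 2.1."""
    z = c / L
    return (2.0 * L**2 / d2) * (sigmoid(a + z) - s - d1 * z)

# the approximation error stays below 1/4 for every value of [x]
for c in range(n + 1):
    assert abs(approx_square(c) - c**2) < 0.25
```

The output neuron only has to add the linear correction terms and −[y], both of which are affine in the inputs, so two gates suffice.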
3 A Lower Bound for "Unary Squaring"

We have to show the following result.

Theorem 3.1. Any binary threshold network accepting SQ_n must have size at least Ω(log n).

Let SQ_n′ denote the language

SQ_n′ = {(x, y) : x ∈ {0,1}^n, y ∈ {0,1}^(n²), and [x]² = [y]}.

Proposition 3.1. Assume that there exists a binary threshold network of size t_n accepting SQ_n. Then there exists a binary threshold network of size t_n + t_{n+1} + 1 accepting SQ_n′.
Proof. Since there exists a binary threshold network of size t_n accepting SQ_n, there also exists a binary threshold network of size t_n accepting the complement of SQ_n. We show how to compute the language

SQ_n^≤ = {(x, y) : x ∈ {0,1}^n, y ∈ {0,1}^(n²), and [x]² ≤ [y]}.

Consider the binary threshold network of size t_{n+1} for the complement of SQ_{n+1}, with binary inputs x_1, …, x_{n+1} and y_1, …, y_{(n+1)²}. We set x_{n+1} = 0, y_{n²+1} = 1, and y_{n²+2} = … = y_{(n+1)²} = 0. With those bits fixed, this network accepts the input (x_1, x_2, …, x_n, y_1, y_2, …, y_{n²}) if and only if (Σ_{i=1}^n x_i)² < [y] + 1. But this is equivalent to (Σ_{i=1}^n x_i)² ≤ [y]. Hence, t_{n+1} threshold gates can compute SQ_n^≤. But note that SQ_n′ = SQ_n ∧ SQ_n^≤. Hence, SQ_n′ can be computed with t_n + t_{n+1} + 1 threshold gates. □
Thus it suffices to show that any binary threshold network accepting SQ_n′ must have size at least Ω(log n). Let us assume that C_n is a binary threshold network with s gates accepting SQ_n′. Our approach will be to successively trivialize (i.e., partially fix the outcomes of) the gates of C_n by fixing appropriate bits of the input (x, y). The process of trivialization starts with source gates, continues with gates all of whose immediate predecessors have been trivialized, and finally terminates with the sink gate of C_n. Let us assume that the process of trivialization has reached gate g. Moreover, assume that the bits x_{k+1}, …, x_n and y_{l+1}, …, y_{n²} have been fixed with (x_{k+1}, …, x_n) = ξ and (y_{l+1}, …, y_{n²}) = η. Determine α with l = 2αk + k² and set

domain(k, l, ξ, η) = {(x, ξ; y, η) : x ∈ {0,1}^k, y ∈ {0,1}^l, and 2α[x] ≤ [y] ≤ 2α[x] + [x]²}

as well as

SQ_n′(k, l, ξ, η) = {(x, ξ; y, η) ∈ domain(k, l, ξ, η) : (α + [x])² = α² + [y]}.

(α will coincide with the number of x-bits that are set to one. Therefore, α ≥ 0 and α will increase only during the trivialization process.) We demand that the following invariant holds for domain(k, l, ξ, η):

(a) l = 2αk + k², where α is a nonnegative integer.
(b) Each already processed gate is constant over domain(k, l, ξ, η).
(c) For every u ∈ domain(k, l, ξ, η): C_n accepts u if and only if u ∈ SQ_n′(k, l, ξ, η).

In other words, all previously processed gates have been trivialized over domain(k, l, ξ, η), whereas the network C_n still accepts the nontrivial language SQ_n′(k, l, ξ, η). Thus, if g(x, y) denotes the function computed by
gate g for (x, ξ; y, η) ∈ domain(k, l, ξ, η), then we obtain the representation

g(x, y) = 1  if and only if  Σ_{i=1}^k a_i·x_i + Σ_{i=1}^l b_i·y_i ≥ t.

This follows, since g depends only on the free inputs of C_n and a constant threshold value t (t is completely determined by the old threshold value of g in the given circuit and the constant outputs of the already processed and therefore trivialized gates). Next we make a few basic observations concerning the trivialization process:
Proposition 3.2. (a) Before the process of trivialization starts, the invariant holds with k = n, l = n², and (consequently) α = 0. (b) Assume that the sink of C_n has been processed. Then the processed network cannot accept SQ_n′(k, l, ξ, η), unless k = 0. (c) Assume that k decreases by at most a factor of 1/96 for each processed gate. Then C_n must have at least Ω(log n) gates in order to correctly accept SQ_n′.

Proof. (a) is immediate and (c) is a direct consequence of (b). It remains to verify part (b). Assume that the sink of C_n has been processed and domain(k, l, ξ, η) is obtained. Then the sink (and thus the network) is constant for all elements of domain(k, l, ξ, η), but accepts SQ_n′(k, l, ξ, η). This language, however, is not constant for k > 0: we have (0^k, ξ; 0^l, η) ∈ SQ_n′(k, l, ξ, η), whereas an element of domain(k, l, ξ, η) with [x] = 1 and [y] = 2α does not belong to SQ_n′(k, l, ξ, η) (observe that l = 2αk + k² ≥ 2α + 1 > 0 for k > 0). □
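Membership in domain(k, l, ξ, η) and in SQ_n′(k, l, ξ, η) depends only on the counts [x] and [y], so part (b) can also be confirmed by a small brute-force check (ours, over illustrative parameter ranges):

```python
def nonconstant(k, alpha):
    """Does SQ'(k, l, xi, eta) take both truth values on domain(k, l, xi, eta)?"""
    l = 2 * alpha * k + k**2
    values = set()
    for cx in range(k + 1):           # cx = [x], number of free x-bits set to one
        for cy in range(l + 1):       # cy = [y]
            if 2 * alpha * cx <= cy <= 2 * alpha * cx + cx**2:   # domain condition
                values.add((alpha + cx)**2 == alpha**2 + cy)     # SQ' condition
    return values == {True, False}

# for every k > 0 the restricted language is nonconstant
assert all(nonconstant(k, alpha) for k in range(1, 7) for alpha in range(5))
```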
We will assume from now on that n is a power of 96. If this is not the case, then it suffices to set an appropriate number of x- and y-bits to zero such that the number of free x-bits equals the next lower power of 96 (and the number of free y-bits is the square of the number of free x-bits). Assume that the invariant holds for domain(k, l, ξ, η) (with l = 2αk + k²) and that we fix additional bits, leaving K x-bits and L y-bits free. Assume that, among those additionally fixed bits, r x-bits are set to one and s y-bits are set to one. Let ξ′ (respectively η′) be the setting of the fixed x-bits (respectively y-bits) including the additionally fixed bits. When does the invariant hold for domain(K, L, ξ′, η′)?
Proposition 3.3. The invariant holds for domain(K, L, ξ′, η′), provided

• s = 2αr + r², and
• L = 2(α + r)·K + K² (and hence α is replaced by α + r).
Proof. We first show that domain(K, L, ξ′, η′) ⊆ domain(k, l, ξ, η). Let u = (x₁, x₂, ξ; y₁, y₂, η) be an arbitrary element of domain(K, L, ξ′, η′), where the bits in x₂ (respectively y₂) have been freshly fixed. Consequently,

2(α + r)·[x₁] ≤ [y₁] ≤ 2(α + r)·[x₁] + [x₁]²
and therefore

2α·[(x₁, x₂)] ≤ [(y₁, y₂)] ≤ 2α·[(x₁, x₂)] + [(x₁, x₂)]².

Condition (a) of the invariant is satisfied because of the assumed relationship between K and L. Since constant gates remain constant if the domain is further restricted, condition (b) of the invariant is satisfied as well. Now we consider condition (c):

u ∈ SQ_n′(K, L, ξ′, η′)
⟺ (α + r + [x₁])² = (α + r)² + [y₁]
⟺ (α + [(x₁, x₂)])² = α² + (2αr + r²) + [y₁]
⟺ (α + [(x₁, x₂)])² = α² + [(y₁, y₂)]
⟺ u ∈ SQ_n′(k, l, ξ, η)
⟺ C_n accepts u. □
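Since every condition in Proposition 3.3 depends only on the counts [x₁] and [y₁], the proposition can be verified mechanically for concrete parameters (our check, with illustrative values of k, α, and r):

```python
def in_domain(cx, cy, alpha):
    # domain membership depends only on the counts cx = [x], cy = [y]
    return 2 * alpha * cx <= cy <= 2 * alpha * cx + cx**2

def in_sq(cx, cy, alpha):
    # the SQ' equality condition
    return (alpha + cx)**2 == alpha**2 + cy

k, alpha, r = 6, 2, 2
l = 2 * alpha * k + k**2
s = 2 * alpha * r + r**2          # y-bits newly fixed to one (Proposition 3.3)
K = k - r
L = 2 * (alpha + r) * K + K**2
assert l - s == L                 # exactly the right number of free y-bits remains

for cx1 in range(K + 1):
    for cy1 in range(L + 1):
        if not in_domain(cx1, cy1, alpha + r):
            continue
        cx, cy = cx1 + r, cy1 + s          # counts including the newly fixed bits
        assert in_domain(cx, cy, alpha)    # the new domain lies inside the old one
        assert in_sq(cx1, cy1, alpha + r) == in_sq(cx, cy, alpha)
```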
Hence, to complete the proof of Theorem 3.1, we have to perform the trivialization process such that

• k, the number of free x-bits, decreases by at most a factor of 1/96 for each processed gate, and
• the invariant is maintained whenever a gate is processed (by setting 2αr + r² new y-bits to one, if r new x-bits are set to one).
We have to describe the trivialization of gate g. Observe that l = 2αk + k² holds. Moreover, remember that g computes the function g(x, y) = 1 ⟺ Σ_{i=1}^k a_i·x_i + Σ_{i=1}^l b_i·y_i ≥ t. Throughout the trivialization process we will assume that k = 96^i for some positive integer i. First we will reduce k by a factor of at most 1/96 after one step of the trivialization process. If we are left with k′ > 96^(i−1) free bits, then an additional k′ − 96^(i−1) bits can be set to zero without violating the invariant. Our first goal is to enforce that all a_i's have the same sign and that all b_i's have the same sign. To achieve an identical sign for the a_i's, we set an appropriate collection of k/2 x-bits to zero. We repeat this procedure for the y-bits by setting l/2 y-bits to zero, but we also set an additional number of k²/4 y-bits to zero. Thus, with

k₁ = k/2 and l₁ = l/2 − k²/4,
we have l₁ = [αk + (k²/2)] − (k²/4) and therefore l₁ = 2αk₁ + k₁². Proposition 3.3 guarantees that the invariant holds for domain(k₁, l₁, ξ, η) [where ξ (respectively η) consists of all fixed x-bits (respectively y-bits)]. We now face, after appropriately renumbering positions if necessary, the following situation:

(a) Bits x₁, …, x_{k₁} and y₁, …, y_{l₁} are free, and
(b) |a₁| ≤ … ≤ |a_{k₁}| as well as |b₁| ≤ … ≤ |b_{l₁}|.

Case 1. The x- and y-weights are both nonnegative.

Set r = k₁/2 and s = 2αr + r². Let β₁ = Σ_{i=k₁−r+1}^{k₁} a_i and β₂ = Σ_{i=l₁−s+1}^{l₁} b_i.

Case 1.1. β₁ + β₂ ≥ t. We set the last r bits of x and the last s bits of y to one (i.e., x_{k₁−r+1} = … = x_{k₁} = 1 and y_{l₁−s+1} = … = y_{l₁} = 1). Since the x- and y-weights are nonnegative, gate g has been trivialized: its output will always be one. Thus k₁ − r = r free x-bits and l₁ − s free y-bits remain. Observe that

l₁ − s = 2αk₁ + k₁² − (2αr + r²) = 2αr + 3r² = 2(α + r)·r + r²

and the invariant is satisfied with Proposition 3.3. We are left with r = k/4 > k/96 free x-bits.
Case 1.2. β₁ + β₂ < t. This time we set the last r bits of x as well as the last s + 2r² bits of y to zero. Observe first that r bits of x and l₁ − s − 2r² = 2αr + r² bits of y remain free. Next observe that Σ_{i=1}^r a_i ≤ β₁ and Σ_{i=1}^{2αr+r²} b_i ≤ β₂ (since l₁ − s − 2r² = 2αr + r² = s ≤ s + 2r²). Hence, gate g will always output zero and thus has been trivialized. The invariant is guaranteed with Proposition 3.3, since α is unchanged. Again, r = k/4 > k/96 free x-bits remain.

Case 2. The x- and y-weights are both nonpositive. The construction is analogous to Case 1.

Case 3. The x-weights are nonnegative and the y-weights are nonpositive. Let κ = k₁/3. Observe that l₁ = 2α·(3κ) + (3κ)² = 6ακ + 9κ². First we partition the indices for the free x-bits into the three classes S_x = {1, …, κ}, M_x = {κ + 1, …, 2κ}, and L_x = {2κ + 1, …, 3κ}. Analogously, we three-partition the indices for the free y-bits into the sets S_y (of the first 2ακ + 3κ² y-positions), M_y (of the second 2ακ + 3κ² y-positions), and L_y (of the last 2ακ + 3κ² y-positions). Let r be an integer. We say that r is legal, provided κ/16 ≤ r ≤ κ (and thus r ≥ k/96). We will make two attempts at trivializing gate g. In both attempts r bits of x (and 2αr + r² bits of y) will be fixed to one. Also,
r ≥ k/96 bits of x [and 2(α + r)·r + r² = 2αr + 3r² bits of y] will remain free. Thus, by Proposition 3.3, the invariant still holds and we have to ensure only that gate g will be additionally trivialized. In the first (second) attempt, we try to force g to be constantly zero (one). Then we show that one of the two attempts has to be successful. In both attempts we only set x-bits with positions in M_x and y-bits with positions in M_y to one. Let ā = (1/|M_x|)·Σ_{i∈M_x} a_i, b̄ = (1/|M_y|)·Σ_{i∈M_y} b_i, and ω = ā + 2α·b̄. Observe that b̄ is nonpositive.
Attempt 1. Trying to set gate g to be constantly zero. To keep the weighted sum of gate g as small as possible, we leave only the r bits of x corresponding to the first r positions of S_x free. Also, only the 2αr + 3r² bits of y corresponding to the first 2αr + 3r² positions of L_y are left free. We then set x_i (for the first r positions i ∈ M_x) as well as y_i (for the last 2αr + r² positions i ∈ M_y) to one. The remaining bits to be fixed are all set to zero. Assume for the moment that all free bits are set to 0. Then the weighted sum of gate g will be upper-bounded by

r·ā + (2αr + r²)·b̄ = r·(ā + 2α·b̄) + r²·b̄ = r·ω + r²·b̄.

This follows, since the x-weights are in increasing order and hence the average of the first r weights of M_x is not bigger than the overall average ā. Moreover, the y-weights are in decreasing order and hence the average of the last 2αr + r² weights of M_y is not bigger than the overall average b̄. How large can the contribution of the free bits be, when (say) r′ free bits of x are set to one? Since the condition 2(α + r)·[x] ≤ [y] ≤ 2(α + r)·[x] + [x]² has to be satisfied and since we are trying to maximize the weighted sum of gate g, as few y-bits as possible [namely 2(α + r)·r′ bits] will be set to one. Thus the contribution of the free bits is at most

r′·ā + 2(α + r)·r′·b̄ = r′·(ā + 2α·b̄ + 2r·b̄) = r′·(ω + 2r·b̄).

This follows, since the x-weights are in increasing order and hence the average of any r′ weights of S_x is not bigger than the average ā of the weights in M_x. Moreover, the y-weights are in decreasing order and hence the average of any 2(α + r)·r′ weights of L_y is not bigger than the average b̄ of the weights in M_y. Summarizing, the weighted sum of g is either upper-bounded by

• r·ω + r²·b̄, if ω + 2r·b̄ ≤ 0, or by
• 2r·ω + 3r²·b̄, otherwise.

Consequently we are done, if we can find a legal value for r such that the respective upper bound is less than t, the threshold value of gate g. But let us assume that our first attempt fails for all legal values of r.
Attempt 2. Trying to set gate g to be constantly one. To make the weighted sum of gate g as large as possible, we leave only the r bits of x corresponding to the first r positions of L_x free. Also, the 2αr + 3r² bits of y corresponding to the first 2αr + 3r² positions of S_y are left free. We set x_i (for the last r positions i ∈ M_x) as well as y_i (for the first 2αr + r² positions i ∈ M_y) to one. The remaining bits to be fixed are all set to zero. Assume for the moment that all free bits are set to 0. Then the weighted sum of gate g will be lower-bounded by

r·ā + (2αr + r²)·b̄ = r·ω + r²·b̄.

How small can the contribution of the free bits be, when (say) r′ free bits of x are set to one? Since the condition 2(α + r)·[x] ≤ [y] ≤ 2(α + r)·[x] + [x]² has to be satisfied and since the weighted sum of gate g should be minimized, as many y-bits as possible [namely 2(α + r)·r′ + (r′)² bits] will be set to one. And we obtain a contribution of at least

r′·ā + [2(α + r)·r′ + (r′)²]·b̄ = r′·(ω + 2r·b̄) + (r′)²·b̄.

Summarizing, the weighted sum of g is either lower-bounded by

• 2r·ω + 4r²·b̄, if ω + 3r·b̄ ≤ 0, or by
• r·ω + r²·b̄, otherwise.

Consequently we are done, if we can find a legal value for r such that the respective lower bound is greater than or equal to t, the threshold value of gate g. But let us assume that this second attempt also fails.

Failure of Both Attempts. Let

I₁ = {r : ω + 3r·b̄ > 0},
I₂ = {r : ω + 3r·b̄ ≤ 0, ω + 2r·b̄ > 0}, and
I₃ = {r : ω + 2r·b̄ ≤ 0}.

Case 3.1. ω + 3(κ/4)·b̄ ≤ 0. Set r₀ = κ/2 and observe that r₀ and 2r₀ are legal values for r. As a consequence of the case assumption, we obtain ω + 2r₀·b̄ ≤ ω + 3(κ/4)·b̄ ≤ 0 and hence [r₀, ∞) ⊆ I₃. Since both attempts fail,

t ≤ (2r₀)·ω + (2r₀)²·b̄ and 2r₀·ω + 4r₀²·b̄ < t.

But this is impossible, and one of the two attempts has to succeed for a legal value of r.
Case 3.2. ω + 3(κ/4)·b̄ > 0. Let r₁ = κ/4 and observe that r₁ and r₁/4 are legal values of r. Moreover, (−∞, r₁] ⊆ I₁. Since both attempts fail,

r₁·ω + r₁²·b̄ < t ≤ 2(r₁/4)·ω + 3(r₁/4)²·b̄.

As a consequence,

(r₁/2)·ω < −(13/16)·r₁²·b̄, and hence ω + 3r₁·b̄ < (11/8)·r₁·b̄ ≤ 0.

We again arrive at a contradiction, and one of the two attempts has to succeed for a legal value of r.

Case 4. The x-weights are nonpositive and the y-weights are nonnegative. This case is analogous to Case 3, and the proof of Theorem 3.1 is complete. □

Remark 3.1. (a) The lower bound of Theorem 3.1 is "almost" tight. It is quite easy to construct a binary threshold network of O(log n) gates that computes the binary representation of [x]. Now we apply the Schönhage-Strassen multiplication algorithm to obtain a binary threshold network of size O(log n · log log n · log log log n) that computes all the bits of [x]². Hence SQ_n can be computed with O(log n · log log n · log log log n) threshold gates (with a final gate comparing [x]² and [y]).

(b) A better separation of binary threshold networks and γ-nets might be possible by considering the language L of binary squaring, i.e.,
L = {(x, y) : x ∈ {0,1}^n, y ∈ {0,1}^(2n), such that y is the binary representation of the square of the number encoded in binary by x}.
The problem of deriving a superlogarithmic lower bound for networks with weights of unbounded size seems to be difficult, however.

Acknowledgment

Partially supported by NSF Grant CCR-9114545.

References

Boppana, R., and Sipser, M. 1990. The complexity of finite functions. In Handbook of Theoretical Computer Science: Algorithms and Complexity, J. van Leeuwen, ed. MIT Press, Cambridge, MA.

DasGupta, B., and Schnitger, G. 1993. The power of approximating: A comparison of activation functions. In Advances in Neural Information Processing Systems 5, C. L. Giles, S. J. Hanson, and J. D. Cowan, eds., pp. 615-622. Morgan Kaufmann, San Mateo, CA.
818
Bhaskar DasGupta and Georg Schnitger
Goldmann, M., and Håstad, J. 1991. On the power of small-depth threshold circuits. Comput. Complexity 1(2), 113-129.

Hajnal, A., Maass, W., Pudlák, P., Szegedy, M., and Turán, G. 1987. Threshold circuits of bounded depth. Proc. 28th IEEE Symp. Found. Computer Sci., 99-110.

Höffgen, K.-U. 1993. Computational limitations on training sigmoidal neural networks. Inform. Process. Lett. 46, 269-274.

Maass, W. 1993. Bounds for the computational power and learning complexity of analog neural nets. Proc. 25th ACM Symp. Theory Comput., 335-344.

Maass, W., Schnitger, G., and Sontag, E. D. 1991. On the computational power of sigmoid versus boolean threshold circuits. Proc. 32nd Annu. Symp. Found. Computer Sci., 767-776.

Macintyre, A., and Sontag, E. D. 1993. Finiteness results for sigmoidal 'neural' networks. Proc. 25th Annu. ACM Symp. Theory Comput., 325-334.

Razborov, A. A. 1987. Lower bounds on the size of bounded depth networks over a complete basis with logical addition. Math. Notes Acad. Sci. USSR 41(4), 333-338.

Reif, J. H. 1987. On threshold circuits and polynomial computation. Proc. 2nd Annu. Struct. Complexity Theory Conf., 118-123.

Sontag, E. D. 1990. Comparing sigmoids and heavisides. Proc. Conf. Inform. Sci. Syst., 654-659.

Wegener, I. 1987. The Complexity of Boolean Functions. Wiley-Teubner Series in Computer Science, New York.

Wegener, I. 1991. The complexity of the parity function in unbounded fan-in, unbounded depth circuits. Theor. Comput. Sci. 85, 155-170.

Zhang, X.-D. 1992. Complexity of neural network learning in the real number model. Preprint, Computer Science Department, University of Massachusetts.
Received January 30, 1995; accepted October 9, 1995.
Communicated by John Platt and Todd Leen
On the Relationship between Generalization Error, Hypothesis Complexity, and Sample Complexity for Radial Basis Functions

Partha Niyogi
Federico Girosi

Center for Biological and Computational Learning and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
Feedforward networks together with their training algorithms are a class of regression techniques that can be used to learn to perform some task from a set of examples. The question of generalization of network performance from a finite training set to unseen data is clearly of crucial importance. In this article we first show that the generalization error can be decomposed into two terms: the approximation error, due to the insufficient representational capacity of a finite sized network, and the estimation error, due to insufficient information about the target function because of the finite number of samples. We then consider the problem of learning functions belonging to certain Sobolev spaces with gaussian radial basis functions. Using the above-mentioned decomposition we bound the generalization error in terms of the number of basis functions and number of examples. While the bound that we derive is specific for radial basis functions, a number of observations deriving from it apply to any approximation technique. Our result also sheds light on ways to choose an appropriate network architecture for a particular problem and the kinds of problems that can be effectively solved with finite resources, i.e., with a finite number of parameters and finite amounts of data.
1 Introduction
Many problems in learning theory can be effectively modeled as learning an input-output mapping on the basis of limited evidence of what this mapping might be. The mapping usually takes the form of some unknown function between two spaces, and the evidence is often a set of labeled, noisy examples, i.e., (x, y) pairs that are consistent with this function. On the basis of this data set, the learner tries to infer the true function. The unknown target function is assumed to belong to

Neural Computation 8, 819-842 (1996) © 1996 Massachusetts Institute of Technology
some class F (the concept class). Typical examples of concept classes are classes of indicator functions, boolean functions, Sobolev spaces, etc. The learner is provided with a finite data set. For our purposes we assume that the data are drawn by sampling the input/output space X × Y independently according to some unknown probability distribution. On the basis of these data, the learner then develops a hypothesis (another function belonging to the hypothesis class H ⊆ F) about the identity of the target function. Hypothesis classes could also be of different kinds. For example, they could be classes of boolean functions, polynomials, multilayer perceptrons, radial basis functions, and so on. If, as more and more data become available, the learner's hypothesis becomes closer and closer to the target and converges to it in the limit, the target is said to be learnable. The error between the learner's hypothesis and the target function is defined to be the generalization error, and for the target to be learnable the generalization error should go to zero as the data go to infinity. While learnability is certainly a very desirable quality, it requires the fulfillment of two important criteria. First, there is the issue of the representational capacity (or hypothesis complexity) of the hypothesis class. This must have sufficient power to represent or closely approximate the concept class. Otherwise, for some target function f ∈ F, the best hypothesis h in H might be far away from it. The error that this best hypothesis makes is formalized later as the approximation error. Second, we do not have infinite data but only some finite random sample set from which we construct a hypothesis. This hypothesis constructed from the finite data might be far from the best possible hypothesis, h, resulting in an additional error. This is formalized later as the estimation error.
The amount of data needed to ensure a small estimation error is referred to as the sample complexity of the problem. The hypothesis complexity, the sample complexity, and the generalization error are related. If the class H is very large, or in other words has high complexity, then for the same estimation error, the sample complexity increases. If the hypothesis complexity is small, the sample complexity is also small, but now for the same estimation error the approximation error is high. This point has been developed in terms of the bias-variance trade-off by Geman et al. (1992). The bias term corresponds to the approximation error, and the variance corresponds to the estimation error. Other authors have discussed this more generally in the statistics literature (Rissanen 1989; Vapnik 1982). The purpose of this paper is two-fold. First, we formalize the problem of learning from examples so as to highlight the relationship between hypothesis complexity, sample complexity, and generalization error. Second, we explore this relationship in the specific context of radial basis function networks (Moody and Darken 1989; Poggio and Girosi 1990; Powell 1992). Specifically, we are interested in asking the following questions about radial basis functions.
Radial Basis Functions
821
Imagine you were interested in solving a particular problem (regression or pattern classification) using radial basis function networks. Then, how large must the network be and how many examples do you need to draw so that you are guaranteed with high confidence to do very well? Conversely, if you had a finite network and a finite amount of data, what are the kinds of problems you could solve effectively? Clearly, if one were using a network with a finite number of parameters, then its representational capacity would be limited and, therefore, even in the best case we would make an approximation error. Drawing upon results in approximation theory (Lorentz 1986), several researchers (Cybenko 1989; Barron 1993; Hornik et al. 1989; Mhaskar and Micchelli 1992; Mhaskar 1993) investigated the approximating power of feedforward networks, showing how, as the number of parameters goes to infinity, the network can approximate any continuous function. These results ignore the question of learnability from finite data. For a finite network, due to finiteness of the data, we make an error in estimating the parameters and consequently have an estimation error in addition to the approximation error mentioned earlier. Using results from Vapnik and Chervonenkis (Vapnik 1982; Vapnik and Chervonenkis 1971) and Pollard (1984), work has also been done (Haussler 1992; Baum and Haussler 1989) on the sample complexity of finite networks, showing how, as the data go to infinity, the estimation error goes to zero, i.e., the empirically optimized parameter settings converge to the optimal ones for that class. However, since the number of parameters is fixed and finite, even the optimal parameter setting might yield a function that is far from the target. This issue is left unexplored by Haussler (1992) in an excellent investigation of the sample complexity question. In this article we explore the errors due to both finite parameters and finite data in a common setting.
For the total generalization error to go to zero, both the number of parameters and the number of data have to go to infinity, and we provide rates at which they grow for learnability to result. Further, as a corollary, we are able to provide a principled way of choosing the optimal number of parameters so as to minimize expected errors. It should be mentioned here that Barron (1994) and White (1990) have also provided treatments of this problem for different hypothesis and concept classes. The plan of the article is as follows: in Section 2 we provide a general formalization of the problem. We then provide in Section 3 a precise statement of a specific problem along with our main result, whose proof can be found in Niyogi and Girosi (1994). In Section 4 we discuss what could be the implications of our result in practice; we provide several qualifying remarks and a numerical simulation. Finally we conclude in Section 5 with a reiteration of our essential points.
2 Definitions and Statement of the Problem
To make a precise statement of the problem we first need to introduce some terminology and to define a number of mathematical objects.

2.1 Random Variables and Probability Distributions. Let X and Y be two arbitrary sets, such that an unknown probability distribution P(x, y) is defined on X × Y. We will call x and y the independent variable and response, respectively, where x and y range over the generic elements of X and Y. In most cases X will be a subset of a k-dimensional Euclidean space and Y a subset of the real line. The probability distribution P(x, y) can also be written as P(x, y) = P(x)P(y|x), where P(y|x) is the conditional probability of the response y given the independent variable x, and P(x) is the marginal probability of the independent variable. Expected values with respect to P(x, y) or P(x) will always be indicated by E[·]. In practical cases we are provided with examples of this probabilistic relationship, that is, with a data set D_l ≡ {(x_i, y_i) ∈ X × Y}_{i=1}^l, obtained by sampling the set X × Y l times according to P(x, y). From the definition of P(x, y) we see that we can think of an element (x_i, y_i) of the data set D_l as obtained by sampling X according to P(x), and then sampling Y according to P(y|x). The interesting problem is, given an instance of x that does not appear in the data set D_l, to give an estimate of what we expect y to be. Formally, we define an estimator to be any function f: X → Y. Clearly, since the independent variable x need not determine the response y uniquely, any estimator will make a certain amount of error. However, it is interesting to study the problem of finding the best possible estimator, given the knowledge of the data set D_l, and this problem will be defined as the problem of learning from examples, where the examples are represented by the data set D_l.

2.2 The Expected Risk and the Regression Function. Having defined an estimator, we now need to define a measure of how good an estimator is.
Suppose we sample X × Y according to P(x, y), obtaining the pair (x, y). A possible measure of the error of the estimator f at the point x is [y − f(x)]². The average error of the estimator f is now given by the functional

I[f] ≡ E[(y − f(x))²] = ∫_{X×Y} (y − f(x))² P(x, y) dx dy,

that is usually called the expected risk of f for the specific choice of the error measure. We are now interested in finding the estimator that minimizes the expected risk over some domain F. We will assume in the following that F is some space of differentiable functions, for example, the space of functions with m bounded derivatives.
Assuming that the problem of minimizing I[f] in F is well posed, it is easy to obtain its solution. In fact, the expected risk can be decomposed in the following way (see the Appendix):

I[f] = E[(f(x) − f₀(x))²] + E[(y − f₀(x))²]   (2.1)

where f₀(x) is the so-called regression function, that is, the conditional mean of the response given the independent variable:

f₀(x) ≡ ∫_Y y P(y|x) dy.

From equation 2.1 it is clear that the regression function is the function that minimizes the expected risk in F, and is therefore the best possible estimator. Hence,
f₀(x) = arg min_{f∈F} I[f].

While the first term in equation 2.1 depends on the choice of the estimator f, the second term is an intrinsic limitation due to the probabilistic nature of the problem, and therefore even the regression function will make an error equal to E[(y − f₀(x))²]. The problem of learning from examples can now be reformulated as the problem of reconstructing the regression function f₀, given the example set D_l. It should also be pointed out that this framework includes pattern classification, and in this case the regression (target) function corresponds to the Bayes discriminant function (Gish 1990; Hampshire and Pearlmutter 1990; Richard and Lippman 1991).

2.3 The Empirical Risk. In practice the expected risk I[f] is unknown because P(x, y) is unknown, and our only source of information is the data set D_l. Using this data set, the expected risk can be approximated by the empirical risk I_emp:

I_emp[f] ≡ (1/l) Σ_{i=1}^l (y_i − f(x_i))².   (2.2)
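A quick numerical illustration (ours; the target f₀, the alternative estimator f, and the noise level are arbitrary choices): for a large sample, the empirical risk of an estimator splits, up to a Monte Carlo cross term that vanishes as l grows, into the two terms of the decomposition 2.1.

```python
import math
import random

random.seed(0)

f0 = math.sin                                  # regression (target) function
f = lambda x: 0.8 * math.sin(x) + 0.1          # some other estimator
sigma = 0.5                                    # noise level of the response

N = 200_000
xs = [random.uniform(-math.pi, math.pi) for _ in range(N)]
ys = [f0(x) + random.gauss(0.0, sigma) for x in xs]

risk = sum((y - f(x))**2 for x, y in zip(xs, ys)) / N           # ~ I[f]
approximation = sum((f(x) - f0(x))**2 for x in xs) / N          # ~ E[(f - f0)^2]
intrinsic = sum((y - f0(x))**2 for x, y in zip(xs, ys)) / N     # ~ E[(y - f0)^2]

# equation 2.1: the two terms account for the whole expected risk
assert abs(risk - (approximation + intrinsic)) < 0.01
```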
For each given estimator f, the empirical risk is a random variable, and under fairly general assumptions, by the weak law of large numbers (Dudley 1989), it converges in probability to the expected risk as the number of data points goes to infinity:

lim_{l→∞} P{ |I[f] − I_emp[f]| > ε } = 0   ∀ε > 0.   (2.3)
Therefore a common strategy consists in estimating the regression function as the function that minimizes the empirical risk, since it is “close” to the expected risk if the number of data is high enough. However, equation 2.3 states only that the expected risk is “close” to the empirical
risk for each given f, and not for all f simultaneously. Consequently, the fact that the empirical risk converges in probability to the expected risk when the number, l, of data points goes to infinity does not guarantee that the minimum of the empirical risk will converge to the minimum of the expected risk (the regression function). As pointed out and analyzed in the work of Vapnik and Chervonenkis (1971, 1991) and Pollard (1984), the notion of uniform convergence in probability has to be introduced, and it will be discussed in other parts of this paper.

2.4 The Problem. The argument of the previous section suggests that an approximate solution of the learning problem consists in finding the minimum of the empirical risk, that is, solving

min_{f∈F} I_emp[f].
However, this problem is often ill-posed, because, for most choices of $\mathcal{F}$, it will have an infinite number of solutions. In fact, all the functions in $\mathcal{F}$ that interpolate the data points $(x_i, y_i)$, that is, with the property $f(x_i) = y_i,\ i = 1, \ldots, l$, will give a zero value for $I_{\rm emp}$. This problem is very common in approximation/regression theory and statistics and can be approached in several ways. A common technique consists in restricting the search for the minimum to a smaller set than $\mathcal{F}$. We consider the case in which this smaller set is a family of parametric functions, that is, a family of functions defined by a certain number of real parameters. The choice of a parametric representation also provides a convenient way to store and manipulate the hypothesis function on a computer. We will denote by $H_n$ a generic subset of $\mathcal{F}$ whose elements are parameterized by a number of parameters proportional to $n$. Moreover, we will assume that the sets $H_n$ form a nested family, that is, $H_1 \subset H_2 \subset \cdots \subset H_n \subset \cdots \subset H$. For example, $H_n$ could be the set of polynomials in one variable of degree $n - 1$, radial basis functions with $n$ centers, or multilayer perceptrons with $n$ sigmoidal hidden units. Therefore, we choose as an approximation to the regression function the function $f_{n,l}$ defined as
$$f_{n,l} = \arg\min_{f \in H_n} I_{\rm emp}[f] \qquad (2.4)$$
It should be pointed out that the sets $H_n$ and $\mathcal{F}$ have to be matched with each other. One could look at this matching from both directions. For a class $\mathcal{F}$, one might be interested in an appropriate choice of $H_n$. Conversely, for a particular choice of $H_n$, one might ask what classes $\mathcal{F}$ can be effectively solved with this scheme. Thus, we see that in principle we would like to minimize $I[f]$ over the large class $\mathcal{F}$, obtaining thereby the regression function $f_0$. What we do in practice is to minimize the empirical risk $I_{\rm emp}[f]$ over the smaller class
$H_n$, obtaining the function $f_{n,l}$. Assuming we have solved all the computational problems related to the actual computation of the estimator $f_{n,l}$, the main problem now is how good $f_{n,l}$ is. Independently of the measure of performance that we choose when answering this question, we expect $f_{n,l}$ to become a better and better estimator as $n$ and $l$ go to infinity. In fact, when $l$ increases, our estimate of the expected risk improves and our estimator improves. The case of $n$ is trickier. As $n$ increases, we have more parameters to model the regression function, and our estimator should improve. However, at the same time, because we have more parameters to estimate with the same amount of data, our estimate of the expected risk deteriorates. Thus we now need more data, and $n$ and $l$ have to grow as a function of each other for convergence to occur. At what rate and under what conditions the estimator $f_{n,l}$ improves depends on the properties of the regression function, that is, on $\mathcal{F}$, and on the approximation scheme we are using, that is, on $H_n$.
2.5 Bounding the Generalization Error. Recall that our goal is to minimize the expected risk $I[f]$ over the set $\mathcal{F}$. If instead we were to choose our estimator from $H_n$, we would obtain $f_n$ as
$$f_n = \arg\min_{f \in H_n} I[f]$$
However, we can only minimize the empirical risk $I_{\rm emp}$, obtaining as our real estimate the function $f_{n,l}$. Our goal is to bound the distance from $f_{n,l}$ to $f_0$. If we choose to measure the distance in the $L^2(P)$ metric, the quantity that we need to bound, which we will call the generalization error, is
$$\|f_0 - f_{n,l}\|^2_{L^2(P)} = E[(f_0 - f_{n,l})^2]$$
There are two main factors that contribute to the generalization error, and we are going to analyze them separately for the moment. 1. A first source of error is due to the fact that we are trying to approximate an infinite dimensional object, the regression function $f_0 \in \mathcal{F}$, with a function defined by a finite number of parameters. We call this the approximation error, and we measure it by the quantity $E[(f_0 - f_n)^2]$. The approximation error can be expressed in terms of the expected risk using the decomposition (equation 2.1) as
$$E[(f_0 - f_n)^2] = I[f_n] - I[f_0] \qquad (2.5)$$
Notice that the approximation error does not depend on the data set $D_l$, but depends only on the approximating power of the class $H_n$, and can be naturally studied within the framework of approximation theory. In the following we will always assume that it is possible to bound the
approximation error as follows:
$$E[(f_0 - f_n)^2] \le \varepsilon(n)$$
where $\varepsilon(n)$ is a function that goes to zero as $n$ goes to infinity if $H$ is dense in $\mathcal{F}$. In other words, as the number $n$ of parameters gets larger, the representation capacity of $H_n$ increases and allows a better and better approximation of the regression function $f_0$. This issue has been studied by a number of researchers (Cybenko 1989; Hornik et al. 1989; Jones 1992; Barron 1993; Mhaskar and Micchelli 1992; Mhaskar 1993) in the neural networks community. 2. Another source of error comes from the fact that, due to finite data, we minimize the empirical risk $I_{\rm emp}[f]$, and obtain $f_{n,l}$, rather than minimizing the expected risk $I[f]$, and obtaining $f_n$. As the number of data points goes to infinity we hope that $f_{n,l}$ will converge to $f_n$. Convergence will take place if the empirical risk converges to the expected risk uniformly in probability. The quantity $|I_{\rm emp}[f] - I[f]|$ is called the estimation error, and conditions for the estimation error to converge to zero uniformly in probability have been investigated by Vapnik and Chervonenkis (1971, 1991), Pollard (1984), Dudley (1987), and Haussler (1992). Under a variety of different hypotheses it is possible to prove that, with probability $1 - \delta$, a bound of this form is valid:
$$|I_{\rm emp}[f] - I[f]| \le \omega(l, n, \delta) \qquad \forall f \in H_n \qquad (2.6)$$
The specific form of $\omega$ depends on the setting of the problem, but, in general, we expect $\omega(l, n, \delta)$ to be a decreasing function of $l$. However, we also expect it to be an increasing function of $n$. The reason is that if the number of parameters is large then the expected risk is a very complex object, and more data will be needed to estimate it. Therefore, keeping the number of data points fixed and increasing the number of parameters will result, on the average, in a larger distance between the expected risk and the empirical risk. The approximation and estimation errors are clearly two components of the generalization error, and it is interesting to notice, as shown in the next statement and represented in Figure 1, that the generalization error can be bounded by a linear combination of the two:

Statement 2.1. The following inequality holds:
$$\|f_0 - f_{n,l}\|^2_{L^2(P)} \le \varepsilon(n) + 2\,\omega(l, n, \delta) \qquad (2.7)$$
Proof. Using the decomposition of the expected risk (2.1), the generalization error can be written as
$$\|f_0 - f_{n,l}\|^2_{L^2(P)} = E[(f_0 - f_{n,l})^2] = I[f_{n,l}] - I[f_0] \qquad (2.8)$$
Figure 1: The outermost circle represents the concept class $\mathcal{F}$. Embedded in this are the nested approximating subsets $H_n$ (hypothesis classes). The target function $f_0$ is an element of $\mathcal{F}$, $f_n$ is the closest element of $H_n$ to $f_0$, and $f_{n,l}$ is the element of $H_n$ which the learner hypothesizes on the basis of the data. The arrow with the question mark represents the generalization error, and the other two arrows represent the approximation and estimation errors.
A natural way of bounding the generalization error is as follows:
$$I[f_{n,l}] - I[f_0] = \big(I[f_n] - I[f_0]\big) + \big(I[f_{n,l}] - I[f_n]\big)$$
In the first term of the right-hand side we recognize the approximation error (equation 2.5). If a bound of the form (equation 2.6) is known for the estimation error, it is simple to show (see Fig. 2) that the second term can be bounded as
$$|I[f_n] - I[f_{n,l}]| \le 2\,\omega(l, n, \delta)$$
and statement 2.1 follows.
Figure 2: This picture represents the fact that $I[f_n] \le I[f_{n,l}]$ and that $|I[f] - I_{\rm emp}[f]| \le \omega$ for all $f \in H_n$. Notice that if the distance between $I[f_n]$ and $I[f_{n,l}]$ were larger than $2\omega$, the condition $I_{\rm emp}[f_{n,l}] \le I_{\rm emp}[f_n]$ would be violated, and therefore we must have $|I[f_n] - I[f_{n,l}]| \le 2\omega$.
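The inequality pictured in Figure 2 can be written out as a short chain of estimates: since $f_{n,l}$ minimizes the empirical risk over $H_n$ we have $I_{\rm emp}[f_{n,l}] \le I_{\rm emp}[f_n]$, and each of the two remaining gaps is at most $\omega$ by the uniform bound of equation 2.6:

```latex
\begin{aligned}
I[f_{n,l}] - I[f_n]
  &= \big(I[f_{n,l}] - I_{\mathrm{emp}}[f_{n,l}]\big)
   + \big(I_{\mathrm{emp}}[f_{n,l}] - I_{\mathrm{emp}}[f_n]\big)
   + \big(I_{\mathrm{emp}}[f_n] - I[f_n]\big) \\
  &\le \omega(l,n,\delta) + 0 + \omega(l,n,\delta)
   = 2\,\omega(l,n,\delta).
\end{aligned}
```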
2.5.1 A Note on Models and Model Complexity. From the form of equation 2.7 the reader will realize that there is a trade-off between $n$ and $l$ for a certain generalization error. For a fixed $l$, as $n$ increases, the approximation error $\varepsilon(n)$ decreases but the estimation error $\omega(l, n, \delta)$ increases. Consequently, there is a certain $n$ that might optimally balance this trade-off. Note that the classes $H_n$ can be looked upon as models of increasing complexity, and the search for an optimal $n$ amounts to a search for the right model complexity. One typically wishes to match the model complexity with the sample complexity (measured by how much data we have), and this problem is well studied (Rissanen 1989; Barron and Cover 1991; Efron 1982; Craven and Wahba 1979) in statistics. Broadly speaking, simple models have high approximation errors but small estimation errors, while complex models have low approximation errors but high estimation errors. This trade-off is also embodied in the so-called bias-variance dilemma as described in Geman et al. (1992). So far we have provided a very general characterization of this problem, without stating what the sets $\mathcal{F}$ and $H_n$ are. In the next section we will consider a specific choice for these sets, and we will provide a bound on the generalization error of the form of equation 2.7.

3 Stating the Problem for Radial Basis Functions
In this article we focus our attention on a radial basis functions approximation scheme. This is a hypothesis class defined as follows:
$$f(x) = \sum_{i=1}^{n} \beta_i\, G\!\left(\frac{\|x - t_i\|}{\sigma_i}\right) \qquad (3.1)$$
where $G$ is a gaussian function and the $\beta_i$, $t_i$, and $\sigma_i$ are free parameters. We would like to understand what classes of problems can be solved "well" by this technique, where "well" means that both the approximation and the estimation bounds need to be favorable. It is possible to show that a favorable approximation bound can be obtained if we assume that the concept class of functions $\mathcal{F}$ to which the regression function belongs is defined as follows:
$$\mathcal{F} = \{\, f \mid f = \lambda * G_m,\ m > k/2,\ |\lambda|_{\mathbb{R}^k} \le M \,\} \qquad (3.2)$$
Here $M$ is a positive number, $\lambda$ is a signed Radon measure on the Borel sets of $\mathbb{R}^k$, and $G_m$ is the Bessel-Macdonald kernel, i.e., the inverse Fourier transform of $\tilde{G}_m(s) = (1 + \|s\|^2)^{-m/2}$. The symbol $*$ stands for the convolution operation, and $|\lambda|_{\mathbb{R}^k}$ is the total variation of the measure $\lambda$. The space $\mathcal{F}$ as defined in equation 3.2 is the Bessel potential space of order $m$, $L^{m,1}$. If $m$ is even, this contains the Sobolev space $H^{m,1}$ of functions whose derivatives up to order $m$ are integrable (Stein 1970). To obtain an estimation bound we need the approximating class to have bounded variation, and we impose the constraint $\sum_{i=1}^{n} |\beta_i| \le M$. This constraint does not affect the approximation bound, and the two pieces fit together nicely. Thus the set $H_n$ is defined now as the set of functions belonging to $L^2$ such that
$$f(x) = \sum_{i=1}^{n} \beta_i\, G\!\left(\frac{\|x - t_i\|}{\sigma_i}\right), \qquad \sum_{i=1}^{n} |\beta_i| \le M, \quad t_i \in \mathbb{R}^k, \ \sigma_i \in \mathbb{R}_+, \ \forall i = 1, \ldots, n \qquad (3.3)$$
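A network of the form of equation 3.3 is straightforward to evaluate. The sketch below (an illustration, not code from the paper) assumes the gaussian $G(r) = e^{-r^2}$ and respects the bounded-variation constraint $\sum_i |\beta_i| \le M$ by construction:

```python
import numpy as np

def rbf_net(x, beta, t, sigma):
    """Evaluate f(x) = sum_i beta_i * G(||x - t_i|| / sigma_i) with a
    gaussian G(r) = exp(-r**2), as in equation 3.3.  x has shape (k,),
    t has shape (n, k), beta and sigma have shape (n,)."""
    r = np.linalg.norm(x - t, axis=1) / sigma
    return np.sum(beta * np.exp(-r ** 2))

# A tiny network with n = 3 units in k = 2 dimensions.  The constraint
# sum_i |beta_i| <= M of H_n holds by construction (here M = 1).
beta = np.array([0.5, -0.3, 0.2])          # sum of |beta_i| = 1.0
t = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
sigma = np.array([1.0, 0.5, 2.0])

print(rbf_net(np.array([0.0, 0.0]), beta, t, sigma))
```

The free parameters of $H_n$ are exactly the $3n$ numbers $\beta_i$, $t_i$, $\sigma_i$ (with $t_i$ contributing $k$ coordinates each), which is why the number of parameters is proportional to $n$.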
Having defined the sets $H_n$ and $\mathcal{F}$, we remind the reader that our goal is to recover the regression function, that is, the minimum of the expected risk over $\mathcal{F}$. What we end up doing is to draw a set of $l$ examples and to minimize the empirical risk $I_{\rm emp}$ over the set $H_n$, that is, to solve the following nonconvex minimization problem:
$$f_{n,l} = \arg\min_{f \in H_n} \frac{1}{l} \sum_{i=1}^{l} (y_i - f(x_i))^2 \qquad (3.4)$$
Assuming now that we have been able to solve the minimization problem of equation 3.4, the main question we are interested in is "how far is $f_{n,l}$ from $f_0$?". We give an answer in the next section.

3.1 Main Result. Our main theorem is now stated in a PAC-like formulation:
Theorem 3.1. Let $H_n$ be the class of gaussian RBF networks with $k$ input nodes and $n$ hidden nodes as defined in equation 3.3, and let $f_0$ be an element of the Bessel
potential space $L^{m,1}(\mathbb{R}^k)$ of order $m$ with $m > k/2$ (the class $\mathcal{F}$ defined in equation 3.2). Assume that a data set $\{(x_i, y_i)\}_{i=1}^{l}$ has been obtained by randomly sampling the function $f_0$ in the presence of noise, and that the noise distribution has compact support. Then, for any $0 < \delta < 1$, with probability greater than $1 - \delta$, the following bound for the generalization error holds:
$$\|f_0 - f_{n,l}\|^2_{L^2(P)} \le O\!\left(\frac{1}{n}\right) + O\!\left(\left[\frac{nk \ln(nl) - \ln \delta}{l}\right]^{1/2}\right) \qquad (3.5)$$
This theorem is proved by decomposing the total generalization error into an approximation component and an estimation component, as in equation 2.7. The bound for the approximation error (the first term in the equation above) can be found in Girosi (1994) and Girosi and Anzellotti (1993), and it is a consequence of the Maurey-Jones-Barron lemma (Jones 1992; Barron 1993). The bound for the estimation error (the second term) has been obtained using ideas from the uniform convergence of empirical estimates to their means (Vapnik 1982). In particular, we have used notions of metric entropy (Pollard 1984) to bound the complexity of the class $H_n$. The full proof of this theorem is not reported here because of its length, and can be found in Niyogi and Girosi (1994).

4 Implications of the Theorem in Practice: Putting in the Numbers
In Figure 3 we show the bound on the generalization error presented in the previous section as a function of the number of examples ($l$) and the number of basis functions ($n$). A number of remarks about this figure are in order.

4.1 Rate of Growth of $n$ for Guaranteed Convergence. From Theorem 3.1 we see that the generalization error converges to zero only if $n$ goes to infinity more slowly than $l$. In fact, if $n$ grows too quickly the estimation error $\omega(l, n, \delta)$ will diverge, because it grows with $n$. Setting $n = l^r$, we obtain
$$\lim_{l \to \infty} \omega(l, n, \delta) \propto \lim_{l \to \infty} \left[\, l^{\,r-1} \ln l \,\right]^{1/2}$$
Therefore the condition $r < 1$ should hold to guarantee convergence to zero.
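The condition $r < 1$ can be checked numerically. A minimal sketch (with the illustrative constants $k = 5$ and $\delta = 0.01$ used in Figure 3, and assuming the form of the estimation term of equation 3.5):

```python
import numpy as np

# With omega(l, n, delta) ~ sqrt((n*k*ln(n*l) - ln(delta)) / l) and
# n = l**r, the estimation error vanishes for r < 1 and diverges for r > 1.
k, delta = 5, 0.01

def omega(l, n):
    return np.sqrt((n * k * np.log(n * l) - np.log(delta)) / l)

for r in (0.5, 1.5):
    print("r =", r, [round(omega(l, l**r), 3) for l in (10**2, 10**4, 10**6)])
```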
4.2 Optimal Choice of $n$. In the previous section we made the point that the number of parameters $n$ should grow more slowly than the number of data points $l$, to guarantee the consistency of the estimator $f_{n,l}$. It is quite
Figure 3: The bound on the generalization error derived in Theorem 3.1, plotted as a function of the number of examples ($l$) and the number of basis functions ($n$). The bound has the form $a/n + b[(nk \ln(nl) - \ln \delta)/l]^{1/2}$, and in this picture the parameters have the values $a = 0.01$, $b = 0.0006$, $k = 5$, and $\delta = 0.01$. For $l = 100$ we show $n^*$, the critical number of nodes after which overfitting occurs.

clear that there is an optimal rate of growth of the number of parameters that, for any fixed number of data points $l$, gives the best possible performance with the least number of parameters. In other words, for any fixed $l$ there is an optimal number of parameters $n^*(l)$ that minimizes the generalization error. That such a number should exist is quite intuitive: for a fixed number of data, a small number of parameters will give a low estimation error $\omega(l, n, \delta)$, but a very high approximation error $\varepsilon(n)$, and therefore the generalization error will be high. If the number of parameters is very high, the approximation error $\varepsilon(n)$ will be very small, but the estimation error $\omega(l, n, \delta)$ will be high, leading to a large generalization error again. Therefore, somewhere in between there should be a number of parameters high enough to make the approximation error small, but
Figure 4: The bound (3.5) on the generalization error is here plotted as a function of the number of basis functions $n$, for different numbers of data points ($l = 50, 100, 300$). The parameters are the same as in Figure 3. Notice how the minima $n^*(l)$ of these curves move as $l$ increases. Note also that the minima are broader for larger $l$, suggesting that an accurate choice of $n$ is less critical when plenty of data is available.
not too high, so that these parameters can be estimated reliably, with a small estimation error. Although the exact form of the generalization error is unknown, we can work with the upper bound of equation 3.5, which we plot in Figure 4 as a function of the number of parameters $n$ for various choices of the sample size $l$. Notice that for a fixed sample size, the error passes through a minimum. Notice also that the location of the minimum shifts to the right when the sample size is increased. To find out exactly what the optimal rate of growth of the network size is, we simply find the minimum of the generalization error as a function of $n$, keeping the sample size $l$ fixed. Therefore we have to solve
the equation
$$\frac{\partial}{\partial n}\left[\,\varepsilon(n) + 2\,\omega(l, n, \delta)\,\right] = 0$$
for $n$ as a function of $l$. Substituting the bound given in Theorem 3.1 into the previous equation, and ignoring logarithmic factors, we obtain that the optimal number of parameters $n^*(l)$ for a given number of examples $l$ behaves as
$$n^*(l) \propto l^{1/3} \qquad (4.1)$$
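This rate can be illustrated numerically. The sketch below (an illustration, using the parameter values of Figure 3, not code from the paper) minimizes the bound over $n$ for several sample sizes and exhibits the slow growth of $n^*(l)$:

```python
import numpy as np

def bound(n, l, a=0.01, b=0.0006, k=5, delta=0.01):
    """The bound a/n + b*sqrt((n*k*ln(n*l) - ln(delta))/l) of equation 3.5,
    with the illustrative parameter values used in Figure 3."""
    return a / n + b * np.sqrt((n * k * np.log(n * l) - np.log(delta)) / l)

def n_star(l):
    """Number of basis functions minimizing the bound for l examples."""
    ns = np.arange(1, 5000)
    return int(ns[np.argmin(bound(ns, l))])

for l in (100, 1000, 10000, 100000):
    print(l, n_star(l))
```

Ignoring the logarithm, differentiating $a/n + b\sqrt{nk/l}$ and setting the derivative to zero gives $n^* = (2a/b)^{2/3}(l/k)^{1/3}$, consistent with equation 4.1.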
While a fixed sample size suggests the scheme above for choosing an optimal network size, it is important to note that for a certain confidence rate ($\delta$) and for a fixed error bound, there are various choices of $n$ and $l$ that are satisfactory. Figure 5 shows $n$ as a function of $l$, in other words the $(n, l)$ pairs that yield the same error bound ($\mathcal{E}$) with the same confidence. For any fixed error bound, the region to the right of the minimum is uninteresting because it uses more parameters and data than needed. The narrow region between the minimum and the asymptote is more interesting: if network size is very expensive, fewer parameters can be used at the expense of many more data points. Notice, however, how narrow this region is and how quickly the curve goes to infinity: a very large number of data points is needed to compensate for slightly fewer parameters.

4.3 Remarks. In this section we suggest future work and make connections with other related research.
4.3.1 Extensions. 1. While we have obtained an upper bound on the error in terms of the number of nodes and examples, it would be worthwhile to obtain lower bounds on the same. Such lower bounds do not seem to exist in the neural network literature, to the best of our knowledge. 2. We have considered here a situation where the estimated network, i.e., $f_{n,l}$, is obtained by minimizing the empirical risk over the class of functions $H_n$. Very often, the estimated network is obtained by minimizing a somewhat different objective function, which consists of two parts. One is the fit to the data and the other is some complexity term that favors less complex (according to the defined notion of complexity) functions over more complex ones. For example, the regularization approach (Tikhonov 1963; Poggio and Girosi 1990; Wahba 1990) minimizes a cost function of the form
$$H[f] = \sum_{i=1}^{l} (y_i - f(x_i))^2 + \lambda\, \Phi[f]$$
Figure 5: This figure shows the various choices of $(n, l)$ that give the same bound $\mathcal{E}$ (3.5) on the generalization error. The three curves correspond to the following three values of $\mathcal{E}$: 0.003, 0.004, 0.005, as indicated on the graph. The interesting observation is that there are an infinite number of choices for the number of basis functions and the number of data points, all of which would guarantee the same bound on the generalization error. If $(n^*, l^*)$ are the coordinates of the minimum of this curve, $l^*$ is the minimum number of points necessary to achieve the error bound $\mathcal{E}$ with the optimal number of parameters $n^*$. The asymptote of the curve corresponds to the case in which $l \to \infty$ and the estimation error is zero.
over the class $H = \bigcup_{n \ge 1} H_n$. Here $\lambda$ is the so-called "regularization parameter" and $\Phi[f]$ is a functional that measures the smoothness of the functions involved. The choice of an optimal $\lambda$ is an interesting question in regularization techniques, and typically cross-validation or other heuristic schemes are used. 3. Structural risk minimization (Vapnik 1982) is another method to achieve a trade-off between network complexity (corresponding to $n$ in our case) and fit to the data. However, it does not guarantee that the architecture selected will be the one with minimal parameterization. In fact, it would be of some interest to develop a sequential growing scheme. Such a technique would at any stage perform a sequential hypothesis test. It would then decide whether to ask for more data, add one more node, or simply stop and output the function it has as its $\epsilon$-good hypothesis. In such a process, one might even incorporate active learning (Angluin
1988; Niyogi 1995) so that if the algorithm asks for more data, then it might even specify a region in the input domain from where it would like to see these data. 4. It should be noted here that we have assumed that the empirical risk $\frac{1}{l}\sum_{i=1}^{l} [y_i - f(x_i)]^2$ can be minimized over the class $H_n$, and the function $f_{n,l}$ effectively computed. While this might be fine in principle, in practice only a locally optimal solution to the minimization problem is found (typically using some gradient descent scheme). The computational complexity of obtaining even an approximate solution to the minimization problem is an interesting question, and results from computer science (Judd 1988; Blum and Rivest 1988) suggest that it might in general be NP-hard.
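A minimal sketch of point 4 (not the authors' code; the toy target, learning rate, and parameterization are illustrative assumptions): minimizing the empirical risk of a gaussian RBF network by gradient descent on $\beta$, $t$, and $\sigma$ in $k = 1$ dimension. Only a local minimum is guaranteed.

```python
import numpy as np

rng = np.random.default_rng(0)

l, n, steps, lr = 50, 5, 2000, 0.05
x = rng.uniform(-1.0, 1.0, l)
y = np.sin(np.pi * x)                        # noiseless toy target

beta = rng.normal(0.0, 0.1, n)
t = rng.uniform(-1.0, 1.0, n)
log_sigma = np.zeros(n)                      # optimize log(sigma) so sigma > 0

def forward(x):
    """Units g_ji = exp(-(x_j - t_i)^2 / sigma_i^2) and outputs g @ beta."""
    g = np.exp(-((x[:, None] - t[None, :]) ** 2) / np.exp(log_sigma) ** 2)
    return g, g @ beta

def risk():
    return np.mean((forward(x)[1] - y) ** 2)

initial_risk = risk()
for _ in range(steps):
    g, pred = forward(x)
    r = pred - y                             # residuals, shape (l,)
    d = x[:, None] - t[None, :]              # shape (l, n)
    s2 = np.exp(log_sigma) ** 2
    # Gradients of the empirical risk with respect to each parameter group.
    gb = 2 * (r @ g) / l
    gt = 2 * ((r[:, None] * g) * beta * 2 * d / s2).sum(0) / l
    gs = 2 * ((r[:, None] * g) * beta * 2 * d ** 2 / s2).sum(0) / l
    beta -= lr * gb
    t -= lr * gt
    log_sigma -= lr * gs

print(initial_risk, risk())
```

Different random initializations end in different local minima, which is why the experiments in Section 4.4 repeat each training session several times.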
We have chosen to combine theorems from approximation theory [which gives us the O(l/n) term in the rate] and uniform convergence theory (which gives us the other part). Note that at this moment, our rate of convergence is worse than Barron’s. In particular, he obtains a rate of convergence of O{l/n + [nkln(l)]/l}. Further, he has a different set of assumptions on the class of functions (corresponding to our F).Finally, the approximation scheme is a class of networks with sigmoidal units as opposed to radialbasis units and a different proof technique is used. 3. It is worthwhile to refer to the article of Geman et al. (1992) in this journal, which discusses the Bias-Variance dilemma. Using our noand the tation the integrated square bias is defined as B = /If0 - ED,f,1]’ - f , , . ~ )where ~ ] , ED, stands for the integrated variance is V = ED,[(ED,I~,,,~] expected value over all the possible data sets of size 1. Geman et al. (1992) show that the generalization error averaged over Dl can be decomposed as B + V. They show that as the number of parameters increases, the bias
of the estimator decreases and the variance increases for a fixed size of the data set. From an intuitive point of view, the bias $B$ plays the role of the approximation error $\|f_0 - f_n\|^2$, although their relationship is not clear. In fact, the average estimator $E_{D_l}[f_{n,l}]$ differs from $f_n$, and need not even belong to $H_n$. The variance $V$ is related to the average estimation error, and it can be shown that both of them are bounded by the quantity $E_{D_l}\|f_n - f_{n,l}\|^2$. Finding the right bias-variance trade-off is very similar in spirit to finding the trade-off between network complexity and data complexity. 4. Given the class of radial basis functions we are using, a natural comparison arises with kernel regression (Krzyzak 1986; Devroye 1981) and results on the convergence of kernel estimators. It should be pointed out that, unlike our scheme, gaussian-kernel regressors require the variance of the gaussian to go to zero as a function of the data. Further, the number of kernels is always equal to the number of data points, and the issue of the trade-off between the two is not explored to the same degree. 5. In our statement of the problem, we discussed how pattern classification could be treated as a special case of regression. In this case the function $f_0$ corresponds to the Bayes a posteriori decision function. Researchers (Richard and Lippman 1991; Hampshire and Pearlmutter 1990; Gish 1990) in the neural network community have observed that a network trained on a least square error criterion and used for pattern classification was in effect computing the Bayes decision function. This paper provides a rigorous proof of the conditions under which this is the case.

4.4 Empirical Results. The main thrust of this paper is to provide some insight into how overfitting can be studied in classes of feedforward networks and the general laws that govern overfitting phenomena in such networks.
How closely do "real" function learning problems obey the general principles embodied in the theorem described earlier? We do not attempt to provide an extensive answer to this question, but, to satisfy the reader's curiosity, we now describe some empirical results.
4.4.1 The Experiment. The target function, a k-dimensional function, was assumed to have the following form, which ensures that the assumptions of Theorem 3.1 are satisfied:
Here $\Sigma$ is a diagonal matrix whose entries are set by the $\sigma_i$. The parameters $\{\sigma_i, w_i, c_i\}$ were chosen at random in the following ranges: $\sigma_i \in [1.7, 2.3]$, $w_i \in [-2, 2]^k$, $c_i \in [-1, 1]$, a further parameter in $[0, \pi]$, and $N \in [3, 20]$. Training sets of different sizes, ranging from $l = 30$ to $l = 500$, were randomly generated in the $k$-dimensional cube $[-\pi, \pi]^k$, and an independent test set of 2000 examples
Figure 6: The generalization error is plotted as a function of the number of nodes of an RBF network (3.1) trained on 100 data points of a function of the type (4.2) in 2 dimensions. For each number of parameters, 10 results, corresponding to 10 different local minima, are reported. The continuous lines above the experimental data represent the bound $a/n + b[(nk \ln(nl) - \ln \delta)/l]^{1/2}$ of equation 3.5, in which the parameters $a$ and $b$ have been estimated empirically, and $\delta = 10^{-2}$.
was chosen to estimate the generalization error. Gaussian RBF networks (as in Theorem 3.1) with different numbers of hidden units, ranging from $n = 1$ to $n = 300$, were trained using a gradient descent scheme. Each training session was repeated 10 times with random initialization, because of the problem of local minima. We did experiments in 2, 4, 6, and 8 dimensions. In all cases the qualitative behavior of the experimental results followed the theoretical predictions. In Figures 6 and 7 we report the experimental results for a two- and a six-dimensional case, respectively. We found, in general, that although overfitting occurs as expected, it has a tendency to occur at a larger number of parameters than expected. We attribute this to the presence of local minima, which have the effect of restricting the hypothesis space, suggesting that the "effective" number of parameters (Moody 1991) is much smaller than the total number of parameters. We believe that extensive experimentation is needed to compare the deviation between theory and practice, and the problem of local minima
Figure 7: Everything is as in Figure 6, but here the dimensionality is 6 and the number of data points is 150. As before, the parameters $a$ and $b$ have been estimated empirically, and $\delta = 10^{-2}$. Notice that this time the curve passes through some of the data points. However, we recall that the bound indicated by the curve holds under the assumption that the global minimum has been found, and that the data points represent different local minima. Clearly, in the figure the curve bounds the best of the local minima.
should be seriously addressed. This is well beyond the scope of the current article, and further research on the matter is planned. 5 Conclusion
For the task of learning some unknown function from labeled examples, where we have multiple hypothesis classes of varying complexity, choosing the class of the right complexity and the appropriate hypothesis within that class poses an interesting problem. We have provided an analysis of the situation and the issues involved, and in particular we have tried to show how the hypothesis complexity, the sample complexity, and the generalization error are related. We proved a theorem for a special set of hypothesis classes, the radial basis function networks, and we bounded the generalization error for certain function learning tasks in terms of the number of parameters and the number of examples. This is equivalent to obtaining a bound on the rate at which the number of parameters
Radial Basis Functions
839
must grow with respect to the number of examples for convergence to take place. Thus we use richer and richer hypothesis spaces as more and more data become available. We also see that there is a trade-off between hypothesis complexity and generalization error for a certain fixed amount of data, and our result allows us a principled way of choosing an appropriate hypothesis complexity (network architecture). The choice of an appropriate model for empirical data is a problem of long-standing interest in statistics, and we provide connections between our work and other work in the field.

Appendix: A Useful Decomposition of the Expected Risk

We now show that the regression function defined in equation 2.2 minimizes the expected risk $I[f]$. By adding and subtracting the regression function $f_0$, we see that
$$I[f] = E[(y - f_0(x))^2] + E[(f_0(x) - f(x))^2] + 2\,E[(y - f_0(x))(f_0(x) - f(x))]$$
By definition of the regression function $f_0(x)$, the cross-product in the last equation is easily seen to be zero, and therefore
$$I[f] = E[(f(x) - f_0(x))^2] + E[(y - f_0(x))^2] \qquad (2.1)$$
Clearly, the minimum of $I[f]$ is achieved when the first term is minimum, that is, when $f(x) = f_0(x)$. In the case in which the data come from randomly sampling a function $f$ in the presence of additive noise, $I[f_0] = \sigma^2$, where $\sigma^2$ is the variance of the noise. When data are noisy, therefore, even in the most favorable case we cannot expect the expected risk to be smaller than the variance of the noise.

Acknowledgments

We are grateful to V. Vapnik, T. Poggio, and B. Caprile for useful discussions and suggestions. We also wish to thank N. T. Chan for kindly providing the code for the numerical simulations.

References

Angluin, D. 1988. Queries and concept learning. Mach. Learn. 2, 319-342.
Received September 15, 1995; accepted November 2, 1995.
Communicated by Andreas Weigend and Chris Bishop
Using Neural Networks to Model Conditional Multivariate Densities Peter M. Williams School of Cognitive and Computing Sciences, University of Sussex, Falmer, Brighton BN1 9QH, England
Neural network outputs are interpreted as parameters of statistical distributions. This allows us to fit conditional distributions in which the parameters depend on the inputs to the network. We exploit this in modeling multivariate data, including the univariate case, in which there may be input-dependent (e.g., time-dependent) correlations between output components. This provides a novel way of modeling conditional correlation that extends existing techniques for determining input-dependent (local) error bars. 1 Introduction
Neural networks provide a way of modeling the statistical relationship between an independent variable X and a dependent variable Y. For example, X could be financial data up to a certain time and Y could be a future stock index, exchange rate, option price, etc. Alternatively, X could represent geophysical features of a prospect and Y could represent mineralization at a certain depth. In general, X and Y can be vectors of continuous or discrete quantities. Suppose that the conditional distribution of Y belongs to a family of distributions characterized by a finite set of parameters that are functions of conditioning values of X. These functions, which in general will be nonlinear, can then be modeled by a neural network. For discrete distributions this approach is exemplified in the softmax rule (Bridle 1990). The use of network outputs to set the parameters of a density model forms the basis of the competing local experts model of Jacobs et al. (1991). The idea of using neural networks to return the complete conditional density of the output is also found in Ghahramani and Jordan (1994), for example. Bishop (1994) gives a systematic exposition of this approach, in particular for the case of gaussian mixtures. The case of a single kernel is treated independently by Weigend and Nix (1994) and Nix and Weigend (1995), where the output from an auxiliary variance unit is used to set local time-dependent error bars for time-series predictions. The purpose of the present paper is to extend these techniques to the Neural Computation 8, 843-854 (1996)
@ 1996 Massachusetts Institute of Technology
844
Peter M. Williams
case of multivariate data where the conditional covariance matrix may be nondiagonal. 2 Multivariate Data
The conditional distribution of the n-dimensional quantity Y given X = x is assumed to be described by the multivariate gaussian density

\[ p(y \mid x) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2}(y-\mu)^T \Sigma^{-1} (y-\mu) \right\} \]  (2.1)
where μ(x) is the vector of conditional means and Σ(x) is the conditional covariance matrix. Both μ and Σ are understood to be functions of x in a way that depends on the outputs of a neural network when the conditioning vector x is given as input. It is assumed that the network has linear output units and that μ and Σ are determined by the activations of these units. We now discuss the link between network outputs and the components of μ and Σ. The mean presents no problem. The network will be required to have n output units whose activations, {z_i^μ} say, are related to the n components of μ by

\[ \mu_i = z_i^\mu, \qquad i = 1, \ldots, n \]  (2.2)
These units compute the components of the mean directly. It is less obvious how to represent the covariance matrix. Being symmetric, Σ has at most n(n+1)/2 independent entries, but it must also be positive definite.¹ The problem is to parameterize the class of symmetric positive definite matrices in such a way that (1) the parameters can freely assume any real values, (2) the determinant is a simple expression of the parameters, and (3) the correspondence is bijective. To solve this problem we recall the Cholesky factorization of a symmetric positive definite matrix as AᵀA, where A is upper triangular with strictly positive diagonal elements. The square root of the determinant of AᵀA is the product of the diagonal elements of A. Conversely, if A is any upper triangular matrix with strictly positive diagonal entries, AᵀA is symmetric positive definite and the correspondence is unique.² Applying this factorization to the inverse covariance matrix when n = 4, for example, gives

\[ \Sigma^{-1} = A^T A, \qquad A = \begin{pmatrix} \alpha_{11} & \alpha_{12} & \alpha_{13} & \alpha_{14} \\ 0 & \alpha_{22} & \alpha_{23} & \alpha_{24} \\ 0 & 0 & \alpha_{33} & \alpha_{34} \\ 0 & 0 & 0 & \alpha_{44} \end{pmatrix} \]

¹We restrict to the proper case where Σ is invertible. ²The diagonal entries of A are the square roots of the pivots under gaussian elimination (Horn and Johnson 1985; Golub and Van Loan 1989). Note that every positive definite matrix is invertible, the inverse of a positive definite matrix is positive definite, and every symmetric positive definite matrix is the covariance matrix of some multivariate gaussian.
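The Cholesky property invoked here can be checked numerically. The sketch below (with an arbitrary positive definite matrix M built purely for illustration) verifies that M = AᵀA for an upper triangular A with positive diagonal, and that the square root of det M is the product of A's diagonal entries:

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.normal(size=(4, 4))
M = B @ B.T + 4 * np.eye(4)   # an arbitrary symmetric positive definite matrix

# numpy returns the lower triangular factor L with M = L L^T,
# so A = L^T is the upper triangular factor with M = A^T A.
L = np.linalg.cholesky(M)
A = L.T

assert np.allclose(A.T @ A, M)            # factorization holds
assert np.all(np.diag(A) > 0)             # strictly positive diagonal
# sqrt(det M) equals the product of the diagonal of A
assert np.isclose(np.prod(np.diag(A)), np.sqrt(np.linalg.det(M)))
```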
Conditional Multivariate Densities
845
with

\[ |\Sigma|^{-1/2} = \alpha_{11}\,\alpha_{22}\,\alpha_{33}\,\alpha_{44} \]

To represent the matrix A we stipulate that the network is provided with an additional set of dispersion output units whose activations {z_i^σ} and {z_ij^σ} are related to the elements of A by

\[ \alpha_{ii} = \exp z_i^\sigma, \qquad i = 1, \ldots, n \]  (2.3)
\[ \alpha_{ij} = z_{ij}^\sigma, \qquad i = 1, \ldots, n-1, \; j = 2, \ldots, n, \; i < j \]  (2.4)
In this way n network outputs are needed for the mean (equation 2.2), another n for the positive diagonal entries (equation 2.3), and n(n−1)/2 for the off-diagonal entries (equation 2.4), making n(n+3)/2 in all.³ Note that Σ can be recovered by inverting Σ⁻¹, which is easy to compute now that Σ⁻¹ is known as the product AᵀA of lower and upper triangular matrices (Press et al. 1992, Ch. 2). Note also that if X is a vector of independent standard normal deviates, then Y = A⁻¹X is gaussian with covariance matrix Σ. This can be used for generating efficient Monte Carlo simulations. The use of the exponential in equation 2.3 ensures that the diagonal entries are always positive, so that every possible network output vector corresponds to a unique multivariate gaussian. When all network outputs vanish, for example, the corresponding gaussian has zero mean and unit covariance matrix. This representation also prevents variances going to zero, especially in the presence of suitable regularization. The particular choice of the exponential can be distinguished in a Bayesian framework by consideration of uninformative priors for scale parameters, assuming the network outputs z_i^σ have uniform distributions (Nowlan and Hinton 1992b). 3 Likelihood
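As a concrete sketch of this parameterization (the activation values below are arbitrary illustrations, not outputs of a trained network), one can build A from n + n(n−1)/2 numbers, recover Σ, and sample via Y = A⁻¹X:

```python
import numpy as np

n = 4
# Hypothetical dispersion-output activations: n diagonal activations and
# n(n-1)/2 off-diagonal activations (illustrative values only).
z_diag = np.array([0.1, -0.2, 0.0, 0.3])
z_off = np.linspace(-0.5, 0.5, n * (n - 1) // 2)

# Build the upper-triangular factor A (equations 2.3 and 2.4).
A = np.zeros((n, n))
A[np.diag_indices(n)] = np.exp(z_diag)   # strictly positive diagonal
A[np.triu_indices(n, k=1)] = z_off

Sigma_inv = A.T @ A                      # symmetric positive definite
Sigma = np.linalg.inv(Sigma_inv)

# |Sigma|^{-1/2} is just the product of the diagonal of A.
assert np.isclose(np.prod(np.diag(A)), np.linalg.det(Sigma) ** -0.5)

# Sampling: if X is a vector of standard normal deviates,
# Y = A^{-1} X has covariance Sigma.
rng = np.random.default_rng(0)
X = rng.normal(size=(n, 200_000))
Y = np.linalg.solve(A, X)
assert np.max(np.abs(np.cov(Y) - Sigma)) < 0.05   # Monte Carlo agreement
```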
Suppose N pairs of corresponding observations {(x_p, y_p) : p = 1, ..., N} have been made on X and Y. The negative conditional log likelihood of the data is assumed to factorize as

\[ E = \sum_{p=1}^{N} E_p \]  (3.1)

where from equation 2.1 the negative log likelihood of an individual observation is

\[ E_p = \tfrac{1}{2} \log|\Sigma_p| + \tfrac{1}{2} (y_p - \mu_p)^T \Sigma_p^{-1} (y_p - \mu_p) \]  (3.2)
3Network output activations are likely to be stored in a one-dimensional structure for most implementations. It is left to the reader how to manage the two-dimensional indexing.
apart from a constant.⁴ Maximum likelihood estimation would seek network weights w that minimize E. Whatever form of estimation is used, with or without some form of regularization, the gradient of equation 3.1 with respect to network weights is of interest. Concentrating on equation 3.2 and omitting the subscript p, we define δ = y − μ and ε = Aδ. Since ½ log|Σ| = −∑ᵢ log α_ii = −∑ᵢ z_i^σ, the negative log likelihood for an individual observation is then

\[ E = -\sum_{i=1}^{n} z_i^\sigma + \tfrac{1}{2}\,\varepsilon^T \varepsilon \]

and partial derivatives with respect to network outputs are easily seen to be

\[ \frac{\partial E}{\partial z_i^\mu} = -(A^T \varepsilon)_i, \qquad i = 1, \ldots, n \]
\[ \frac{\partial E}{\partial z_i^\sigma} = \alpha_{ii}\,\varepsilon_i\,\delta_i - 1, \qquad i = 1, \ldots, n \]
\[ \frac{\partial E}{\partial z_{ij}^\sigma} = \varepsilon_i\,\delta_j, \qquad i = 1, \ldots, n-1, \; j = 2, \ldots, n, \; i < j \]
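The partial derivatives with respect to network outputs can be checked against central finite differences of E. The sketch below uses illustrative values and hypothetical helper names (none of this is the paper's code):

```python
import numpy as np

def neg_log_lik(z_mu, z_diag, z_off, y):
    """E = -sum_i z_i^sigma + 0.5 * eps^T eps, with eps = A (y - mu)."""
    n = len(z_mu)
    A = np.zeros((n, n))
    A[np.diag_indices(n)] = np.exp(z_diag)
    A[np.triu_indices(n, k=1)] = z_off
    eps = A @ (y - z_mu)                 # mu_i = z_i^mu (equation 2.2)
    return -np.sum(z_diag) + 0.5 * eps @ eps

n = 3
rng = np.random.default_rng(1)
z_mu = rng.normal(size=n)
z_diag = 0.3 * rng.normal(size=n)
z_off = rng.normal(size=n * (n - 1) // 2)
y = rng.normal(size=n)

# Analytic gradients, as derived above.
A = np.zeros((n, n))
A[np.diag_indices(n)] = np.exp(z_diag)
iu = np.triu_indices(n, k=1)
A[iu] = z_off
delta = y - z_mu
eps = A @ delta
g_mu = -(A.T @ eps)
g_diag = np.exp(z_diag) * eps * delta - 1.0
g_off = eps[iu[0]] * delta[iu[1]]

def num_grad(f, z, h=1e-6):
    """Central finite differences of f at z."""
    g = np.zeros_like(z)
    for i in range(len(z)):
        zp, zm = z.copy(), z.copy()
        zp[i] += h
        zm[i] -= h
        g[i] = (f(zp) - f(zm)) / (2 * h)
    return g

assert np.allclose(g_mu, num_grad(lambda z: neg_log_lik(z, z_diag, z_off, y), z_mu), atol=1e-5)
assert np.allclose(g_diag, num_grad(lambda z: neg_log_lik(z_mu, z, z_off, y), z_diag), atol=1e-5)
assert np.allclose(g_off, num_grad(lambda z: neg_log_lik(z_mu, z_diag, z, y), z_off), atol=1e-5)
```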
These expressions can be used with backpropagation to calculate ∇E with respect to network weights. 3.1 Regularization. Since neural networks are universal approximators, care is needed to match the complexity of the model to the information content of the data. Overfitting would take a particularly extreme form in the present case if the model were to fit a gaussian with arbitrarily small variance to one or more data points. It is therefore important to ensure appropriately limited variation over the training set of the modeled covariance matrix. This can be achieved by suitably limiting the number and sizes of the weights in the network. The general technique used below is described in Williams (1995), although other methods are also applicable (Nowlan and Hinton 1992a; Bishop 1993). 3.2 Constant Dispersion. It is interesting to consider the special case in which the network weights attached to the dispersion output units vanish. This would be appropriate if the noise distribution were constant over the whole training set. In that case one would expect an adequate regularizer to detect this feature of the dataset and set the dispersion ⁴It will not be investigated under what assumptions this factorization over observations is justified. It is sufficient that the observation pairs are jointly independent, but this is not necessary (see Section 4.2).
output weights to zero. However this case may arise, the activations {z_i^σ} and {z_ij^σ} are then independent of network inputs and determined just by the biases on the corresponding output units. It can then be shown that, at any local minimum of E as a function of weights and biases, the dispersion output biases must assume values such that the inverse of AᵀA is given by

\[ \Sigma = \frac{1}{N} \sum_{p=1}^{N} (y_p - \mu_p)(y_p - \mu_p)^T \]  (3.3)

where μ_p is the conditional mean for input x_p as computed by the network at this local minimum.⁵ Substituting Σ for each Σ_p in equations 3.1 and 3.2 leads to

\[ E = \tfrac{1}{2} N \log|\Sigma| + \text{constant} \]  (3.4)
as the expression for the negative log likelihood, permitting dispersion output units to be dispensed with. In the case of univariate data, or more generally of uncorrelated multivariate data, equation 3.4 can also be obtained by integrating out the diagonal elements of the covariance matrix using an uninformative prior (Buntine and Weigend 1991; Williams 1995). The present approach, however, is more flexible in allowing dispersion to vary over the input domain and, even in the case of constant dispersion for multivariate data, more efficient than tackling equation 3.4 directly. 4 Examples
We now illustrate these ideas applied first to synthetic data and then to empirically generated time series data. We begin with computer-generated data for which the generating distribution is known.

4.1 Synthetic Data. Weigend and Nix (1994) discuss univariate data (n = 1) drawn from normal distributions N(μ, σ) with means

\[ \mu(x) = \sin(2.5x)\,\sin(1.5x) \]

and variances

\[ \sigma^2(x) = 0.01 + 0.25\,[1 - \sin(2.5x)]^2 \]

⁵The proof follows the lines of the usual treatment of maximum likelihood estimators of parameters of multivariate normal distributions, together with their invariance under invertible reparameterizations (Anderson 1958). It should be noted, however, that equation 3.3 is only the maximum likelihood estimate of the covariance matrix if the μ_p are themselves maximum likelihood estimates of the means, which depends on the style of regularization in force. Nonetheless, since maximum likelihood estimators of variance are biased, this raises the possibility of bias in the present estimators. This issue, especially in the case of larger dimensional data and smaller samples, will be the subject of further investigation.
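Assuming the definitions above, this training set can be generated in a few lines (seed and array names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, np.pi, size=1000)            # 1000 examples on [0, pi]
mu = np.sin(2.5 * x) * np.sin(1.5 * x)            # conditional mean
var = 0.01 + 0.25 * (1.0 - np.sin(2.5 * x)) ** 2  # conditional variance
y = mu + np.sqrt(var) * rng.normal(size=x.shape)

# The variance floor 0.01 applies where sin(2.5x) = 1; the maximum of the
# bracketed term is 4, so the variance never exceeds 0.01 + 1.0.
assert var.min() >= 0.01 and var.max() <= 1.01
```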
Figure 1: Training set for the univariate case, showing the random distribution of training data around the mean μ(x) = sin(2.5x) sin(1.5x) with variance σ²(x) = 0.01 + 0.25[1 − sin(2.5x)]² for 0 ≤ x ≤ π.
One thousand training examples were generated using this example with x drawn randomly from a uniform distribution on [0, π]. The training set is shown in Figure 1. Results are shown in Figure 2. These were obtained using a simple fully connected 3-layer network with 1 input unit, 10 hidden units, and 2 output units. Networks were trained using the optimization and regularization algorithms of Williams (1991, 1995), which pruned the network to 6 hidden units with 23 remaining nonzero weights and biases. Weigend and Nix, in fact, propose a more complex architecture and training regime. This seems not to be needed by the present methods, which fit both first and second moments together and appear to give improved results.⁶
⁶To investigate variability between local minima, 20 similar networks were trained and the results averaged. For the mean this gives μ = ⟨μ_k⟩ and for the variance σ² = ⟨σ_k²⟩ + {⟨μ_k²⟩ − ⟨μ_k⟩²}, where μ_k(x) and σ_k²(x) are the mean and variance for the kth network, k = 1, ..., 20, and ⟨μ_k⟩ is the average of the means, etc. The results for μ(x) and σ(x) for the mixture are indistinguishable at this scale from those shown in Figure 2. This form of averaging corresponds to rudimentary integration of the predictive distribution over weight space (Buntine and Weigend 1991; Neal 1995).
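These mixture-moment formulas are the law of total variance for an equally weighted gaussian mixture. A sketch with made-up per-network predictions (the numbers are not from the paper's trained networks) checks them by Monte Carlo:

```python
import numpy as np

# Hypothetical per-network predictions at one input x.
mu_k = np.array([0.90, 1.10, 1.00, 1.20])     # means of 4 networks
var_k = np.array([0.04, 0.05, 0.03, 0.06])    # variances of 4 networks

mu = mu_k.mean()                               # <mu_k>
var = var_k.mean() + (mu_k ** 2).mean() - mu_k.mean() ** 2

# Monte Carlo check: sample from the equally weighted gaussian mixture.
rng = np.random.default_rng(5)
k = rng.integers(len(mu_k), size=500_000)
samples = mu_k[k] + np.sqrt(var_k[k]) * rng.normal(size=k.shape)
assert abs(samples.mean() - mu) < 0.01
assert abs(samples.var() - var) < 0.01
```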
Figure 2: Neural network fit for univariate data using a 3-layer network (panels show the fitted mean and standard deviation).

Continuing this example we consider data drawn from the bivariate normal distribution (n = 2) with mean (μ₁, μ₂) and covariance matrix
\[ \begin{pmatrix} \sigma_1^2 & \rho\,\sigma_1\sigma_2 \\ \rho\,\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \]
where the means are given by

\[ \mu_1(x) = \sin(2.5x)\,\sin(1.5x), \qquad \mu_2(x) = \cos(3.5x)\,\cos(0.5x) \]

the variances by

\[ \sigma_1^2(x) = 0.01 + 0.25\,[1 - \sin(2.5x)]^2, \qquad \sigma_2^2(x) = 0.01 + 0.25\,[1 - \cos(3.5x)]^2 \]

and the correlation coefficient by

\[ \rho(x) = \sin(2.5x)\,\cos(0.5x) \]
Three thousand training examples were generated with x randomly distributed over [0, π].⁷ These were modeled using a fully connected 3-layer network with 1 input unit, 20 hidden units, and 5 output units (2 for the means and 3 for the inverse covariance matrix). As an effect of regularization, these were pruned to 12 hidden units with 62 nonzero weights and biases. Results are shown in Figure 3. These show a reasonable fit for most of the interval.

⁷Specifically, y₁, y₂ were generated as y₁ = μ₁ + σ₁(αξ₁ + βξ₂) and y₂ = μ₂ + σ₂(αξ₁ − βξ₂), where α² = (1 + ρ)/2 and β² = (1 − ρ)/2, with ξ₁, ξ₂ independent standard normal deviates.
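The sampling recipe in the footnote can be verified directly; this sketch (with an illustrative ρ and scales of my own choosing) checks that the construction yields the intended variances and correlation:

```python
import numpy as np

rng = np.random.default_rng(4)
rho, s1, s2 = 0.6, 1.0, 2.0                 # illustrative parameters
a = np.sqrt((1.0 + rho) / 2.0)
b = np.sqrt((1.0 - rho) / 2.0)
xi1, xi2 = rng.normal(size=(2, 100_000))    # independent standard normals
y1 = s1 * (a * xi1 + b * xi2)               # Var(y1) = s1^2 (a^2 + b^2) = s1^2
y2 = s2 * (a * xi1 - b * xi2)               # Cov(y1, y2) = s1 s2 (a^2 - b^2) = s1 s2 rho

assert abs(np.corrcoef(y1, y2)[0, 1] - rho) < 0.02
assert abs(y1.var() - s1 ** 2) < 0.05
```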
Figure 3: Neural network fit for bivariate data (panels show the means, standard deviations, and correlation).
Figure 3: Neural network tit for bivariate data 4.2 Empirical Data. For an empirical application we consider high frequency financial data relating to worldwide U.S. dollar exchange rates for the Deutsche mark and Japanese yen. The data were subsampled using the bid price of the last quote of each hour between 1 October 1992 and 30 September 1993 excluding weekends.' The aim was to model the conditional correlation between the logarithmic returns on the two exchange rates rp and I { . This requires modeling the conditional distribution P ( X , 1 Xf-, . . . . . Xo)of the bivariate quantity X,= (D,. I f ) ,where hData collected by Olsen & Associates, Zurich. For present purposes a weekend is defined t o be the 48-hr period beginning midnight GMT on a Friday night.
Conditional Multivariate Densities
D_t = log(r_t^D / r_{t−1}^D) and J_t = log(r_t^J / r_{t−1}^J) are the hourly returns for the two currencies.⁹ It was assumed that

\[ P(X_t \mid X_{t-1}, \ldots, X_0) = P(X_t \mid X_{t-1}, \ldots, X_{t-T}) \qquad (t \ge T) \]  (4.1)

where T is some sufficiently large number of time lags. The significance of equation 4.1 is that the likelihood of the full sequence of N observations then factorizes by means of the relation

\[ P(X_{N-1}, \ldots, X_0) = P(X_{T-1}, \ldots, X_0) \prod_{t=T}^{N-1} P(X_t \mid X_{t-1}, \ldots, X_{t-T}) \]  (4.2)
The first term on the right is not modeled and can be considered as a constant. Each conditional distribution P(X_t | X_{t−1}, ..., X_{t−T}) was modeled as bivariate normal. For definiteness T was taken to be 12, so there were 24 inputs corresponding to time-lagged returns on the two currencies. Four additional time-dependent inputs were used to detect possible daily or weekly periodicities in the autocorrelations. These consisted of sin θ_d, cos θ_d, sin θ_w, cos θ_w, where θ_d = 2πt/24 and θ_w = 2πt/120. A three-layer network was used with 20 hidden units and 5 output units. The training set consisted of 6251 observations. Results restricted to the week beginning 1.00 A.M. GMT, Monday 23 November 1992 are shown in Figure 4. It is noticeable that the covariance of the two currencies is consistently positive and that the mark is typically more volatile than the yen. Volatility peaks around 2.00 P.M. during the contemporaneous opening of the European and American markets. The trough at approximately 4.00 A.M. corresponds to the Japanese lunch break. This illustration concerns just 2 variables. We now give an example using 4 variables. This relates to weekly dollar exchange rates on the Deutsche mark, the British pound, the Japanese yen, and the Netherlands guilder for the 10 years to the end of December 1992. Again the consequence (equation 4.2) of assumption 4.1 was used, with X_t now having four components. In this case T was taken to be 8, so that the network had 32 inputs. There were no explicitly time-dependent inputs used in this case. Twenty hidden units were provided together with the 14 output units needed to model the means and covariances. Figure 5 shows the results for 4 of the 6 off-diagonal correlation coefficients for the years 1991 and 1992. The high correlation between the guilder and the mark is evident. Correlations between pound and guilder and between yen and guilder effectively duplicate those with the mark and are, therefore, not shown.
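A sketch of how such lagged inputs with daily and weekly clock features might be assembled (the function name and layout are assumptions, not the paper's code):

```python
import numpy as np

def make_inputs(returns, T=12, hours=None):
    """Build lagged-return inputs plus daily/weekly phase features.

    returns: array of shape (N, 2) of hourly log returns (two currencies).
    hours:   integer hour index t for each row; used for the sin/cos
             clock features with periods 24 and 120 hours (5 trading days).
    """
    N = len(returns)
    if hours is None:
        hours = np.arange(N)
    rows = []
    for t in range(T, N):
        lags = returns[t - T:t][::-1].ravel()      # 2*T lags, most recent first
        theta_d = 2 * np.pi * hours[t] / 24        # daily phase
        theta_w = 2 * np.pi * hours[t] / 120       # weekly phase
        rows.append(np.concatenate([lags,
                                    [np.sin(theta_d), np.cos(theta_d),
                                     np.sin(theta_w), np.cos(theta_w)]]))
    return np.array(rows)

X = make_inputs(np.random.default_rng(5).normal(size=(100, 2)))
assert X.shape == (88, 28)   # 100 - 12 rows; 2*12 lags + 4 time features
```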
Note that the British pound is quoted as a sterling-dollar rate, hence the two negative correlations. Their parallel movement shows that fluctuations are largely dollar determined.
⁹This example is for illustration only. In practice there would probably be greater interest in distributions conditioned on a variety of other factors, including risk-free interest rates, for example.
Figure 4: Conditional variances and covariance of hourly returns on the Deutsche mark and Japanese yen for 23-27 November 1992.

These applications used the regularization techniques of Williams (1995), with the effect that the finally pruned networks used around 400 nonzero weights in the case of the hourly data and around 100 in the case of the weekly data. Results were based on a sample of 20 networks, with the final results derived from an equally weighted mixture of the resulting multivariate gaussians (compare footnote 6). It should be noted that the displayed volatilities and correlations are relative to the means fitted by the model. An alternative would have been to enforce zero means. In practice, the fitted means differ only negligibly from zero. 5 Conclusion
In many applications of neural networks to time series data it is common to use lagged values of a number of series as inputs to the network, but to pick just the next element from one of the series as the target value. At the same time it is implicitly assumed when training the network that the likelihood factorizes over observations, so that the error to be minimized is a simple sum over samples. But this is not correct if the observations are correlated; and it is specifically the possibility of correlation that motivates the inclusion of multiple series as conditioning variables.
Figure 5: Conditional correlation coefficients for weekly returns on the Deutsche mark (D), the British pound (G), the Japanese yen (J), and the Netherlands guilder (N) for 1991 and 1992.

Relation 4.2 allows these cases to be treated in a rigorous way, provided there is a willingness to model the full conditional distribution of the multivariate vector X_t. Modeling conditional correlation is, in any case, a subject of interest in its own right, and the present methods, which have already been applied successfully to a dozen or more variables jointly, provide a new and effective approach.
Acknowledgment
I am indebted to Carol Alexander for valuable advice on the financial data.
References Anderson, T. W. 1958. An Introduction to Multivariate Statistical Analysis. John Wiley, New York. Bishop, C. M. 1993. Curvature-driven smoothing: A learning algorithm for feedforward networks. IEEE Trans. Neural Networks 4(5), 882-884. Bishop, C. M. 1994. Mixture Density Networks. Neural Computing Research Group Report NCRG/4288, Aston University.
Bridle, J. S. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing: Algorithms, Architectures and Applications, F. Fogelman Soulie and J. Herault, eds. Springer-Verlag, Berlin. Buntine, W. L., and Weigend, A. S. 1991. Bayesian back-propagation. Complex Syst. 5, 603-643. Ghahramani, Z., and Jordan, M. I. 1994. Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 120-127. Morgan Kaufmann, San Mateo, CA. Golub, G. H., and Van Loan, C. F. 1989. Matrix Computations, 2nd ed. The Johns Hopkins University Press, Baltimore, MD. Horn, R. A., and Johnson, C. R. 1985. Matrix Analysis. Cambridge University Press, Cambridge, England. Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991. Adaptive mixtures of local experts. Neural Comp. 3, 79-87. Neal, R. M. 1995. Bayesian learning for neural networks. Ph.D. thesis, Graduate Department of Computer Science, University of Toronto. Nix, D. A., and Weigend, A. S. 1995. Learning local error bars for nonlinear regression. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and T. K. Leen, eds., pp. 489-496. MIT Press, Cambridge, MA. Nowlan, S. J., and Hinton, G. E. 1992a. Adaptive soft weight tying using Gaussian mixtures. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds., pp. 993-1000. Morgan Kaufmann, San Mateo, CA. Nowlan, S. J., and Hinton, G. E. 1992b. Simplifying neural networks by soft weight-sharing. Neural Comp. 4, 473-493. Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1992. Numerical Recipes in C, 2nd ed. Cambridge University Press, Cambridge, England. Weigend, A. S., and Nix, D. A. 1994. Predictions with confidence intervals (local error bars). Proc. Int. Conf.
Neural Inform. Process., Seoul, Korea, 847-852. Williams, P. M. 1991. A Marquardt Algorithm for Choosing the Step-Size in Backpropagation Learning with Conjugate Gradients. Cognitive Science Research Paper CSRP 229, University of Sussex. Williams, P. M. 1995. Bayesian regularization and pruning using a Laplace prior. Neural Comp. 7, 117-143.
Received March 8, 1995; accepted October 9, 1995.
Communicated by John Platt and Andreas Weigend
Pruning with Replacement on Limited Resource Allocating Networks by F-Projections Christophe Molina Mahesan Niranjan Cambridge University Engineering Department (CUED), Trumpington Street, Cambridge CB2 1PZ, England
The principle of F-projection, in sequential function estimation, provides a theoretical foundation for a class of gaussian radial basis function networks known as the resource allocating networks (RAN). The ad hoc rules for adaptively changing the size of RAN architectures can be justified from a geometric growth criterion defined in the function space. In this paper, we show that the same arguments can be used to arrive at a pruning with replacement rule for RAN architectures with a limited number of units. We illustrate the algorithm on the laser time series prediction problem of the Santa Fe competition and show that results similar to those of the winners of the competition can be obtained with pruning and replacement. 1 Introduction
Nowadays, there is a lot of interest in neural network architectures whose size can be adapted. Approaches in this direction are characterized by ad hoc rules based on thresholding some parameter and the hope that such networks will find solutions in which the model complexity somehow matches that of the data. One such approach is the resource allocating network (Platt 1991). The RAN is a gaussian radial basis function network (GaRBF). It sequentially processes data and, based on two thresholds (a prediction error threshold and a data novelty threshold), either adapts the existing network parameters or grows the network. The individual basis functions of the GaRBF models operate essentially in a local manner, making this model an ideal candidate for such adaptation of architecture. Growing and pruning in the RAN is related to a class of pruning algorithms that estimates the sensitivity of an objective function to removal of an element of the network (Reed 1993). Within this class, optimal brain damage (LeCun et al. 1990) calculates the saliency s_j of weight w_j in parameter space as

s_j = (1/2) (∂²E/∂w_j²) w_j²    (1.1)
Neural Computation 8, 855-868 (1996)
@ 1996 Massachusetts Institute of Technology
Pruning on LRAN by F-Projections
856
where the objective function E is the mean squared error over a training set described by

T = {(x_n, y_n) : x_n ∈ X, y_n ∈ ℝ}    (1.2)
and generated by an underlying function F(x) = y. Similarly, skeletonization (Mozer and Smolensky 1989) reduces pruning to the estimation in the function space of the relevance of a unit k as

ρ_k = E_{without unit k} − E_{with unit k}    (1.3)
where the objective function E is the mean absolute error over the training samples. Both techniques train a feedforward network until a reasonable solution has been obtained, then prune an element in terms of its saliency or relevance, followed by retraining. This requires the total change in error E to be calculated for every unit or weight in the network, which is a computationally expensive task. A better objective function for the estimation of the relevance of a unit in the function space is the L2-norm of the distance between the neural network mapping f and the underlying problem F, given by

E = ‖f − F‖_{L2}    (1.4)

This objective function requires units with finite L2-norm. A radial basis function network f, described by a linear combination of gaussian radial basis units φ_k,

f(x) = Σ_{k=1}^{K} α_k φ_k(x)    (1.5)

φ_k(x) = exp(−‖x − μ_k‖² / σ_k²)    (1.6)

with K units, satisfies this requirement.
Although the underlying function F is unknown, and hence it is impossible to calculate E, an approximation of ρ_k in terms of the network f with and without the unit k may be obtained by directly considering the relevance of a unit in the function space. From the geometric representation of a K-unit network shown in Figure 1, a good estimation of the relevance of unit k is stated as

ρ_k = ‖f − f*‖ = |α_k| · ‖φ_k‖ · sin(θ_k)    (1.7)
Christophe Molina and Mahesan Niranjan
857
Figure 1: Geometric illustration of GaRBF networks f (with unit k) and f′ (without unit k) in a three-dimensional function space ℋ_K. f* represents the orthogonal projection of network f onto the subspace ℋ_{K−1} containing f′.
where f* is the orthogonal projection of f onto the subspace ℋ_{K−1} and θ_k is the angle between unit k and the subspace ℋ_{K−1}. The interest of using the projection f* of the network, instead of the network f′ (f without unit k), is that it takes into account the best network with K − 1 units that may be obtained if the network is retrained to absorb the loss of information due to the pruning of unit k. Moreover, equation 1.7 has a useful property that makes it suitable for growing and pruning: the relevance ρ_k does not need to be computed over the whole training set. The approximation of the relevance of a unit given above was proposed in Kadirkamanathan and Niranjan (1993) for growing GaRBF networks in sequential learning and has led to a theoretical foundation for the RAN network. In this paper, we show how the above work can be extended to provide automatic and sequential pruning with replacement for resource allocating networks having a limited number of units (LRAN). The lack of a pruning rule for a network having a limited number of units may leave it with insufficient resources and no way of dealing with the need for additional units. Units that are induced by noise in the data, during the early stages of the algorithm, are a waste of resources. Too large a network can lead to overtraining. Finally, in any hardware implementation, the resources are going to be finite. This paper is organized as follows: In Section 2 we give a brief review of the F-projection geometric growth criterion and the simplifications leading to the RAN algorithm. In Section 3 we describe how the framework naturally extends to a pruning scheme. Section 4 gives an experimental evaluation of the pruning scheme on the prediction of the laser time series of the Santa Fe competition (Weigend and Gershenfeld 1993).
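The geometric quantities behind equation 1.7 can be computed from function values alone. The following numpy sketch (with hypothetical unit parameters, and the function space discretized on a grid, which is an approximation not used by the authors) estimates the relevance of a unit as the distance from f to its orthogonal projection onto the span of the remaining units:

```python
import numpy as np

# Discretize the function space: each unit is represented by its values on a grid.
x = np.linspace(-1.0, 1.0, 2001)

def gaussian_unit(center, width):
    """A gaussian radial basis unit evaluated on the grid."""
    return np.exp(-((x - center) ** 2) / width ** 2)

# A toy 3-unit network f = sum_k alpha_k * phi_k (illustrative parameters).
alphas = np.array([0.8, -0.4, 0.3])
units = [gaussian_unit(c, w) for c, w in [(-0.5, 0.2), (0.0, 0.3), (0.6, 0.25)]]

def relevance(k):
    """rho_k = ||f - f*||: distance from f to its orthogonal projection f*
    onto the span of the other units (cf. equation 1.7)."""
    f = sum(a * u for a, u in zip(alphas, units))
    others = np.stack([u for j, u in enumerate(units) if j != k], axis=1)
    # Least-squares coefficients give the orthogonal projection onto span(others).
    coeffs, *_ = np.linalg.lstsq(others, f, rcond=None)
    f_star = others @ coeffs
    dx = x[1] - x[0]
    return np.sqrt(np.sum((f - f_star) ** 2) * dx)

for k in range(3):
    print(f"unit {k}: relevance {relevance(k):.4f}")
```

A unit whose amplitude is zero contributes nothing to f, so its relevance under this measure is zero, which is the property the pruning rule exploits.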
2 The F-Projection Growth Criterion
Starting from a sequential function estimation approach, Kadirkamanathan and Niranjan (1993) arrived at a theoretical foundation for growing RANs when a new observation (x_n, y_n) occurs, in which the addition of a new GaRBF φ_{K+1} to a K-unit network was governed by ρ_{K+1} in equation 1.7 exceeding a given threshold ε,

|α_{K+1}| · ‖φ_{K+1}‖ · sin(θ_{K+1}) > ε    (2.1)
Under the equality constraint f(x_n) = y_n and a smoothness constraint that considers the underlying function F as smooth, the new GaRBF values are assigned as follows:

α_{K+1} = y_n − f(x_n)    (2.2)

μ_{K+1} = x_n    (2.3)

σ_{K+1} = λ inf_{k=1…K} ‖x_n − μ_k‖    (2.4)

where λ is an overlap factor that provides smoothness to the network in terms of the minimal distance from the center of the new unit to the rest of the units. The norm of φ_{K+1} depends only on the width σ_{K+1}, which can be considered as constant for each new observation. Therefore, equation 2.1 depends only on the parameters α_{K+1} and θ_{K+1}, and the criterion to increment the complexity of the network can be split into two parameters both exceeding threshold values,

|α_{K+1}| > α_min  and  θ_{K+1} > θ_min    (2.5)
Because of the difficulty of evaluating the angle θ, Platt and Kadirkamanathan propose an approximation equivalent to the minimal distance between the input x_n and the GaRBF centers μ_k, k = 1…K, expressed as

inf_{k=1…K} ‖x_n − μ_k‖ > ε_n    (2.6)

where ε_n decreases exponentially (Platt 1991) from the upper bound ε_0 until it reaches a lower bound ε_min,

ε_n = max[ε_min, ε_0 · exp(−n/τ)]    (2.7)
and τ is a decay constant. Hence, a new GaRBF at step n is added for the observation (x_n, y_n) if the criteria given in equation 2.5 are satisfied. When the observation (x_n, y_n) does not satisfy the novelty criteria, the LMS algorithm or the extended Kalman filter (EKF) algorithm is used to adapt the output layer coefficients α_k and the GaRBF centers μ_k.
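A compact sketch of the two novelty tests, with the exponentially decaying threshold of equation 2.7 (all parameter values here are illustrative, not the authors'):

```python
import numpy as np

def novelty(x_n, y_n, f, centers, n, eps0=1.0, eps_min=0.1, tau=50.0, alpha_min=0.05):
    """Return True when observation (x_n, y_n) should spawn a new unit.

    Two criteria must hold (cf. equation 2.5 with the distance
    approximation of equation 2.6):
      1. prediction error:   |y_n - f(x_n)|       > alpha_min
      2. distance to units:  min_k ||x_n - mu_k|| > eps_n
    where eps_n decays exponentially (equation 2.7).
    """
    eps_n = max(eps_min, eps0 * np.exp(-n / tau))
    error_novel = abs(y_n - f(x_n)) > alpha_min
    distance_novel = min(np.linalg.norm(x_n - mu) for mu in centers) > eps_n
    return bool(error_novel) and bool(distance_novel)

# Example: a network that currently predicts zero everywhere, one unit at 0.0.
f = lambda x: 0.0
print(novelty(np.array([0.9]), 1.0, f, [np.array([0.0])], n=10))   # far + large error -> True
print(novelty(np.array([0.01]), 0.0, f, [np.array([0.0])], n=10))  # near + no error -> False
```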
The F-projection growing technique on RANs is then stated as follows:

for each observation (x_n, y_n)
    ε_n = max[ε_min, ε_0 · exp(−n/τ)]
    if |y_n − f(x_n)| > α_min and inf_{k=1…K} ‖x_n − μ_k‖ > ε_n then
        Allocate a new unit with
            α_{K+1} = y_n − f(x_n)
            μ_{K+1} = x_n
            σ_{K+1} = λ inf_{k=1…K} ‖x_n − μ_k‖
        Update the number of units, K = K + 1
    else
        Adapt the network coefficients (and covariance matrices) by LMS (or EKF)
    end if
end for

3 Pruning Limited Resource Allocating Networks (LRAN)
Pruning techniques have been designed to be implemented on systems with an unlimited or large number of units [see Reed (1993) for a survey of such techniques]. The pruning process removes a large number of units, leaving only a reduced subset to be employed in the final task. This is feasible for software networks, but may be unrealistic for hardware networks where the number of units is severely limited. Before developing the F-projection pruning criterion, we place the RAN in a more realistic and practical context in which the network may grow to contain a maximum number of units K_max. This is the case in hardware networks, for example, for which the training algorithm is wired and performed on-line. The principle of F-projection views pruning as a problem of providing an input-output mapping in the function space with a minimal loss of information. In this context, the pruning technique assists in making the best use of the available units, and a unit is pruned only when a relevant observation arrives and no free unit is available to store this information in the network. Thus, pruning can be viewed as a generalization of growing in which the total information contained in the network is maximized by pruning the least relevant unit and replacing it by another, more relevant unit. The reuse of units was first suggested and implemented by Anderson (1993) in a modified RAN architecture employing reinforcement learning. The advantage of our approach is that it remains in Platt's RAN framework, which is globally justified by the F-projection principle. The relevance ρ_k of a unit k, among the K_max units of the network, has been stated in equation 1.7 in terms of its amplitude, L2-norm, and
angle with the K-unit network. Because only ρ_k is required to compare the relevance of unit k to the relevance of the other units of the network, the norm ‖φ_k‖ may be replaced by the power of the width σ_k in equation 1.7 and, consequently, ρ_k may be expressed as

ρ_k = |α_k| · σ_k · inf_{j≠k} ‖μ_k − μ_j‖    (3.1)
where the approximation of the angle given in equation 2.6 is taken into account. The comparison between the relevances of the least relevant unit φ_j and the candidate unit φ_n from a new observation (x_n, y_n) is performed in the same K_max-dimensional function space. Hence, the relevance ρ_n of the candidate unit is stated as

ρ_n = |α_n| · σ_n · inf_{k=1…K_max} ‖x_n − μ_k‖    (3.2)
with amplitude, width, and center parameters assigned as follows:

α_n = y_n − f(x_n)    (3.3)

σ_n = λ inf_{k=1…K_max} ‖x_n − μ_k‖    (3.4)

μ_n = x_n    (3.5)
where the least relevant unit φ_j has been temporarily pruned and replaced by the candidate unit. The F-projection pruning technique may therefore be stated as follows:

for each observation (x_n, y_n)
    when no more units remain free
        Compute the relevance of each unit and keep the least relevant unit φ_j and its relevance ρ_j
        Compute the relevance ρ_n for the candidate unit
        if ρ_n > ρ_j then
            Replace φ_j by φ_n: α_j = α_n, μ_j = μ_n, and σ_j = σ_n
        else
            Adapt the network coefficients (and covariance matrices) by LMS (or EKF)
        end if
    end when
end for
Although both techniques may be studied separately, pruning with replacement implicitly contains the growing technique. This is easily explained as follows. Suppose that the network initially contains the final limited number of units. These units may be randomly initialized with very low amplitudes and narrow widths (which is equivalent to having no units at the beginning of learning). Once new observations arrive, these random units are pruned and replaced by more relevant units according to the pruning technique. It is evident that no growing is necessary at this stage, and that the network behavior will not change significantly compared with the previous growing algorithm. Moreover, the use of the artificial thresholds ε_min and α_min for the distance between units and the minimal unit allocation error, respectively, is no longer necessary. The final F-projection algorithm for LRANs may then be stated as follows:

1. Allocate K units at random positions with zero amplitude and small width.
2. Execute the F-projection pruning with replacement algorithm.
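The whole LRAN procedure can be condensed into a short sketch. This is a one-dimensional toy version, not the authors' implementation: the relevance approximation ρ = |α|·σ·(distance to the nearest other center) is a reconstruction consistent with the values in Table 1, gradient adaptation is reduced to a single LMS step, and all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
K, lam, lr = 5, 0.87, 0.02

# Random initial units with zero amplitude and small width ("no units" yet).
mu = rng.uniform(-0.75, 0.75, K)     # centers
sigma = np.full(K, 1e-3)             # widths
alpha = np.zeros(K)                  # amplitudes

def predict(x):
    return float(np.sum(alpha * np.exp(-((x - mu) ** 2) / sigma ** 2)))

def relevance(a, s, c, centers):
    """rho = |amplitude| * width * distance to the nearest other center."""
    d = np.min(np.abs(c - centers)) if len(centers) else np.inf
    return abs(a) * s * d

def observe(x_n, y_n):
    global alpha
    err = y_n - predict(x_n)
    # Candidate unit built from the new observation (cf. equations 3.3-3.5).
    a_c, m_c = err, x_n
    s_c = lam * np.min(np.abs(x_n - mu))
    rho = [relevance(alpha[k], sigma[k], mu[k], np.delete(mu, k)) for k in range(K)]
    j = int(np.argmin(rho))
    rho_c = relevance(a_c, s_c, m_c, mu)
    if rho_c > rho[j]:
        alpha[j], mu[j], sigma[j] = a_c, m_c, s_c   # prune and replace
    else:
        phi = np.exp(-((x_n - mu) ** 2) / sigma ** 2)
        alpha = alpha + lr * err * phi              # LMS adaptation

for x_n in rng.uniform(-0.75, 0.75, 200):
    observe(x_n, np.cos(2 * np.pi * x_n))

print("final widths:", np.round(sigma, 3))
```

Because the initial units have zero amplitude, their relevance is zero, so early observations replace them immediately, reproducing the growing phase without explicit thresholds.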
3.1 F-Projection Pruning with Replacement on a Toy Problem. Figure 2 shows graphically how F-projection pruning with replacement works for a synthetic problem when new observations arrive. The test consisted of the regression of a one-dimensional curve generated by the equation
y = cos(2πx) · exp(−x²/0.23),   x ∈ [−0.75, 0.75]    (3.6)
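The training observations for this toy problem can be drawn as follows. Note that the exponent of the envelope in equation 3.6 is partly illegible in the scan, so the decaying envelope exp(−x²/0.23) used here is an assumption:

```python
import numpy as np

rng = np.random.default_rng(42)

def target(x):
    """Toy regression curve of equation 3.6 (envelope width assumed)."""
    return np.cos(2 * np.pi * x) * np.exp(-x ** 2 / 0.23)

x_obs = rng.uniform(-0.75, 0.75, 200)
y_obs = target(x_obs)
print(x_obs.shape, y_obs.shape)
```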
A set of 200 randomly generated observations {x, y} was presented to an LRAN network containing 5 units. The figures on the left show the target function, the approximation, as well as the new observation. The right-hand figures illustrate the contributions made by each of the five units. Table 1 provides numerical results for the amplitude, center, width, distance, and value of the relevance ρ_k as defined in equation 3.1 for each unit when pruning with replacement is performed. The relevance ρ_n of the new observation (Obs) is calculated using equation 3.2. In Figures 2a-d, units 1, 5, 2, and 4, respectively, are pruned and replaced by the new observations. Since gradient descent training occurs after each pruning and replacement step, the contributions of the units have slightly different shapes and positions from one figure to the next.

4 Experimental Results
The F-projection growing and pruning technique was tested on the laser data of the Santa Fe competition. These data consist of clean laser intensity data collected from a laboratory experiment (Weigend and Gershenfeld 1993).
Figure 2: Pruning with replacement for a 5-unit LRAN during the approximation of the function given in equation 3.6. Figures on the left show the original function (dash-dot), its LRAN approximation (solid), and the newly arrived observations (circles) at different stages. Figures on the right show the five units (numbered) and their positions. Units 1, 5, 2, and 4 are pruned and replaced by the new observation in turn. Numerical information is given in Table 1.
Table 1: Numerical Results for the Pruning with Replacement of a 5-Unit LRAN as Shown in Figure 2ᵃ

Figure  Unit (k)  Amplitude (α_k)  Center (μ_k)  Width (σ_k)  Distance  Relevance (ρ_k)  Prune and replace
a       1          0.849            0.080        0.006        0.050     0.000            Yes
        2          0.984            0.030        0.044        0.050     0.002            -
        3         -0.358            0.430        0.304        0.230     0.025            -
        4          0.142            0.660        0.200        0.230     0.007            -
        5         -0.020           -0.710        0.687        0.740     0.010            -
        Obs       -0.222           -0.530        0.157        0.180     0.006
b       1         -0.222           -0.530        0.157        0.180     0.006            -
        2          0.984            0.030        0.044        0.462     0.020            -
        3         -0.353            0.492        0.304        0.186     0.020            -
        4          0.106            0.678        0.200        0.186     0.004            -
        5         -0.007           -0.710        0.687        0.180     0.001            Yes
        Obs        0.624            0.150        0.104        0.120     0.008
c       1         -0.222           -0.530        0.157        0.585     0.020            -
        2          0.985            0.055        0.044        0.066     0.003            Yes
        3         -0.400            0.485        0.304        0.199     0.024            -
        4          0.097            0.684        0.200        0.199     0.004            -
        5          0.666            0.121        0.104        0.066     0.005            -
        Obs        0.859            0.000        0.106        0.121     0.011
d       1         -0.222           -0.530        0.157        0.530     0.018            -
        2          0.859            0.000        0.106        0.121     0.011            -
        3         -0.400            0.485        0.304        0.199     0.024            -
        4          0.097            0.684        0.200        0.199     0.004            Yes
        5          0.666            0.121        0.104        0.121     0.008            -
        Obs        0.227            0.640        0.135        0.155     0.005

ᵃThe relevances of units and of new observations are calculated according to equations 3.1 and 3.2, respectively, and the least relevant units are pruned and replaced by the new observations.
Although deterministic, its behavior is chaotic, as shown in Figure 3. The laser data given for the competition consisted of 1000 observations {y_n}, n = 1, …, 1000, and the goal was to predict 100 observations ahead at five different times t = (1000, 2180, 3870, 4000, 5180) in the time series. Observations were coded into just 8 bits, between the values 3 and 255. No information was given about the dynamics of the laser data, nor about its embedding dimension and variable dependencies, also known as the time series representation. We used an algorithm based on geometric techniques to determine the embedding dimension of the laser data (Molina and Niranjan 1995). According to our algorithm, the most relevant past observations for the prediction of the next turned out to be the following 27 observations: {1, 2, 5, 7, 9, 11, 13, …, 18, 21, 23, 25, 27, 34, 38, 41, 44, 49, 52, 61, 68, 69, 79, 95}.
Figure 3: Original 1000 laser observations provided in the time series prediction Santa Fe competition.
The task of prediction may be achieved using three different techniques. The first consists of using a one-step-ahead predictor from observation (y_t) and recursively predicting observations (y_{t+1}) up to (y_{t+100}). In this case, estimated observations are used to construct input vectors. The second technique consists of directly predicting observation (y_{t+100}) from available data. Finally, a combination of both can be used (Sauer 1993). In our experiments, we implemented a recursive predictor, which turned out to be of sufficient quality for our prediction purposes. We used an LRAN network with 200 units that were initially randomly distributed in the input space. Their amplitudes α_k were initialized with zero values and their widths σ_k with the value 10⁻¹. The overlap factor λ was fixed at 0.87. During learning, the same gradient descent technique (LMS) used in Platt (1991) was employed to adapt the LRAN coefficients. The learning rate was fixed at 0.02. Training data were randomized and presented to the network during 20 epochs. Figure 4 shows the results obtained by LRAN under the above configuration for the one-step prediction of the laser data at 5 different initial points. The results clearly show that each prediction remains close to its corresponding observation. Numerical results are given in Table 2. Figure 5 shows the same portions of the laser time series, this time predicted using a recursive 100-step-ahead LRAN predictor. The results turned out to correspond closely to the original time series. Differences generally arose due to a small drift in the prediction at each step with respect to the target observation, particularly in the first portion, where the predictor completely misled the prediction at approximately time 1050. Despite this, the results still match the targets closely.
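The recursive 100-step scheme described above can be sketched as follows. The one-step predictor is left generic, and the lag set here is a hypothetical subset of the 27 selected lags, for illustration only:

```python
import numpy as np

# Hypothetical subset of the selected past-observation lags (illustration only).
LAGS = [1, 2, 5, 7, 9, 11]

def recursive_forecast(history, one_step, horizon=100):
    """Feed each one-step prediction back into the input window.

    history  : sequence of past observations y_1..y_n
    one_step : callable mapping a lag vector to the next value
    """
    buf = list(history)
    out = []
    for _ in range(horizon):
        lag_vec = np.array([buf[-lag] for lag in LAGS])
        y_hat = one_step(lag_vec)
        out.append(y_hat)
        buf.append(y_hat)          # estimated values build later inputs
    return np.array(out)

# Sanity check with a persistence predictor (repeats the previous value).
series = np.sin(np.arange(200) * 0.1)
pred = recursive_forecast(series, one_step=lambda v: v[0], horizon=100)
print(pred[:3])
```

The drift mentioned in the text arises exactly here: each y_hat appended to the buffer carries its error into every subsequent lag vector.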
Figure 4: LRAN one-step prediction (continuous curves) at different starting points of the laser time series. The dashed curves correspond to the true time series.
Figure 5: LRAN 100-recursive-step prediction (continuous curves) at different starting points of the laser time series. The dashed curves correspond to the true time series.
Table 2: NMSE Obtained by LRAN (1-Step Ahead and 100-Steps Ahead), Delay Coordinate Embedding (Sauer 1993), and Internal Delay Lines (Wan 1993) on the Prediction of Laser Data at Different Starting Points

Starting point  NMSE one-step ahead  NMSE 100-steps ahead  NMSE Sauer  NMSE Wan
1000            0.0102               0.2515                0.0770      0.0270
2180            0.0082               0.0093                0.1740      0.0650
3870            0.0119               0.1237                0.1830      0.4870
4000            0.0072               0.0736                0.0060      0.0230
5180            0.0119               0.0228                0.2540      0.1600
Total           0.0494               0.4809                0.5510      0.7620
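The NMSE figures in Table 2 follow the Santa Fe convention of normalizing the mean squared error by the variance of the target segment, so that predicting the segment mean scores 1.0. A sketch:

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized mean squared error: MSE divided by the variance of the
    targets, so a predictor that always outputs the segment mean scores 1.0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
print(nmse(y, y))                          # perfect prediction -> 0.0
print(nmse(y, np.full_like(y, y.mean())))  # mean predictor -> 1.0
```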
5 Conclusion

We have presented an algorithm for pruning and replacing units in limited resource allocating networks. The pruning algorithm is based on the F-projection principle of sequential function estimation, and generalizes previous work on growing RANs (Kadirkamanathan and Niranjan 1993). The performance of growing and pruning techniques for LRAN has been illustrated using the laser time series of the Santa Fe competition, showing that a neural network solution based on the reallocation of units predicts the time series as successfully as the approaches used by the winners of the Santa Fe competition.
Acknowledgment

The authors wish to thank Thomas Niesler and two anonymous referees for their valuable comments and suggestions. This research was supported by EPSRC grant No. GR/H16759.
References
Anderson, C. W. 1993. Q-learning with hidden-unit restarting. In Advances in NIPS, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., No. 5, pp. 81-88. Kadirkamanathan, V., and Niranjan, M. 1993. A function estimation approach to sequential learning with neural networks. Neural Comp. 5(6), 954-975. LeCun, Y., Denker, J. S., and Solla, S. A. 1990. Optimal brain damage. In Advances in NIPS, D. S. Touretzky, ed., No. 2, pp. 598-605. Molina, C., and Niranjan, M. 1995. Finding the Embedding Dimension of Time Series by Geometrical Techniques. Tech. Rep. CUED/F-INFENG/TR.221, Cambridge University Engineering Department, Cambridge, England.
Mozer, M. C., and Smolensky, P. 1989. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in NIPS, D. S. Touretzky, ed., No. 1, pp. 107-115. Platt, J. C. 1991. A resource allocating network for function interpolation. Neural Comp. 3(2), 213-225. Sauer, T. 1993. Time series prediction by using delay coordinate embedding. In Time Series Prediction: Forecasting the Future and Understanding the Past, A. S. Weigend and N. A. Gershenfeld, eds., pp. 175-194. Addison-Wesley, Reading, MA. Wan, E. A. 1993. Time series prediction by using a connectionist network with internal delay lines. In Time Series Prediction: Forecasting the Future and Understanding the Past, A. S. Weigend and N. A. Gershenfeld, eds., pp. 195-217. Addison-Wesley, Reading, MA. Weigend, A. S., and Gershenfeld, N. A. 1993. Time Series Prediction: Forecasting the Future and Understanding the Past, 1st ed. Addison-Wesley, Reading, MA.

Received October 17, 1994; accepted October 3, 1995.
Communicated by Dan Hammerstrom and Marwan Jabri
Engineering Multiversion Neural-Net Systems D. Partridge W. B. Yates Department of Computer Science, University of Exeter, Exeter EX4 4PT, England
In this paper we address the problem of constructing reliable neural-net implementations, given the assumption that any particular implementation will not be totally correct. The approach taken in this paper is to organize the inevitable errors so as to minimize their impact in the context of a multiversion system, i.e., the system functionality is reproduced in multiple versions, which together will constitute the neural-net system. The unique characteristics of neural computing are exploited in order to engineer reliable systems in the form of diverse, multiversion systems that are used together with a "decision strategy" (such as majority vote). Theoretical notions of "methodological diversity" contributing to the improvement of system performance are implemented and tested. An important aspect of the engineering of an optimal system is to overproduce the components and then choose an optimal subset. Three general techniques for choosing final system components are implemented and evaluated. Several different approaches to the effective engineering of complex multiversion system designs are realized and evaluated to determine overall reliability as well as the reliability of the overall system in comparison to the lesser reliability of its component substructures. 1 Introduction
In this paper, we shall examine several novel strategies for constructing a multiversion neural-net system that is more reliable than the individual components from which it is constructed. A multiversion system is one in which the basic functionality of the system is reimplemented in different versions. To reap any benefit, the versions must be diverse, i.e., they should be different approximations to the desired function. In the context of a voting decision strategy, the necessary diversity is obtained when the versions contain different errors, i.e., a lack of coincident failure: when one version fails, others do not exhibit the same fault. A comprehensive theoretical analysis of multiversion system performance, using only conventional programs as the versions, has been provided by Littlewood and Miller (1989). They prove that diversity will pay off in terms of improved system reliability, but the question remains

Neural Computation 8, 869-893 (1996)
© 1996 Massachusetts Institute of Technology
of how best to generate diversity. They suggest that diversity of process (i.e., the way the versions are constructed) should lead to diversity of the products. In the neural computing context, the products are trained networks and the process by which they are obtained is an algorithmic one of training. Diversity of process is then determined by the initial conditions for training, e.g., net type chosen, net architecture, training set structure, training algorithm parameter settings. In earlier studies we have established an ordering (with respect to subsequent diversity) for a number of the essential determinants of the initial conditions for training. In this study we use this ordering to exploit "methodological diversity" in the way that Littlewood and Miller suggest. The purpose of this study is to explore possibilities for systematically engineering diversity in multiversion systems. There is no doubt of the overall payoff from diversity, but can we reliably generate it? And can we generate enough of it to make the extra effort worthwhile? A proper cost-benefit analysis must be problem specific, as a 1% increase, for example, may be trivial or highly significant, depending upon particular problem circumstances. Although we systematically exploit the known diversity-generating potential of some parameters, our method also tacitly acknowledges that precise engineering of maximum diversity is beyond the current state of our knowledge in neural computing. So, instead of attempting to generate only the necessary diverse networks directly, we train a population of networks that is larger than actually required. We then choose from this "space" of trained networks, the version space, a small number of specific versions as the final multiversion system. This overproduction and subsequent choice is the overproduce and choose strategy.
The chosen versions, together with the appropriate decision strategy (e.g., majority vote), constitute the final multiversion system. We explore three alternative strategies for choosing a diverse set of 15 versions from the version space, and we explore two ways to constrain the choices: freely from the complete space, or three sets of five each from a (methodologically) diverse subspace. We compare two arrangements of each 15 (a simple set of 15, and three sets of five), which also permits the evaluation of two different voting strategies on each of the two architectural variants, as illustrated in Figure 1. Each overall system performance is evaluated relative to the performance of the best individual component versions (and groups of versions) to determine the relative success of the overall system designs.
2 Tasks
To probe the idea that methodological diversity can be exploited to obtain product diversity, we use three different tasks: two well-defined abstract
[Figure 1 diagram: the FLAT 15 SYSTEM (majority of 3 randomly selected versions, or overall majority of 15) and the TWO-LEVEL 3 x 5 SYSTEM (majority of 3 majorities, one version from each set), the alternative multiversion system designs.]
Figure 1: The two system architectures and decision strategies explored.

tasks (LIC1 and LIC4) and one real-world, data-defined task, a character recognition problem (OCR). LIC1 and LIC4 are two (of 15) Launch Interceptor Conditions that together are a major component of the Launch Interceptor Problem [specified in Knight and Leveson (1986)], a problem that we are investigating as a whole in other research. In this study we shall provide full details with respect to LIC1 only, but present the overall results from the other two tasks as an indication that our general conclusions are not artifacts of LIC1. LIC1 is specified as a predicate that evaluates to true if "There exists at least one set of two consecutive data points that are a distance greater than the length LENGTH1 apart. (0 ≤ LENGTH1)" where LENGTH1 is a parameter of the condition. We shall abstract a functional representation of this precise, though informal, description. The intention here is to make explicit the exact structure and semantics of the network task. Let S = [0,1] × [0,1] represent the set of all data points, let a = (x1, y1) and b = (x2, y2) represent two consecutive data points, and note that the function d(a, b) = √((x1 − x2)² + (y1 − y2)²) is the Euclidean distance on S.
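Under these definitions, the LIC1 predicate is a simple threshold on the Euclidean distance between consecutive points; a minimal sketch (the example point values are illustrative, not from the paper):

```python
from math import dist  # Euclidean distance between two points (Python >= 3.8)

def lic1(points, length1):
    """True if at least one pair of consecutive data points is more than
    LENGTH1 apart (0 <= LENGTH1)."""
    return any(dist(a, b) > length1 for a, b in zip(points, points[1:]))

pts = [(0.0, 0.0), (0.3, 0.4), (0.35, 0.4)]
print(lic1(pts, 0.45))  # first pair is 0.5 apart -> True
print(lic1(pts, 0.60))  # no pair exceeds 0.6 -> False
```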
Formally, we require a neural network implementation of the bounded function f : S × S × [0,1] → {0,1} defined by

f(a, b, LENGTH1) = 1 if d(a, b) > LENGTH1, and 0 otherwise.
In fact, as our task will ultimately be executed on some digital computer, we shall restrict our attention to the finite subset of rationals, with 6 decimal places, in S × S × [0,1]. In this special case we note that f is also continuous. LIC4 is a similar, but more complex, Boolean function: three data points demarcate a triangle, and the task is to decide whether the area of this triangle is greater than the value of a seventh parameter. The OCR task was taken from the Machine Learning database at UC Irvine. The problem is to classify 16 parameter values as one of the 26 characters A-Z. The database totals 20,000 examples. Full details are provided by Frey and Slate (1991), who explored a variety of rule-induction approaches and demonstrated 60-80% correct recognition (when trained on 16,000 and tested on 4000), with two particular versions above this level giving 81.6 and 82.7% performances.

3 Measures of Performance
Two rather different types of performance measure need to be defined: the decision strategies to be used, and the diversity generated both within and between version sets. Both measures are based on p_n, the probability that exactly n versions fail on a random test, where N is the total number of versions and 0 ≤ n ≤ N. This probability can be computed from the results of the version set on a large and representative test set of data, in which case p_n = (no. of tests that fail on n versions)/(total number of tests).

3.1 Decision Strategies. The statistic for overall or simple majority,¹ p(maj), is simply the summation of the probabilities that exactly 0, 1, …, k versions fail, where k = N − majority number:

p(maj) = Σ_{n=0}^{k} p_n
The following statistic, p(maj3), defines the probability that a majority of three versions selected at random are correct:

p(maj3) = Σ_{n=0}^{N} [(N−n)(N−n−1)(N−n−2) / (N(N−1)(N−2))] p_n + Σ_{n=1}^{N} 3 [n(N−n)(N−n−1) / (N(N−1)(N−2))] p_n

¹When N is odd, as in all examples in this study.
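Both statistics can be computed directly from the failure distribution p_n; a sketch, with an illustrative (not empirical) distribution:

```python
def p_maj(p, N):
    """Probability that a simple majority of N versions is correct:
    sum of p_n for n up to k = N - majority number (N odd)."""
    k = N - (N // 2 + 1)
    return sum(p[n] for n in range(k + 1))

def p_maj3(p, N):
    """Probability that a majority of 3 randomly selected versions is correct:
    P(all three correct) + P(exactly one of the three fails),
    given n failing versions among N."""
    total = 0.0
    for n in range(N + 1):
        all3 = (N - n) * (N - n - 1) * (N - n - 2)
        one_fails = 3 * n * (N - n) * (N - n - 1)
        total += (all3 + one_fails) / (N * (N - 1) * (N - 2)) * p[n]
    return total

# Illustrative failure distribution over N = 15 versions.
N = 15
p = [0.0] * (N + 1)
p[0], p[1], p[2], p[5] = 0.7, 0.15, 0.1, 0.05
print(p_maj(p, N), p_maj3(p, N))
```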
To compare the reliability of a single-set system with that of a multiset system, we require a statistic to give the probability that a majority out of three versions is correct, p(maj3)_{ABC}, when the three versions are selected one at random from each of the sets A, B, and C, which contain N_A, N_B, and N_C versions, respectively:

p(maj3)_{ABC} = Σ_{n_A,n_B,n_C} p_{n_A n_B n_C} [ ((N_A−n_A)/N_A)((N_B−n_B)/N_B)((N_C−n_C)/N_C) + (n_A/N_A)((N_B−n_B)/N_B)((N_C−n_C)/N_C) + ((N_A−n_A)/N_A)(n_B/N_B)((N_C−n_C)/N_C) + ((N_A−n_A)/N_A)((N_B−n_B)/N_B)(n_C/N_C) ]
The first term in the final formulation2 is the probability that all three versions will be correct, and the second, third, and fourth terms are the probabilities that exactly one of the 3 versions will fail-one term each for the probability of failure of one chosen version and the nonfailure of the other two, when the failing version is from set A, from set B and from set C, respectively. The majority decision strategy for multiset systems is to take the majority of each component set and then take the majority of the componentset majorities. This statistic is p(rnajmaj). If p ( m ~ jis) ~the~ probability ~ that a majority of versions are correct when exactly iz versions fail in set X, then p(nzai)lly=
x;'Note: \
1 0
if n < majority number otherwise
for brevity we write a triple summation, such as
NII.N,
, . u ~ = ~ . I I ,= I .
x ~ ~ - , l x ~as~ - , l ~ ~ , c
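The four-term formulation for p(maj3)_{ABC} can be read directly as code. The sketch below is a hypothetical helper; the dict-based representation of the joint failure distribution is our assumption, not the paper's.

```python
def p_maj3_abc(p_joint, NA, NB, NC):
    """p(maj3)_ABC: draw one version from each of sets A, B, C and take
    the majority.  p_joint[(na, nb, nc)] is the probability that exactly
    na, nb, nc versions fail in A, B, C (an assumed dict representation)."""
    total = 0.0
    for (na, nb, nc), p in p_joint.items():
        ca, cb, cc = (NA - na) / NA, (NB - nb) / NB, (NC - nc) / NC
        fa, fb, fc = 1.0 - ca, 1.0 - cb, 1.0 - cc
        total += (ca * cb * cc          # all three chosen versions correct
                  + fa * cb * cc        # only the version from A fails
                  + ca * fb * cc        # only the version from B fails
                  + ca * cb * fc) * p   # only the version from C fails
    return total
```

For example, if set A always fails entirely while B and C never fail, the majority of the three sampled versions is still always correct.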
D. Partridge and W. B. Yates
3.2 Diversity Measures. The diversity measures have been defined, justified, and fully explained elsewhere [e.g., Partridge (1994) and Yates and Partridge (1995)]. Here we shall simply provide the definitions with a minimum of explanation.

1. Within-set diversity, GD_X, the diversity within set X, or simply GD when there is only one set under consideration:

GD = 1 - \frac{p(2 \text{ both fail})}{p(1 \text{ fails})}

where p(2 both fail), the probability that two randomly selected versions from the set will both fail on a randomly selected input, is \sum_{n=2}^{N} (n/N)((n-1)/(N-1)) p_n, and p(1 fails), the probability that one randomly selected version will fail on a randomly selected input, is \sum_{n=1}^{N} (n/N) p_n. GD has a minimum value of 0 and a maximum value of 1.

2. Between-set diversity, GDB_{XY}, the diversity between version sets X and Y, or simply GDB when there are only two sets under consideration:

GDB_{AB} = 1 - \frac{p(1 \text{ fails in } A \ \& \ 1 \text{ fails in } B)}{\max[p(1 \text{ fails in } A), \, p(1 \text{ fails in } B)]}

where the single-failure probabilities are as above, max selects the greater of the two, and p(1 fails in A & 1 fails in B) = \sum_{n_A, n_B} (n_A/N_A)(n_B/N_B) p_{n_A n_B}, when set A contains N_A versions, set B contains N_B versions, and p_{n_A n_B} is the probability that exactly n_A versions in set A and exactly n_B versions in set B fail. GDB has a minimum value of 0 and a maximum value of 1.

3. A combined within-set and between-set diversity, GDBW_{AB}, or simply GDBW when only two sets are under consideration:

GDBW_{AB} = GDB_{AB} - \frac{1}{2}(GD_A + GD_B)

GDBW has a minimum value of -1 and a maximum of 1.
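The decision-strategy and diversity statistics of Section 3 can be computed directly from a failure distribution. The sketch below is a minimal Python rendering under assumed representations (a list p with p[n] = probability that exactly n versions fail, and a dict for the joint two-set distribution); the function names are ours.

```python
def p_maj(p, N):
    """Simple-majority statistic for N (odd) versions:
    sum of p[0..k], where k = N - majority number."""
    k = N - (N + 1) // 2
    return sum(p[n] for n in range(k + 1))

def p_maj3(p, N):
    """Probability that a majority of 3 versions drawn at random
    (without replacement) from the set are correct."""
    total = 0.0
    for n in range(N + 1):
        c = N - n                                  # correct versions
        all3 = c * (c - 1) * (c - 2)               # all three correct
        one_fails = 3 * n * c * (c - 1)            # exactly one of the 3 fails
        total += (all3 + one_fails) / (N * (N - 1) * (N - 2)) * p[n]
    return total

def gd(p, N):
    """Within-set diversity GD = 1 - p(2 both fail)/p(1 fails)."""
    p2 = sum((n / N) * ((n - 1) / (N - 1)) * p[n] for n in range(2, N + 1))
    p1 = sum((n / N) * p[n] for n in range(1, N + 1))
    return 1 - p2 / p1

def gdb(p_joint, NA, NB):
    """Between-set diversity GDB_AB, with p_joint[(na, nb)] = probability
    that exactly na versions fail in A and nb fail in B."""
    p_both = sum((na / NA) * (nb / NB) * p for (na, nb), p in p_joint.items())
    p_a = sum((na / NA) * p for (na, nb), p in p_joint.items())
    p_b = sum((nb / NB) * p for (na, nb), p in p_joint.items())
    return 1 - p_both / max(p_a, p_b)
```

For instance, with N = 5 and p = [0.8, 0.15, 0.03, 0.01, 0.005, 0.005], p_maj sums the first three entries, giving 0.98.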
4 Generating Version Space
Earlier studies (Partridge 1994; Yates and Partridge 1995) indicate that some important parameters of neural net training have the following ordering with respect to decreasing diversity-generating potential: net type > training set structure > training set elements > number of hidden units ≈ weight seed. This ordering suggests the following division of version space will yield maximum diversity between subspaces.

4.1 Neural Net Types. As a prime methodological variation we use two types of neural net, multilayer perceptrons (MLP) and radial basis function (RBF) nets, as the primary source of useful diversity.
4.1.1 MLPs. For the MLP versions we use 2-layer nets, trained using the backpropagation algorithm with momentum. These are "standard" MLPs (see Rumelhart et al. 1986) for which the initial weights are randomly initialized (in the interval [-0.5, 0.5]) and trained to completion (i.e., every pattern learned to a tolerance of 0.5) or for 20,000 epochs, whichever comes first. A full specification of these networks and the training regime can be found in Partridge and Yates (1994).

4.1.2 RBFs. The second type of neural net used is the RBF nets of Moody and Darken (1988). A full specification of our implementation can be found in Yates and Partridge (1995). An RBF net consists of a set Φ = {φ_1, ..., φ_m} of m radial basis functions (or RBFs), φ_i : R^n → R, where typically m > n. Each RBF receives n input signals from the external environment and is connected to a single output unit that computes a function f : R^n → R of the form

f(x) = \sum_{i=1}^{m} w_i φ_i(x)

where x ∈ R^n is an input pattern and w ∈ R^m are the weights from the RBFs to the output unit. The computational properties of the network are determined by the choice of radial basis functions. In this paper we shall employ the constant function 1 and a number of gaussian response functions. The functions are radially symmetric, that is, each function attains a single maximum value at its center, which drops off rapidly to zero at large radii. By varying the center c_i and radius r_i we may vary the position and shape of the function's receptive field.

4.2 Training Sets. According to our ordering we have training set structure followed by training set content to exploit.
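The RBF computation described in Section 4.1.2 can be sketched as follows; this is a hypothetical Python/NumPy rendering, with the gaussian form exp(-||x - c_i||²/r_i²) assumed from the description of centers and radii, and the bias w0 standing in for the constant basis function.

```python
import numpy as np

def rbf_output(x, centers, radii, w, w0=0.0):
    """f(x) = w0 + sum_i w_i * exp(-||x - c_i||^2 / r_i^2).
    centers has one row per RBF; radii and w are 1-D arrays of length m."""
    d2 = ((centers - x) ** 2).sum(axis=1)   # squared distances to the m centers
    phi = np.exp(-d2 / radii ** 2)          # gaussian responses, radially symmetric
    return w0 + w @ phi                     # weighted sum at the output unit
```

A pattern at an RBF's center produces the maximum response 1, so the output there is w0 plus that unit's weight.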
4.2.1 Structure Variation. Two training set structures were used: random (T) and rational (R). The random pattern generator produced input patterns entirely at random; these are the T training sets. The "rational" pattern generator produced input patterns that are close to (within 10% of) the true/false decision boundary in LIC1, but randomly within this constraint; these are the R training sets.
4.2.2 Content Variation. By using five different seeds in conjunction with the two generators we obtained 10 training sets (of 1000 patterns each): T1, T2, T3, T4, and T5, and R1, R2, R3, R4, and R5, for the random and rational training sets, respectively.

4.3 Hidden Units. We used five different numbers of hidden units, the H parameter:

H for MLP nets: the hidden unit numbers used are 8, 9, 10, 11, and 12
H for RBF nets: the hidden unit numbers used are 50, 60, 70, 80, and 90

4.4 Weight Initialization. The random initialization of weights is varied by using five different weight seeds with the random number generator, the W parameter:

W for both types of net: the different seeds will be denoted by W1, W2, W3, W4, and W5

These parameter labels, together with specific values, provide a succinct and unambiguous naming scheme for each point in net space. Thus T3-H9-W2 is the point in net space occupied by an MLP with 9 hidden units, initialized with weight seed 2, and trained on random training set 3. Given a choice of net type (MLP or RBF) and training set structure (T or R), together with 5 values on each of three dimensions (training set seed for a given generator, number of hidden units, and weight seed), a population of 500 versions in all was trained: the complete version space. As we will be using a three-subset architecture, it is convenient to divide this space into three methodologically diverse subspaces:

RMLP versions: rationally trained MLPs
TRBF versions: randomly trained RBF nets
TMLP and RRBF versions: randomly trained MLPs and rationally trained RBF nets
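The 2 × 2 × 5 × 5 × 5 structure of the version space and its naming scheme can be made concrete with a short sketch. The code below is hypothetical; it relies on the fact that the hidden-unit count alone distinguishes MLPs (H8-H12) from RBF nets (H50-H90), so net type need not appear in the name.

```python
from itertools import product

def version_space():
    """Enumerate the 500 version names, e.g. 'T3-H9-W2'."""
    names = []
    for net, struct in (("MLP", "T"), ("MLP", "R"), ("RBF", "T"), ("RBF", "R")):
        hidden = (8, 9, 10, 11, 12) if net == "MLP" else (50, 60, 70, 80, 90)
        for seed, h, w in product(range(1, 6), hidden, range(1, 6)):
            names.append(f"{struct}{seed}-H{h}-W{w}")
    return names
```

Each of the four subcubes (TMLP, RMLP, TRBF, RRBF) contributes 5 × 5 × 5 = 125 versions, for 500 in all.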
The complete space and the three subspaces from which the system versions will be chosen are illustrated in Figure 2.

5 Choosing from Version Space
Once we have a population of candidate versions, the problem is how to pick out a subset such that it constitutes an optimal multiversion system. We explored three techniques for choosing an optimal subset.
Figure 2: The 500-net version space.
5.1 Picking the Best. An obvious strategy, given that overall system performance is the ultimate goal, is to pick the "best" individual versions in conjunction with the basic idea of exploiting methodological diversity. The "best" individual versions are assessed by subjecting the trained versions to a large and representative set of previously unseen patterns; then the generalization performance of each version will give us an ordering from best to worst. A basic system may then be constructed from the 15 best nets out of the complete 500. This is the best15 system. For comparison purposes we can then divide the 15 best versions into three separate sets by using the choice-order strategy to spread the best performances evenly: put the three best versions in separate sets, and the next three best each in different sets (in reverse order), and the next three best in separate sets (reversing the order once more), until all 15 have been assigned to one of the 3 sets-this is the best.order system.
To examine whether process diversity does lead to product diversity, we can then choose the best five versions from each subspace to provide the three version sets in the system; this is the best.struc system. And for comparison purposes these 15 nets can be used as a single set, the best.struc15 system.

5.2 Picking According to a Heuristic. To further raise the importance of minimum coincident failure (i.e., emphasize diversity) rather than the best individual performance, a heuristic that picks a minimum-coincident-failure subset (on the basis of test results) was used. Many heuristics are reasonable as processes to minimize coincident failures with an acceptable efficiency. The particular heuristic used was:

    create a group containing the best network
    while this group is not big enough do
        for each network n remaining in the space
            score <- 0
            for each pattern p in the test set
                if more than half the networks in the group fail on pattern p
                   and network n is correct on pattern p
                then score <- score - 1
                if more than half the networks in the group fail on pattern p
                   and network n fails on pattern p
                then score <- score + 1
            end for
        end for
        remove network with lowest score from the space and add it to the group
    end while
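A possible rendering of this heuristic in Python is sketched below. The boolean failure matrix is an assumed representation, and since the pseudocode does not specify tie-breaking, the lowest-index network is taken on ties.

```python
import numpy as np

def pick_heuristic(fails, k):
    """Greedily grow a group of k versions.  fails[v, p] is True when
    version v fails on test pattern p.  At each step the version with the
    lowest score is added: -1 for each pattern where the group's majority
    fails but the version is correct, +1 where it fails as well."""
    remaining = list(range(fails.shape[0]))
    group = [int(np.argmin(fails.sum(axis=1)))]        # start with the best network
    remaining.remove(group[0])
    while len(group) < k:
        # patterns on which more than half the group's networks fail
        group_fails = fails[group].sum(axis=0) > len(group) / 2
        def score(v):
            rescues = int((~fails[v] & group_fails).sum())   # -1 each
            joins = int((fails[v] & group_fails).sum())      # +1 each
            return joins - rescues
        chosen = min(remaining, key=score)             # ties: lowest index
        group.append(chosen)
        remaining.remove(chosen)
    return group
```

The heuristic thus prefers versions that rescue the patterns on which the current group would be outvoted.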
As a single set the chosen 15 is the pick15 system. Division of this 15 into three sets of five each can use the choice-order strategy to give the pick.order system. Finally, still following the pattern of the previous technique, the heuristic can be applied separately to each of the three subspaces to give the three separate sets (of five each) required. This is the pick.struc system, and for comparison purposes these 15 versions can be treated as a single-set system, the pick.struc15 system.

5.3 Genetic Algorithms. A final approach to choosing an optimal subset is to use a genetic algorithm to search for high performance multiversion systems. A genetic algorithm consists of a finite pool of chromosomes (representing multiversion systems), a fitness function (one of the measures of performance described in Section 3), a number of stylized genetic operators, and a birth and death strategy. The chromosomes used in this paper were represented by fixed length binary strings. Each string encodes the components of the multiversion system, i.e., T3-H8-W1, R5-H70-W4, etc. The chromosome structure is shown in Figure 3.

Figure 3: The encoding of a chromosome. (Key: 0 to 4 bits 'ON' in the first 3 fields codes for one of the five values for each parameter; the last two bits code for binary choices.)

Initially, a pool of potential multiversion systems is randomly created and their fitnesses calculated. These fitness values are used to calculate the probability of choosing a particular chromosome in the pool: the fitter the chromosome, the more chance it has of being chosen. Chromosomes that are chosen are used as arguments to a randomly selected genetic operator. The genetic operators manipulate existing chromosomes at bit level to produce new chromosomes. In this paper we shall employ homologous crossover, mutation, and inversion (together with an order-free representation). The arguments of a genetic operator are called parents, while the resulting chromosomes are called children. The birth and death strategy dictates how chromosomes in the pool are replaced by child chromosomes. In this paper the birth and death strategy employed was one child (the fittest) per generation, replacing the weakest chromosome.

6 Related Research
There appears to be very little neural net research directly related to our approach. What we call the diversity of trained neural nets has been featured in a number of previous studies. What we have done is to provide a formal definition of this characteristic and systematically evaluate the effects of the major training ”parameters” (e.g., net type, training set structure) on the diversity of the trained nets obtained. Earlier papers report these results and survey the relevant related work (see Partridge 1994; Yates and Partridge 1995).
The multiversion idea has been primarily studied by software engineers, and in the context of only conventionally programmed versions. The application to neural computing is somewhat of a new departure, although not entirely new. Drucker et al. (1994) explore and present the results of "boosting and other ensemble techniques," in which they show the benefits of a specific three-version strategy when the training set size is sufficiently large. The basic "boosting" approach uses three networks that are trained sequentially, using the first-trained network to "filter" a training set for the second, and then the second and third similarly "filter" a training set for the third net. This approach is clearly rather different from ours and we propose to evaluate it, and eventually perhaps to apply it by having, say, one component version set (in a two-level system) organized as such a "boosted" ensemble. Pearlmutter and Rosenfeld (1991) survey the sparse literature on "replicated networks" in which the networks' outputs are averaged (as they are also in the "boosting" approach) to get the overall result. They cite such work on a speech task in which "the generalization performance of the composite network was significantly higher than that of any of its component networks." And they note that "replicated implementations programmed from identical specifications is a common technique in software engineering of highly reliable systems." They analyze replicated networks, in which a number of identical networks are independently trained on the same data and their results averaged. They conclude "that replication almost always results in a decrease in the expected complexity of the network, and that replication therefore increases expected generalization." They produced empirical results to support this conclusion. In our terminology, they exploited the diversity generated by weight-seed variation.

Lincoln and Skrzypek (1990) also perform some experiments to investigate the increase in fault tolerance that a set of versions produces compared with any one network. They write of the "'synergy' of clustering multiple back-prop nets," and again they use an "average" output of their clusters rather than a majority-vote strategy. Their network "clusters" again are based on weight-seed variation only. They show an increase of performance and fault tolerance exhibited by clusters in comparison to single networks.

Having filled the version space with 500 networks, each trained to convergence as specified earlier on 1000 training patterns either randomly or rationally constructed, each of the three techniques for choosing component versions was examined.
Table 1: Using the Best 15 Versions

  best15                        best.order
  Majority   98.48              majmaj    98.46
  maj3       98.27              maj3ABC   98.28
  Average    97.97              av.GDBW   0.014
  GD         0.530              av.GDB    0.540
Given that the system configurations to be used were based on 15 versions, either as a single version set, a "flat" 15 system, or as three separate sets each containing five versions, a two-level 3 x 5 system, the first task was to choose 15 versions from the space of 500. As a basis for choosing the system versions from version space, a "structured" set of 161,051 patterns was used: all combinations of the five input variables with values from 0.0 to 1.0 in steps of 0.1, i.e., 11 values per input variable, and thus 11^5 = 161,051 patterns in the structured set. System versions were picked out from the version space on the basis of their performances recorded on this structured set. Each system configuration was then evaluated on a further 10,000 random patterns, the "test" set. All subsequent data are based on test-set performances. Generalization and decision strategy performances are given as percent correct, GD and GDB values range from 0 to 1, and GDBW ranges from -1 to +1. The best individual net exhibited a 98.40% generalization on the test set.

7.1 Picking the Best. The results are summarized in Table 1. The three pairwise GDBW and GDB values are averaged to provide the entries "av.GDBW" and "av.GDB," respectively. The GD within the 15 best versions is not high because they were chosen on the basis of individual generalization performance, not diversity. Consequently the average generalization is high at 97.97%, but neither of the decision strategies can provide much improvement; the biggest improvement, approximately 0.5%, is achieved by taking the majority vote from all 15 versions. In addition, the two-level system offers no further improvement. This is to be expected because the more complex, two-level system was not constructed on the basis of any diversity judgment, merely highest-performance individual versions dispersed evenly over three separate sets. The low average GDBW value, 0.014, indicates that the between-set GD was much the same as within each individual set.
This situation suggests that there is no gain to be made by selecting versions from separate sets rather than from within one composite set, as the results demonstrate.

By picking the best from within subsets of the version space, which are demarcated by maximal diversity-generating processes (as described earlier), we expect to be able to exploit diversity between the separate version sets of a two-level system. The results are given in Table 2.

Table 2: Using the Best 15 from 3 Diverse Processes

  best.struc15                  best.struc
  Majority   98.85              majmaj    98.82
  maj3       98.43              maj3ABC   98.52
  Average    97.83              av.GDBW   0.192
  GD         0.653              av.GDB    0.727

All "majority" performances have been improved. This is because there is greater diversity in these systems even though the average performance of the individual versions has decreased, from 97.97 down to 97.83%. As the selection process has not changed, only the (sub)spaces from which the best individual versions have been picked, it must be these divisions of version space that account for the higher diversity values obtained. As these divisions were determined by process differences, we can conclude that diverse processes do lead to diverse products, which in turn lead to better system performance. Notice that if the chosen decision strategy is to evaluate only three versions (rather than all 15) and take the majority outcome, then the two-level system used with between-set selection is superior: a 98.52% performance compared with 98.43% for the flat 15 system composed of exactly the same individual versions. This is because there is substantially more diversity between sets (0.787, 0.687, and 0.708 between each pair of version sets) than within each set (0.507, 0.572, and 0.526). This disparity, which favors a two-level system when evaluating only three versions, is captured in the average GDBW value of 0.192, an order of magnitude larger than for the best.order system. However, picking the "best" individual versions is expected to be too simplistic; it does not fully exploit the important characteristic of diversity. The other two strategies for selecting version sets from the version space were designed to do just this.

7.2 The "Pick" Heuristic. We observe (in Table 3) a high GD value for the flat 15, as expected. The overall majority is yet a further improvement over the earlier results.
But this is not the case for the majority of three random versions, either within the flat 15 system or between the three separate sets of the two-level system. In both cases the performance of the "picked" systems is the worst so far observed. This divergence of performance characteristics can be explained as an outcome of high GD values coupled with low average individual performances (almost 2.5% worse than for the "best" systems). When all 15 versions contribute
Table 3: Picking Maximum Diversity

  pick15                        pick.order
  Majority   99.31              majmaj    99.21
  maj3       97.60              maj3ABC   97.41
  Average    95.44              av.GDBW   -0.0622
  GD         0.783              av.GDB    0.837

Table 4: Picking Maximum Diversity within Three Diverse Processes

  pick.struc15                  pick.struc
  Majority   99.25              majmaj    99.10
  maj3       98.41              maj3ABC   98.43
  Average    97.42              av.GDBW   0.0478
  GD         0.729              av.GDB    0.713
to the system performance (i.e., majority and majmaj results), the high GD value more than compensates for the low individual performers (the lowest is only 84.79% correct). But when system performance is based on only three versions, selected at random, the chances of selecting a low performance version outweigh the diversity boost to be obtained from only three versions. Notice also that the two-level system is clearly inferior, especially with respect to the "three-version majority" decision strategy. This is because within-set diversity is greater than between-set diversity, as the negative GDBW value clearly indicates. In fact, the individual within-set diversities in the two-level system, 0.844, 0.839, and 0.829, were the largest that were generated in this first set of experiments. And these are subsets that together, as pick15, exhibit considerably less diversity (GD = 0.783). We shall return to this point subsequently. The performances of pick.struc and pick.struc15 are summarized in Table 4. With these systems we observe that overall majorities have decreased but both "majority of three random" strategies show a substantial increase (approaching 1% better than for freely picked systems). This general trend is accounted for by the increase in average performance (up nearly 2% from pick15 to pick.struc15) coupled with the decrease in overall diversity. The overall majority performance is lower because the improvement in average performance does not quite compensate for the decrease in diversity, while the "majority of three random" strategies gain disproportionately from the increase in average performance. When only three versions are used (rather than 15) the decision strategy is naturally more
Table 5: Using a Genetic Algorithm

  (optimizing majority vote)    (optimizing GD)
  Majority   99.30              Majority  99.08
  maj3       98.37              maj3      97.63
  Average    87.52              Average   84.79
  GD         0.767              GD        0.793
sensitive to the existence of low performance individual versions. Notice also that the within-set and between-set diversities have reversed their relative positions, and average GDBW is once again a positive value. The effect of this change is to make the "majority of three random" strategy superior in the two-level system; in the pick15 and pick.order case the flat 15 system is superior.

7.3 Using a Genetic Algorithm. The final technique used for choosing component versions from within the version space is a genetic algorithm (GA). For comparison purposes the first experiment allowed the GA to choose 15 versions from anywhere within version space. As the ultimate goal is to produce high-performance systems and we have elected to use majority-vote decision strategies, it makes sense to have the GA optimize the majority-vote statistic. But as a potentially useful direct comparison, the GA was also used to maximize GD. The relevant results are summarized in Table 5. Broadly the results are as expected: the GA when optimizing majority vote returns 15 versions with a high majority-vote performance, comparable with the highest value we have yet seen; and when optimizing GD we get 15 versions with a high GD, the highest we have seen so far in a 15-version set, but a disappointingly low majority vote, seemingly pulled down by the low average performance of the individual versions chosen. To attempt to exploit the diversity that is to be found between certain subsets of the version space, the GA was used (optimizing majority vote) to choose five versions from the same three subspaces as used in the earlier experiments; this gave the ga.struc system and, when collapsed into a flat 15, we have the ga.struc15 system. The results of these two systems are summarized in Table 6.
Table 6: Using a Genetic Algorithm within Three Diverse Subspaces

  ga.struc15                    ga.struc
  Majority   99.14              majmaj    99.09
  maj3       98.44              maj3ABC   98.47
  Average    97.62              av.GDBW   0.0652
  GD         0.709              av.GDB    0.746

Overall, the result of using the GA restricted to diverse subspaces does not appear to be advantageous, although the benefit of forcing between-set diversity at the cost of a loss of overall diversity can be seen, as usual, in the superior performance of the "majority of three random" decision strategy when the three versions are each chosen from a different set.

A perceived weakness in the GA experiments that attempted to exploit the known diverse subspaces of the version space is that each application of the GA to a subspace takes no account of its choices within the other two subspaces. A last experiment with the GA attempted to remedy this deficiency. The GA was again set up to select five versions from each subspace, but to do so by optimizing the "majority of majorities" decision strategy. The results are given in Table 7.

Table 7: Using a Genetic Algorithm between Diverse Subspaces

  ga.majmaj
  majmaj     99.10
  maj3ABC    98.19
  av.GDBW    0.0912
  av.GDB     0.818

The results are not particularly promising. The "majority of majorities" performance is good, but not clearly better than that obtained with the GA operating in each subspace independently (i.e., the ga.struc system). And substantially better performances are exhibited by systems constructed using the pick heuristic.

7.4 Majority Performance and Diversity. The basic idea underlying the attempts to engineer optimal multiversion systems is that some majority decision strategy will produce an optimal performance provided that the component versions exhibit diversity. However, the results provided above indicate in several places that the relationship between majority performance and diversity is not simple. Choosing component versions for maximum diversity favors individually low performance networks, which can work against majority decision strategies, particularly decision strategies that use only a subset of the system versions, such as the majority of three random versions. But choosing component versions to maximize a majority decision strategy tends to produce a suboptimal result because the benefits of diversity become neglected.
Table 8: The Component Version Sets for the pick.order System

              Set A                       Set B                       Set C
  System      Aver   maj3   maj5   GD     Aver   maj3   maj5   GD     Aver   maj3   maj5   GD
  pick.order  95.50  98.27  98.71  0.844  95.81  98.40  98.82  0.839  95.01  97.97  98.77  0.829
The choice of basic system architecture, i.e., flat 15 or two-level 3 x 5, was arbitrary. The results therefore suggest further questions: can we do better using more, or fewer, resources, or can we do as well using significantly fewer resources? A full exploration of these issues of performance versus number (and organization) of versions must be analytically driven: the possible alternatives are too many, and too varied, to be satisfactorily explored by random probing. A crucial element of the analytical insight required is the relationship between majority performance and diversity. So some final experiments, triggered by "oddities" in the studies described above, were performed to shed some light on this key relationship. First, there is a question of resources (in the form of version-set size) and performance that becomes apparent when we analyzed the single example of a negative GDBW value found in the pick.order system. One of the three component sets had the highest GD value yet observed, 0.844, and the other two intraset GDs were similarly high. The performance details of these three component sets are given in Table 8, where "maj5" is the overall majority for each set as they contain only five versions. These results provide further support for the notion that a two-level system architecture is counterproductive when GDBW values are negative. Notice that a superior performance is obtained from the "majority of 3 random" decision strategy in each of the component sets separately than across these sets within the two-level system: the majority of three randomly selected within sets A, B, and C is 98.27, 98.40, and 97.97%, respectively, as against 97.41% for system pick.order, or 97.60% in system pick15, both in Table 3. So, for this particular decision strategy, we can get almost a 1% better performance from a third of the resources, because the system is badly organized, and the negative value of GDBW signals that this is the case.
And now to the issue of GD and majority variation with respect to version set size. The experiment involved using the pick heuristic to choose, from the complete version space, sets of versions from size two to 15 in steps of one. Summary performance data for the sets in this sequence are given in Table 9, where "maj9" is the majority of nine versions selected at random. From this sequence of results it can be seen that GD is indeed highest for the smallest sets and then seems to generally decrease as set size increases, although leveling out at about 0.785 after set size 10. The
Table 9: A Study of Set Size Impact

  Version set size   2      3      4      5      6      7      8      9      10     11     12     13     14     15
  GD                .785   .904   .884   .870   .851   .798   .808   .803   .804   .780   .785   .788   .786   .783
  maj                -    98.41    -    99.02    -    98.95    -    99.09    -    99.12    -    99.30    -    99.31
  maj3               -    98.41  98.47  98.49  98.48  97.10  97.52  97.67  97.84  97.16  97.35  97.50  97.59  97.60
  maj9               -      -      -      -      -      -      -    99.09  99.09  98.98  99.01  99.05  98.99  98.99
  av               97.99  93.59  94.75  95.34  95.75  94.46  94.95  95.25  95.52  94.80  95.03  95.21  95.37  95.44
Table 10: Performance Details of the pick5 System

  Version     Test failures   Test successes   Prob(fail)   Generalization (%)
  R3-H9-W4         200             9800          0.0200          98.00
  R1-H11-W5        201             9799          0.0201          97.99
  R1-H50-W3       1521             8479          0.1521          84.79
  R5-H10-W3        176             9824          0.0176          98.24
  T3-H9-W3         234             9766          0.0234          97.66

  Number of versions, n   Coincident test failures   Prob. exactly n versions fail
  0                               8140                       0.8140
  1                               1503                       0.1503
  2                                259                       0.0259
  3                                 84                       0.0084
  4                                 11                       0.0011
  5                                  3                       0.0003
average generalization performance of individual networks shows less of a marked trend: from a peak at set size two it becomes a minimum at set size three (because a low performance RBF network happened to be the third one chosen), and from then on it exhibits a 1% variation between 94.75 and 95.75%. The trends in the decision strategies are even harder to assess. Note the overall majority result for the even numbered sets has not been tabulated because it is not directly comparable with a majority outcome from odd-numbered sets. These latter sets will always give a majority outcome whereas the former will not. However, there is a clear early maximum at set size five for both ”majority of three random” and overall majority (even when the adjacent majorites are considered-majority for set size 4 is 97.40%, and for set size 6 is 98.27%). But then as set size increases so does overall majority performance; then optimum system configuration would seem to be a compromise between the number of versions to be evaluated and the performance required. This crude analysis suggests that set size five may be a good com-
888
D. Partridge and W. B. Yates
Table 11:
Perfor-
mance of the pick5
System Majority
99.02
ttznj3
98.49
Average
95.34
GD
0.870
promise between resources used and performance obtained. This is the pick5 system, and its performance is given, in detail, in Tables 10 and 11. With the pick5 system we have done very well with few resourcesonly five versions. Notice that although the "picking" was performed freely throughout the complete version space, within these five versions there is at least one version from three of the four subcubes (there is no representative from TRBF subcube). And notice that one of these, the RBF network Rl-H50-W3, generalizes to only 84.79%. So the majority performance in Table 11 is a majority of only five networks, not 15 as with all previous systems. The GD value of 0.870 is the largest so far in a tested system, and supports the view that larger GD values are easier to obtain in small version sets. Yet higher majority performances seem to be obtained from large version sets-there have been a number of majorities of 15 versions greater than that for pick5, but they were all from 15-version systems. If these two key characteristics of multiversion systems d o indeed vary in opposite directions, the optimal system may lie at the "crossover" point. From the second tabulation in Table 10-tabulation of the coincident failures observed, i.e., how many test patterns failed on precisely I I version nets-the performance of further decision strategies can be extracted. The last line states that all five versions in pick5 failed on just three of the test patterns, and the line above it states that precisely four versions failed on 11 test patterns. By summing the entries in the third column for these two lines, 0.0003+ 0.0011 = 0.0014, we determine that the probability that exactly 5 or exactly 4 versions will fail on a random test pattern is 0.14%,. Put the other way around: the probability that four or five versions in agreement is the correct answer is 99.86%. 
Set against this encouraging figure, however, is the fact that for any particular test the outcome may not give as many as four versions in agreement; there may be a 2:3 split, which can occur in two ways: two are wrong and three are correct, or vice versa. But the tabulation also gives the probability of this 2:3 split occurring. It is the probability that exactly two versions fail (0.0259) together with the probability that exactly three versions fail (0.0084). This split will occur then on 3.43% of occasions. Hence, the four-or-five-in-agreement strategy will yield an answer on 96.57% of occasions, and when it does, the answer it provides should be at least 99.86% reliable. In sum, using pick5 together with a four-or-five-in-agreement decision strategy, we have an approximately 99.9% system 97% of the time. And on the few occasions where four versions are not in agreement, there is a choice of accepting the lack of an answer or using the majority-vote strategy, which always delivers a result and averages out at 99.02% reliability. On most tests all five versions will agree (over 80% of occasions), in which case the answer will have 99.97% reliability. If the heuristic is allowed to choose one more version, a sixth network, it chooses a rationally trained MLP, R5-H9-W4, a version that generalizes to 97.83%. The six-version set has a lower GD of 0.851, but most significantly none of the 10,000 tests fail on all six versions. This means that within the accuracy of the testing regime (i.e., approximately 0.01%) whenever all six versions agree on an answer, which is 80% of occasions, the result is 100% correct. For direct comparison purposes the GA technique was used to choose just five versions freely from the complete version space. A summary of its choices is given in Table 12.

Table 12: Performance of the Five Nets Chosen by the GA Technique

Majority   1mj3    Average   GD
98.85      98.57   97.81     0.723

As can be seen the majority performance is not as good, despite this being the measure that guided the GA algorithm. The five nets chosen were all MLPs (four rationally trained and one randomly trained). The details of coincident failure are, however, quite similar to those of the pick5 system with respect to reliability when either four or five nets are in agreement. It is 99.81% for the five chosen by the GA, as against 99.86% for the pick5 system.

7.5 Generalization of Results. The comparable results for LIC4 mirror those for LIC1 at a lower level of generalization (90-93%). A similar space of 500 trained networks was generated and the same choice techniques were employed.
Heuristic picking and the GA techniques produced 15-version systems with high diversity levels and majority-vote performances up to 4% better than the best single version. However, similarly good 5- and 6-version systems proved harder to find. The scarcity of small highly diverse sets may be a result of the increased complexity of the function.
To make the OCR problem results comparable with the earlier work of Frey and Slate, we used 10,000 patterns for version-space generation, 6000 for choosing optimal subsets, and 4000 for final system test. By using both RBF networks and MLPs, and varying training set contents, weight initialization, and hidden unit numbers, we generated a 90-net version space for the problem. While we have not explored the possibilities in this space exhaustively, the initial results are promising. A large number of individual nets generalize at the 80% level. When our majority-vote decision strategy is applied to sets of 15 nets we obtain only a fraction of 1% improvement over the best network (typically 0.1%) and 1% improvement on the average. But this is a worst case result, because our decision strategy and diversity measures distinguish only success and failure (which is all that there is with Boolean functions, for all failures on a given input must be identical). In the OCR problem there are 25 distinct failures; this provides much more scope for diversity and consequent performance enhancement, e.g., if two versions of 15 get the correct answer and the other 13 generate different wrong answers, then a majority in agreement is correct. To fully exploit this new increased opportunity for error distribution we need the basic statistics to distinguish between identical and different failures and to use a "majority in agreement" decision strategy rather than a simple "majority of 15." First results with such a strategy improve an 81.0% majority-vote system (15 versions) to give an 84.8% result, and improve a 3-version system from 81.3 to 83.8%. Full exploitation of distinct failure diversity gives a 9-version system that tests at 93.5%.

8 Conclusions

Multiversion systems constructed from neural nets trained from varied initial conditions can yield substantially better performance than any individual network.
So there is a basis for multiversion system construction when highly reliable neural-net implementations are required. And dependent upon the constraints of particular applications (e.g., every use must yield a result, or a proportion of inconclusive outcomes is acceptable), it is possible to provide any desired level of reliability, even effectively 100%, to the extent that any testing regime can provide assurances of this limiting case. Notice that the empirical evaluation, while never being able to deliver an absolute 100% guarantee, does provide well-founded statistical estimates of reliability, which is the best that can be expected of any real system (as opposed to an abstraction that may be proved correct). Given that construction of multiversion systems is desirable, the question arises of how best to construct them, and in particular how do we engineer optimal systems? To begin to answer these questions we examined a variety of ways to construct two types of system: a single-level
Multiversion Neural-Net Systems
system, using 15 versions in a single set, and a two-level system, using three separate sets each containing five versions. A space of versions was defined in a way that capitalized on the known diversity-generating abilities of a number of training "parameters." The version space (of 500 individual versions) was then expected to exhibit maximally diverse subspaces. Three techniques for choosing 15 system versions were empirically examined, allowing both free choice throughout the whole space and choice restricted to each of the diverse subspaces. Of the three choice techniques ("choose the best," pick with a maximum diversity heuristic, and use a genetic algorithm), the latter two were superior to the first, with "heuristic picking" preferable. It was as good as or better than the genetic algorithm and it is simpler and quicker. However, there are many different ways in which the details of both of these latter techniques can be specified, and so our results should be taken as no more than a rough guide to the possibilities. One important consideration is that the "heuristic picking" was a specific customized algorithm for maximizing GD value. The GA technique is a much more general framework that can be used for a variety of choice techniques. In addition, it was not optimized for this particular application, and will "scale" better than the "pick" heuristic for larger version spaces. Two decision strategies were used: overall majority and majority of three randomly selected (with "equivalent" versions defined for the two-level systems). With only one type of diversity to exploit (i.e., minimum coincident failure diversity) whatever its source, e.g., different net type or weight seed, the overall majority strategy of a single-set system was always superior to use of the "equivalent" majority of majorities with a two-level system composed of precisely the same versions.
This is to be expected because the two-level organization means that at most only three networks from each set of five can contribute to the final outcome, whereas in the single-set system, any subset of five can contribute whatever is most advantageous (i.e., from zero to all five) to the final system outcome. However, when we consider selecting only three versions to determine the system outcome, the story is not so straightforward. In particular, if a two-level system is composed of maximally diverse subsystems (e.g., each subsystem is chosen from a different diverse subspace) then the two-level system architecture is superior to the single-set equivalent. Finally, a study across a sequence of set sizes was performed to relax the 15-version system size, and to explore several system features that were suggested by the basic experiments. Using majority-vote decision strategies, system performance appears to be based on a nonsimple interaction of version-set diversity (GD), interset diversity (GDB), and average performance of the individual versions. In addition, very high GD values appear to be more readily obtainable in sets with few versions, yet majority-vote performance tends to improve as more versions are included in the majority, i.e., improves with larger version sets. The study indicated that when a resource constraint is used (i.e., best performance with least versions) a set size of five was optimum. However, in the absence of this constraint, GD appears to level off and overall majority appears to keep edging up as set size increases. This suggests the not too surprising optimum strategy: use as many versions as possible while maintaining the GD level of the set. But because resources are always limited, a maximally diverse heuristically picked set of about five versions appears to be the optimum way to obtain a highly reliable system (always better than 99%, and better than 99.9% on 80% of occasions). However, if 100% reliability (to better than 0.01%) is required, then a properly constructed six-version set will supply this result on 80% of occasions. And the use of larger test sets (again subject to available resources) can be used to increase the certainty of a 100% result to any desired level. With respect to the choice techniques, several of the experiments seem to suggest that both heuristic picking and the genetic algorithm ought to be modified to stop choosing further versions when conflicting constraints are optimized, rather than just choosing a predetermined number of versions. Finally, the application to other problems not only supports the generality of the techniques we propose but reveals further significant scope for improvement when we distinguish between identical and different failures. In general, the more we can disperse errors differently the less adverse impact they will have on the system performance.
Acknowledgment
Support from the EPSRC/DTI Safety-Critical Systems Programme (grant no. GR/H85427) is gratefully acknowledged.
References

Drucker, H., Cortes, C., Jackel, L. D., LeCun, Y., and Vapnik, V. 1994. Boosting and other ensemble methods. Neural Comp. 6, 1289-1301.
Frey, P. W., and Slate, D. J. 1991. Letter recognition using Holland-style adaptive classifiers. Mach. Learn. 6, 161-182.
Knight, J. C., and Leveson, N. G. 1986. An experimental evaluation of the assumption of independence in multiversion programming. IEEE Trans. Software Eng. 12(1), 96-109.
Lincoln, W. P., and Skrzypek, J. 1990. Synergy of clustering multiple back propagation networks. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 650-657. Morgan Kaufmann, San Mateo, CA.
Littlewood, B., and Miller, D. R. 1989. Conceptual modeling of coincident failures in multiversion software. IEEE Trans. Software Eng. 15(12), 1596-1614.
Moody, J., and Darken, C. 1988. Learning with localized receptive fields. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. E. Hinton, and T. J. Sejnowski, eds., pp. 133-143. Morgan Kaufmann, San Mateo, CA.
Partridge, D. 1994. Network generalization differences quantified. Neural Networks (in press).
Partridge, D., and Yates, W. B. 1994. Replicability of Neural Computing Experiments. Res. Rep. R305, University of Exeter, [email protected].
Pearlmutter, B. A., and Rosenfeld, R. 1991. Chaitin-Kolmogorov complexity and generalization in neural networks. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, S. J. Hanson, and D. S. Touretzky, eds., pp. 925-931. Morgan Kaufmann, San Mateo, CA.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, D. E. Rumelhart and J. L. McClelland, eds., pp. 318-362. MIT Press, Cambridge, MA.
Yates, W. B., and Partridge, D. 1995. Use of methodological diversity to improve neural network generalisation. Neural Comp. Appl. (in press).
Received February 8, 1995; accepted October 3, 1995.
ARTICLE
Communicated by Peter Dayan
Biologically Plausible Error-Driven Learning Using Local Activation Differences: The Generalized Recirculation Algorithm Randall C. O’Reilly Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15213 USA
The error backpropagation learning algorithm (BP) is generally considered biologically implausible because it does not use locally available, activation-based variables. A version of BP that can be computed locally using bidirectional activation recirculation (Hinton and McClelland 1988) instead of backpropagated error derivatives is more biologically plausible. This paper presents a generalized version of the recirculation algorithm (GeneRec), which overcomes several limitations of the earlier algorithm by using a generic recurrent network with sigmoidal units that can learn arbitrary input/output mappings. However, the contrastive Hebbian learning algorithm (CHL, also known as DBM or mean field learning) also uses local variables to perform error-driven learning in a sigmoidal recurrent network. CHL was derived in a stochastic framework (the Boltzmann machine), but has been extended to the deterministic case in various ways, all of which rely on problematic approximations and assumptions, leading some to conclude that it is fundamentally flawed. This paper shows that CHL can be derived instead from within the BP framework via the GeneRec algorithm. CHL is a symmetry-preserving version of GeneRec that uses a simple approximation to the midpoint or second-order accurate Runge-Kutta method of numerical integration, which explains the generally faster learning speed of CHL compared to BP. Thus, all known fully general error-driven learning algorithms that use local activation-based variables in deterministic networks can be considered variations of the GeneRec algorithm (and indirectly, of the backpropagation algorithm). GeneRec therefore provides a promising framework for thinking about how the brain might perform error-driven learning. To further this goal, an explicit biological mechanism is proposed that would be capable of implementing GeneRec-style learning.
This mechanism is consistent with available evidence regarding synaptic modification in neurons in the neocortex and hippocampus, and makes further predictions. Neural Computation 8, 895-938 (1996) © 1996 Massachusetts Institute of Technology
1 Introduction
A long-standing objection to the error backpropagation learning algorithm (BP) (Rumelhart et al. 1986a) is that it is biologically implausible (Crick 1989; Zipser and Andersen 1988), principally because it requires error propagation to occur through a mechanism different from activation propagation. This makes the learning appear nonlocal, since the error terms are not locally available as a result of the propagation of activation through the network. Several remedies for this problem have been suggested, but none is fully satisfactory. One approach involves the use of an additional "error network" whose job is to send the error signals to the original network via an activation-based mechanism (Zipser and Rumelhart 1990; Tesauro 1990), but this merely replaces one kind of nonlocality with another, activation-based kind of nonlocality (and adds the problem of maintaining two sets of weights). Another approach uses a global reinforcement signal instead of specific error signals (Mazzoni et al. 1991), but this is not as powerful as standard backpropagation. The approach proposed by Hinton and McClelland (1988) is to use bidirectional activation recirculation within a single, recurrently connected network (with symmetric weights) to convey error signals. To get this to work, they used a somewhat unwieldy four-stage activation update process that works only for autoencoder networks. This paper presents a generalized version of the recirculation algorithm (GeneRec), which overcomes the limitations of the earlier algorithm by using a generic recurrent network with sigmoidal units that can learn arbitrary input/output mappings.
The GeneRec algorithm shows how a general form of error backpropagation, which computes essentially the same error derivatives as the Almeida-Pineda (AP) algorithm (Almeida 1987; Pineda 1987a,b, 1988) for recurrent networks under certain conditions, can be performed in a biologically plausible fashion using only locally available activation variables. GeneRec uses recurrent activation flow to communicate error signals from the output layer to the hidden layers via symmetric weights. This weight symmetry is an important condition for computing the correct error derivatives. However, the "catch-22" is that GeneRec does not itself preserve the symmetry of the weights, and when it is modified so that it does, it no longer follows the same learning trajectory as AP, even though it is computing essentially the same error gradient. The empirical relationship between the derivatives computed by GeneRec and AP backpropagation is explored in simulations reported in this paper. The GeneRec algorithm has much in common with the contrastive Hebbian learning algorithm [CHL, also known as the mean field or deterministic Boltzmann machine (DBM) learning algorithm], which also uses locally available activation variables to perform error-driven learning in recurrently connected networks. This algorithm was derived originally for stochastic networks whose activation states can be described by the
Boltzmann distribution (Ackley et al. 1985). In this context, CHL amounts to reducing the distance between two probability distributions that arise in two phases of settling in the network. This algorithm has been extended to the deterministic case through the use of approximations or restricted cases of the probabilistic one (Hinton 1989; Peterson and Anderson 1987), and derived without the use of the Boltzmann distribution by using the continuous Hopfield energy function (Movellan 1990). However, all of these derivations require problematic assumptions or approximations, leading some to conclude that CHL is fundamentally flawed for deterministic networks (Galland 1993; Galland and Hinton 1990). It is shown in this paper that the CHL algorithm can be derived instead as a variant of the GeneRec algorithm, which establishes a more general formal relationship between the BP framework and the deterministic CHL rule than previous attempts (Peterson 1991; Movellan 1990; LeCun and Denker 1991). This relationship means that all known fully general error-driven learning algorithms that use local activation-based variables in deterministic networks can be considered variations of the GeneRec algorithm (and thus, indirectly, of the backpropagation algorithm as well). An important feature of the GeneRec-based derivation of CHL is that the relationship between the learning properties of BP, GeneRec, and CHL can be more clearly understood. Another feature of this derivation is that it is completely general with respect to the activation function used, allowing CHL-like learning rules to be derived for many different cases. CHL is equivalent to GeneRec when using a simple approximation to a second-order accurate numerical integration technique known as the midpoint or second-order Runge-Kutta method, plus an additional symmetry preservation constraint. 
The implementation of the midpoint method in GeneRec simply amounts to the incorporation of the sending unit’s plus-phase activation state into the error derivative, and, as such, it amounts to an on-line (per pattern) integration technique. This method results in faster learning by reducing the amount of interference due to independently computed weight changes for a given pattern. This would explain why CHL networks generally learn faster than otherwise equivalent BP networks (e.g., Peterson and Hartman 1989; Movellan 1990). A comparison of optimal learning speeds for all variants of GeneRec and feedforward and AP recurrent backprop on four different problems is reported in this paper. The results of this comparison are consistent with the derived relationship between GeneRec and AP backpropagation, and with the interpretation of CHL as a symmetric, midpoint version of GeneRec, and thus provide empirical support for these theoretical claims. The finding that CHL did not perform well at all in networks with multiple hidden layers (”deep” networks), reported by Galland (1993), would appear to be problematic for the claim that CHL is performing a fully general form of backpropagation, which can learn in deep networks. However, I was unable to replicate the Galland (1993) failure to learn the
"family trees" problem (Hinton 1986) using CHL. In simulations reported below, I show that by simply increasing the number of hidden units (from 12 to 18), CHL networks can learn the problem with a 100% success rate, in a number of epochs on the same order as backpropagation. Thus, the existing simulation evidence seems to support the idea that CHL is performing a form of backpropagation, and not that it is a fundamentally flawed approximation to the Boltzmann machine as has been argued (Galland 1993; Galland and Hinton 1990). Given that the GeneRec family of algorithms encompasses all known ways of performing error-driven learning using locally available activation variables, it provides a promising framework for thinking about how error-driven learning might be implemented in the brain. To further this goal, an explicit biological mechanism capable of implementing GeneRec-style learning is proposed. This mechanism is consistent with available evidence regarding synaptic modification in neurons in the neocortex and hippocampus, and makes further predictions.
2 Introduction to Algorithms and Notation
In addition to the original recirculation algorithm (Hinton and McClelland 1988), the derivation of GeneRec depends on ideas from several standard learning algorithms, including feedforward error backpropagation (BP) (Rumelhart et al. 1986a) with the cross-entropy error term (Hinton 1989a), the Almeida-Pineda (AP) algorithm for error backpropagation in a recurrent network (Almeida 1987; Pineda 1987a,b, 1988), and the contrastive Hebbian learning algorithm (CHL) used in the Boltzmann machine and deterministic variants (Ackley et al. 1985; Hinton 1989b; Peterson and Anderson 1987). The notation and equations for these algorithms are summarized in this section, followed by a brief overview of the recirculation algorithm in the next section. This provides the basis for the development of the generalized recirculation algorithm presented in subsequent sections.

2.1 Feedforward Error Backpropagation. The notation for a three-layer feedforward backpropagation network uses the symbols shown in Table 1. The target values are labeled t_k for output unit k, and the patternwise sum is dropped since the subsequent derivations do not depend on it. The cross-entropy error formulation (Hinton 1989a) is used because it eliminates an activation derivative term in the learning rule which is also not present in the recirculation algorithm. The cross-entropy error is defined as
E = \sum_k t_k \log o_k + (1 - t_k) \log(1 - o_k)    (2.1)
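A useful property of this error term, noted in the derivation that follows, is that the sigmoid derivative cancels when the error is differentiated with respect to the net input; a small self-contained numerical check (not code from the paper; names are our own):

```python
import math

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def cross_entropy(t, o):
    # E = sum_k [ t_k log o_k + (1 - t_k) log(1 - o_k) ]
    return sum(tk * math.log(ok) + (1 - tk) * math.log(1 - ok)
               for tk, ok in zip(t, o))

# dE/do_k = t_k/o_k - (1 - t_k)/(1 - o_k); multiplying by the sigmoid
# derivative o_k(1 - o_k) cancels the denominator, leaving t_k - o_k.
t, eta = [1.0, 0.0], [0.3, -0.7]
eps = 1e-6
for k in range(len(t)):
    def E_at(delta, k=k):
        etas = [e + (delta if i == k else 0.0) for i, e in enumerate(eta)]
        return cross_entropy(t, [sigmoid(e) for e in etas])
    numeric = (E_at(eps) - E_at(-eps)) / (2 * eps)   # central difference
    analytic = t[k] - sigmoid(eta[k])
    assert abs(numeric - analytic) < 1e-6
print("dE/d(eta_k) = t_k - o_k confirmed numerically")
```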
Table 1: Variables in a Three-Layer Backpropagation Network.(a)

Layer (index)   Net input                         Activation
Input (i)       -                                 s_i = stimulus input
Hidden (j)      \eta_j = \sum_i w_{ij} s_i        h_j = \sigma(\eta_j)
Output (k)      \eta_k = \sum_j w_{jk} h_j        o_k = \sigma(\eta_k)

(a) \sigma(\eta) is the standard sigmoidal activation function, \sigma(\eta) = 1/(1 + e^{-\eta}).
and the derivative of E with respect to a weight into an output unit is
\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial \eta_k} \frac{\partial \eta_k}{\partial w_{jk}} = (t_k - o_k) h_j    (2.2)

where \sigma'(\eta_k) is the first derivative of the sigmoidal activation function with respect to the net input, o_k(1 - o_k), which is canceled by the denominator of the error term. To train the weights into the hidden units, the impact a given hidden unit has on the error term needs to be determined:

\frac{\partial E}{\partial h_j} = \sum_k \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial \eta_k} \frac{\partial \eta_k}{\partial h_j} = \sum_k (t_k - o_k) w_{jk}    (2.3)

which can then be used to take the derivative of the error function with respect to the input-to-hidden weights:

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial h_j} \frac{\partial h_j}{\partial \eta_j} \frac{\partial \eta_j}{\partial w_{ij}}    (2.4)
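These derivatives translate directly into a weight update procedure; a minimal sketch for a three-layer net (not code from the paper; the function name and learning rate are our own):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bp_step(s, t, w_ih, w_ho, lrate=0.5):
    """One cross-entropy backprop step. s: input activations, t: targets,
    w_ih[i][j]: input-to-hidden weights, w_ho[j][k]: hidden-to-output
    weights. Updates weights in place (gradient ascent on E as defined
    above) and returns the output activations."""
    n_h, n_o = len(w_ih[0]), len(t)
    # Forward pass, as in Table 1.
    h = [sigmoid(sum(s[i] * w_ih[i][j] for i in range(len(s))))
         for j in range(n_h)]
    o = [sigmoid(sum(h[j] * w_ho[j][k] for j in range(n_h)))
         for k in range(n_o)]
    delta_o = [t[k] - o[k] for k in range(n_o)]                # from eq 2.2
    dE_dh = [sum(delta_o[k] * w_ho[j][k] for k in range(n_o))  # eq 2.3
             for j in range(n_h)]
    for j in range(n_h):
        for k in range(n_o):
            w_ho[j][k] += lrate * delta_o[k] * h[j]            # eq 2.2
        for i in range(len(s)):
            w_ih[i][j] += lrate * dE_dh[j] * h[j] * (1 - h[j]) * s[i]  # eq 2.4
    return o

# Repeated presentations of one pattern drive the output toward its target.
w_ih = [[0.1, -0.2], [0.3, 0.05]]
w_ho = [[0.2], [-0.1]]
outs = [bp_step([1.0, 0.5], [1.0], w_ih, w_ho)[0] for _ in range(50)]
print(outs[0] < outs[-1])   # True: output moved toward the target of 1.0
```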
This provides the basis for adapting the weights.

2.2 Almeida-Pineda Recurrent Backpropagation. The AP version of backpropagation is essentially the same as the feedforward one described above except that it allows the network to have recurrent (bidirectional) connectivity. Thus, the network is trained to settle into a stable activation state with the output units in the target state, based on a given input pattern clamped over the input units. This is the version of BP that the GeneRec algorithm, which also uses recurrent connectivity, approximates most closely. The same basic notation and cross-entropy error term
as described for feedforward BP are used to describe the AP algorithm, except that the net input terms (\eta) can now include input from any other unit in the network, not only those in lower layers. The activation states in AP are updated according to a discrete-time approximation of the following differential equation, which is integrated over time with respect to the net input terms:^1

\frac{d\eta_j}{dt} = \sum_i w_{ij} a_i - \eta_j    (2.5)

where a_i = \sigma(\eta_i) is the activation of unit i.
This equation can be iteratively applied until the network settles into a stable equilibrium state (i.e., until the change in activation state goes below a small threshold value), which it will provably do if the weights are symmetric (Hopfield 1984), and often even if they are not (Galland and Hinton 1991). In the same way that the activations are iteratively updated to allow for recurrent activation dynamics, the error propagation in the AP algorithm is also performed iteratively. The iterative error propagation in AP operates on a new variable y_j, which represents the current estimate of the derivative of the error with respect to the net input to the unit, \partial E / \partial \eta_j. This variable is updated according to

\frac{dy_j}{dt} = -y_j + \sigma'(\eta_j) \sum_k w_{jk} y_k + I_j    (2.6)
where I_j is the externally "injected" error for output units with target activations. This equation is iteratively applied until the y_j variables settle into an equilibrium state (i.e., until the change in y_j falls below a small threshold value). The weights are then adjusted as in feedforward BP (2.4), with y_j providing the \partial E / \partial \eta_j term.

2.3 Contrastive Hebbian Learning. The contrastive Hebbian learning algorithm (CHL) used in the stochastic Boltzmann machine and deterministic variants is based on the differences between activation states in two different phases. As in the AP algorithm, the connectivity is recurrent, and activation states (in the deterministic version) can be computed according to equation 2.5. As will be discussed below, the use of locally computable activation differences instead of the nonlocal error backpropagation used in the BP and AP algorithms is more biologically plausible. The GeneRec algorithm, from which the CHL algorithm can be derived as a special case, uses the same notion of activation phases. The two phases of activation states used in CHL are the plus phase states, which result from both input and target being presented to the network, and provide a training signal when compared to the minus phase

^1 It is also possible to incrementally update the activations instead of the net inputs, but this limits the ability of units to change their state rapidly, since the largest activation value is 1, while net inputs are not bounded.
Table 2: Equilibrium Network Variables.(a)

Layer        Phase   Net input                                             Activation
Hidden (j)   minus   \eta_j^- = \sum_i w_{ij} s_i + \sum_k w_{kj} o_k^-    h_j^- = \sigma(\eta_j^-)
Hidden (j)   plus    \eta_j^+ = \sum_i w_{ij} s_i + \sum_k w_{kj} o_k^+    h_j^+ = \sigma(\eta_j^+)
Output (k)   minus   \eta_k^- = \sum_j w_{jk} h_j^-                        o_k^- = \sigma(\eta_k^-)
Output (k)   plus    -                                                     o_k^+ = t_k

(a) Equilibrium network variables in a three-layer network having reciprocal connectivity between the hidden and output layers with symmetric weights (w_{jk} = w_{kj}), and phases over the output units such that the target is clamped in the plus phase, and not in the minus phase. \sigma(\eta) is the standard sigmoidal activation function.
activations, which result from just the input pattern being presented. The equilibrium network variables (i.e., the states after the iterative updating procedure of equation 2.5 has been applied) for each phase in such a system are labeled as in Table 2. The CHL learning rule for deterministic recurrent networks can be expressed in terms of generic activation states a (which can be from any layer in the network) as follows:
\frac{1}{\epsilon} \Delta w_{ij} = (a_i^+ a_j^+) - (a_i^- a_j^-)    (2.7)
where a_i is the sending unit and a_j is the receiving unit. Thus, CHL is simply the difference between the pre- and postsynaptic activation coproducts in the plus and minus phases. Each coproduct term is equivalent to the derivative of the energy or "harmony" function for the network with respect to the weight:
\frac{\partial E}{\partial w_{ij}} = -a_i a_j    (2.8)
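In code, equation 2.7 is a one-line coproduct difference per weight; a sketch (the function name and learning rate epsilon are our own):

```python
def chl_update(w, a_minus, a_plus, eps=0.1):
    """Contrastive Hebbian update for a recurrent weight matrix:
    delta w_ij = eps * (a_i+ a_j+ - a_i- a_j-), where a_minus and
    a_plus are the equilibrium activations from the minus (input only)
    and plus (input and target) phases. Updates w in place."""
    n = len(w)
    for i in range(n):
        for j in range(n):
            w[i][j] += eps * (a_plus[i] * a_plus[j] - a_minus[i] * a_minus[j])
    return w

# When the two phases agree, the network already produces the target
# and the update vanishes; any phase difference nudges the weights.
w = [[0.0, 0.0], [0.0, 0.0]]
chl_update(w, a_minus=[0.8, 0.3], a_plus=[0.8, 0.3])
print(w == [[0.0, 0.0], [0.0, 0.0]])   # True: no learning when phases match
```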
The intuitive interpretation of this rule is that it decreases the energy or increases the harmony of the plus-phase state and vice versa for the minus-phase state (see Ackley et al. 1985; Peterson and Anderson 1987; Hinton 1989b; Movellan 1990, for details).

3 The Recirculation Algorithm
The original Hinton and McClelland (1988) recirculation algorithm is based on the feedforward BP algorithm. The GeneRec algorithm, which is more closely related to the AP recurrent version of backpropagation, borrows two key insights from the recirculation algorithm. These insights provide a means of overcoming the main problem with the standard backprop formulation from a neural plausibility standpoint, which
Randall C. O'Reilly
is the manner in which a hidden unit computes its own error contribution. This is shown in equation 2.3. The problem is that the hidden unit is required to access the computed quantity (\partial E/\partial o_k), which depends on variables at the output unit only. This is the crux of the nonlocality of error information in backpropagation. The first key insight that can be extracted from the recirculation algorithm is that equation 2.3 can be expressed as the difference between two terms, each of which looks much like a net input to the hidden unit:

\sigma'(\eta_j) \sum_k w_{kj} \delta_k = \sigma'(\eta_j) \left[ \sum_k w_{kj} t_k - \sum_k w_{kj} o_k \right]   (3.1)
Thus, instead of having a separate error-backpropagation phase to communicate error signals, one can think in terms of standard activation propagation occurring via reciprocal (and symmetric) weights that come from the output units to the hidden units. The error contribution for the hidden unit can then be expressed in terms of the difference between two net-input terms. One net-input term is just that which would be received when the output units had the target activations t_k clamped on them, and the other is that which would be received when the outputs have their feedforward activation values o_k. To take advantage of a net-input based error signal, Hinton and McClelland (1988) used an autoencoder framework, with two pools of units: visible and hidden. The visible units play the role of both the input layer and the output layer. Each training pattern, which is its own target, is presented to the visible units, which then project to a set of hidden units, which then feed back to the same visible units. The input from the hidden units changes the state of the visible units, and this new state is then fed through the system again, hence the name recirculation (see Fig. 1). As a result, the visible units have two activation states, equivalent to t_k (or s_i) and o_k, and the hidden units have two activation states, the first of which is a function of \eta_j^+ = \sum_k t_k w_{kj}, which can be labeled h_j^+, and the second of which corresponds to h_j^-. The second key insight from the recirculation algorithm is that instead of computing a difference in net inputs as in equation 3.1, one can use a difference of activations via the following approximation to equation 2.4:
\frac{1}{\epsilon}\Delta w_{kj} \approx t_k (h_j^+ - h_j^-)   (3.2)
The difference between activation values instead of the net inputs in equation 3.2 can be used since Hinton and McClelland (1988) imposed an additional constraint that the difference between the reconstructed and the target visible-unit states (and therefore the difference between \eta_j^+ and \eta_j^-) be kept small by using a "regression" function in updating the visible units. This function assigns the output state (computed at time T = 2) as a weighted average of the target output and the activation computed from the current net input from the hidden units:²

o_k = \lambda t_k + (1 - \lambda) f(\eta_k)   (3.3)

where \lambda is the regression parameter.

Figure 1: The recirculation algorithm, as proposed by Hinton and McClelland (1988). Activation is propagated in four steps (T = 0-3). T = 0 is the target pattern clamped on the visible units. T = 1 is the hidden units computing their activation as a function of the target inputs. T = 2 is the visible units computing their activations as a function of the hidden unit state at T = 1. T = 3 is the hidden unit state computed as a function of the reconstructed visible unit pattern.
Thus, the difference in a hidden unit's activation values is approximately equivalent to the difference in net inputs times the slope of the activation function at one of the net-input values [\sigma'(\eta_j^-)], as long as the linear approximation of the activation function given by the slope is reasonably valid. Even if this is not the case, as long as the activation function is monotonic, the error in this approximation will not affect the sign of the

²Note that Hinton and McClelland (1988) used linear output units to avoid the activation function derivative on the output units, whereas cross-entropy is being used here to avoid this derivative. Thus, the function f(\eta) in equation 3.3 will either be linear or a sigmoid, depending on which assumptions are being used.
resulting error derivative, only the magnitude. Nevertheless, errors in magnitude can lead to errors in sign over the pattern-wise sum. Since the difference of activations in equation 3.2 computes the derivative of the activation function implicitly, one can use the resulting learning rule for any reasonable (monotonic) activation function.³ This can be important for cases where the derivative of the activation function is difficult to compute. Further, the activation variable might be easier to map onto the biological neuron, and it avoids the need for the neuron to compute its activation derivative. Note that due to the above simplification, the learning rule for the recirculation algorithm (based on equation 3.2) is the same for both hidden and output units, and is essentially the delta rule. This means that locally available activation states of the pre- and postsynaptic units can be used to perform error-driven learning, which avoids the need for a biologically troublesome error backpropagation mechanism that is different from the normal propagation of activation through the network.

4 Phase-Based Learning and Generalized Recirculation
While the recirculation algorithm is more biologically plausible than standard feedforward error backpropagation, it is limited to learning autoencoder problems. Further, the recirculation activation propagation sequence requires a detailed level of control over the flow of activation through the network and its interaction with learning. However, the critical insights about computing error signals using differences in net input (or activation) terms can be applied to the more general case of a standard three-layer network for learning input to output mappings. This section presents such a generalized recirculation or GeneRec algorithm, which uses standard recurrent activation dynamics (as used in the AP and CHL algorithms) to communicate error signals instead of the recirculation technique. Instead of using the four stages of activations used in recirculation, GeneRec uses two activation phases as in the CHL algorithm described above. Thus, in terms of activation states, GeneRec is identical to the deterministic CHL algorithm, and the same notation is used to describe it. The learning rule for GeneRec is simply the application of the two key insights from the recirculation algorithm to the AP recurrent backpropagation algorithm instead of the feedforward BP algorithm, which was the basis of the recirculation algorithm. If the recurrent connections between the hidden and output units are ignored so that the error on the output layer is held constant, it is easy to show that the fixed point solution to the iterative AP error updating equation 2.6 (i.e., where dy_j/dt = 0 for hidden unit error y_j) is of the same form as feedforward backpropagation.³ This means that the same recirculation trick of computing the error signal as a difference of net input (or activation) terms can be used:

y_j^\infty = \sigma'(\eta_j) \sum_k w_{kj} y_k^\infty \approx \sigma'(\eta_j) \left[ \sum_k w_{kj} t_k - \sum_k w_{kj} o_k \right]   (4.1)

³Note that the output units have to use a sigmoidal function in order for the cross-entropy function to cancel out the derivative.
Thus, assuming constant error on the output layer, the equilibrium error gradient computed for hidden units in AP is equivalent to the difference between the GeneRec equilibrium net input states in the plus and minus phases. Note that the minus phase activations in GeneRec are identical to the AP activation states. The difference of activation states can be substituted for net input differences times the derivative of the activation function by the approximation introduced in recirculation, resulting in the following equilibrium unit error gradient:
y_j^\infty \approx h_j^+ - h_j^-   (4.2)

Note that while the hidden unit states in GeneRec also reflect the constant net input from the input layer (in addition to the output-layer activations that communicate the necessary error gradient information), this cancels out in the difference computation of equation 4.2. However, this constant input to the hidden units from the input layer in both phases can play the role of the regression update equation 3.3 in recirculation. To the extent that this input is reasonably large and it biases the hidden units toward one end of the sigmoid or the other, this bias will tend to moderate the differences between h_j^- and h_j^+, making their difference a reasonable approximation to the differences of their respective net inputs times the slope of the activation function. While the analysis presented above is useful for seeing how GeneRec equilibrium activation states could approximate the equilibrium error gradient computed by AP, the AP algorithm actually performs iterative updating of the error variable (y_j). Thus, it would have to be the case that iterative updating of this single variable is equivalent to the iterative updating of each of the activation states (plus and minus) in GeneRec, and then taking the difference. This relationship can be expressed by writing GeneRec in the AP notation. First, we define two components to the error variable y_j, which are effectively the same as the GeneRec plus and minus phase net inputs (ignoring the net input from the input units, which is subtracted out in the end anyway):

\frac{dy_j^+}{dt} = -y_j^+ + \sum_k w_{kj} t_k   (4.3)

\frac{dy_j^-}{dt} = -y_j^- + \sum_k w_{kj} o_k   (4.4)
906
Randall C. O'Reilly
Then, we approximate the fixed point of y_j with the difference of the fixed points of these two variables:

y_j^\infty \approx y_j^{+\infty} - y_j^{-\infty}   (4.5)

which can be approximated by the subtraction of the GeneRec equilibrium activation states as discussed previously. Unfortunately, the validity of this approximation is not guaranteed by any proof that this author is aware of. However, there are several properties of the equations that lend some credibility to it. First, the part that is a function of t_k on the output units, y_j^+, is effectively a constant, and the other part, y_j^-, is just the activation updating procedure that both GeneRec and AP have in common. Further, given that the fixed point solutions of the GeneRec and AP equations are the same when recurrent influences are ignored, and the pattern of recurrent influences is given by the same set of weights, it is likely that the additional effects of recurrence will be in the same direction for both GeneRec and AP. However, short of a proof, these arguments require substantiation from simulation results comparing the differences between the error derivatives computed by GeneRec and AP. The results presented later in the paper confirm that GeneRec computes essentially the same error derivatives as AP in a three-layer network (as long as the weights are symmetric). This approximation deteriorates only slightly in networks with multiple hidden layers, where the effects of recurrence are considerably greater. As in the recirculation algorithm, it is important for the above approximation that the weights into the hidden units from the output units have the same values as the corresponding weights that the output units receive from the hidden units. This is the familiar symmetric weight constraint, which is also necessary to prove that a network will settle into a stable equilibrium (Hopfield 1984). We will revisit this constraint several times during the paper. However, for the time being, we will assume that the weights are symmetric.
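The approximation of a settled difference by a difference of settled components is exact when the settling dynamics are linear in their driving input, by superposition. The toy check below illustrates this with simple first-order dynamics dy/dt = -y + drive (an assumed form, not the paper's exact equation 2.6); the drive values are arbitrary stand-ins for the target- and output-derived net inputs.

```python
def settle(drive, dt=0.2, tol=1e-6):
    """Settle first-order dynamics dy/dt = -y + drive to its fixed point."""
    y = 0.0
    while True:
        dy = drive - y
        y += dt * dy
        if abs(dy) < tol:
            return y

t_drive = 0.9   # plus-phase drive (target-derived net input)
o_drive = 0.4   # minus-phase drive (output-derived net input)

y_full  = settle(t_drive - o_drive)  # settle the single error variable
y_plus  = settle(t_drive)            # settle the two components separately
y_minus = settle(o_drive)

print(abs(y_full - (y_plus - y_minus)) < 1e-5)  # True for linear dynamics
```

With the recurrent, nonlinear influences of a real network included, the equality becomes the approximation discussed in the text rather than an identity.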
Finally, virtually all deterministic recurrent networks including GeneRec suffer from the problem that changes in weights based on gradient information computed on equilibrium activations might not result in the network settling into an activation state with lower error the next time around. This is due to the fact that small weight changes can affect the settling trajectory in unpredictable ways, resulting in an entirely different equilibrium activation state than the one settled into last time. While it is important to keep in mind the possibility of discontinuities in the progression of activation states over learning, there is some basis for optimism on this issue. For example, in his justification of the deterministic version of the Boltzmann machine (DBM), Hinton (1989b) supplies several arguments (which are substantiated by a number of empirical findings) justifying the assumption that small weight changes will generally lead to a contiguous equilibrium state of unit activities in a recurrent network.
To summarize, the learning rule for GeneRec that computes the error backpropagation gradient locally via recurrent activation propagation is the same as that for recirculation, having the form of the delta rule. It can be stated as follows in terms of a sending unit with activation a_i and a receiving unit with activation a_j:

\frac{1}{\epsilon}\Delta w_{ij} = a_i^- (a_j^+ - a_j^-)   (4.6)
As shown above, this learning rule will compute the same error derivatives as the AP recurrent backpropagation procedure under the following conditions:

- Iterative updating of the error term (y_j) can be approximated by the separate iterative updating of the two activation terms (h_j^+ and h_j^-) and then taking their difference.
- The reciprocal weights are symmetric (w_{jk} = w_{kj}).
- Differences in net inputs times the activation function derivative can be approximated by differences in the activation values.
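The GeneRec rule itself (equation 4.6) is a one-line outer product once the two phases of activations are available from some settling process. A minimal sketch, with illustrative activation values (the settling that would produce them is assumed):

```python
import numpy as np

def generec_update(a_minus_pre, a_plus_post, a_minus_post, eps=0.1):
    """GeneRec delta-rule update (equation 4.6):
    dW[i, j] = eps * a_i^- * (a_j^+ - a_j^-)
    for sending units i (rows) and receiving units j (columns)."""
    return eps * np.outer(a_minus_pre, a_plus_post - a_minus_post)

h_minus = np.array([0.3, 0.8])    # minus-phase sending (hidden) activations
o_plus  = np.array([0.95, 0.05])  # plus-phase receiving (output) activations
o_minus = np.array([0.6, 0.4])    # minus-phase receiving activations

dW = generec_update(h_minus, o_plus, o_minus)
# weights into the first output unit strengthen (its plus phase exceeds its
# minus phase); weights into the second output unit weaken
```

Note that only locally available pre- and postsynaptic activation states enter the update, which is the point of the rule.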
5 The Relationship between GeneRec and CHL
The GeneRec learning rule given by equation 4.6 and the CHL learning rule given by equation 2.7 are both simple expressions that involve a difference between plus- and minus-phase activations. This raises the possibility that they could somehow be related to each other. Indeed, as described below, there are two different ways in which GeneRec can be modified that, when combined, yield equation 2.7. The GeneRec learning rule can be divided into two parts, one of which represents the derivative of the error with respect to the unit (the difference of that unit's plus and minus activations, a_j^+ - a_j^-), and the other which represents the contribution of a particular weight to this error term (the sending unit's activation, a_i^-). It is the phase of this latter term that is the source of the first modification. In standard feedforward or AP recurrent backpropagation, there is only one activation term associated with each unit, which is equivalent to the minus-phase activation in the GeneRec phase-based framework. Thus, the contribution of the sending unit is naturally evaluated with respect to this activation, and that is why a_i^- appears in the GeneRec learning rule. However, given that GeneRec has another activation term corresponding to the plus-phase state of the units, one might wonder if the derivative of the weight should be evaluated with respect to this activation instead. After all, the plus-phase activation value will likely be a more accurate reflection of the eventual contribution of a given unit after other weights in the network are updated. In some sense, this value anticipates the weight changes that will lead to having the correct target values activated, and learning based on it might avoid some interference.
On the other hand, the minus-phase activation reflects the actual contribution of the sending unit to the current error signal, and it seems reasonable that credit assignment should be based on it. Given that there are arguments in favor of both phases, one approach would be to simply use the average of both of them. Doing this results in the following weight update rule:

\frac{1}{\epsilon}\Delta w_{ij} = \frac{1}{2}(a_i^+ + a_i^-)(a_j^+ - a_j^-)   (5.1)
This is the first way in which GeneRec needs to be modified to make it equivalent to CHL. As will be discussed in detail below, this modification of GeneRec corresponds to a simple approximation of the midpoint or second-order accurate Runge-Kutta integration technique. The consequences of the midpoint method for learning speed will be explored in the simulations reported below. The second way in which GeneRec needs to be modified concerns the issue of symmetric weights. For GeneRec to compute the error gradient via reciprocal weights, these weights need to have the same value (or at least the same relative magnitudes and signs) as the forward-going weights. However, the basic GeneRec learning rule (equation 4.6) does not preserve this symmetry. While simulations reported below indicate that GeneRec can learn and settle into stable attractors without explicitly preserving the weight symmetry, a symmetry-preserving version of GeneRec would guarantee that the computed error derivatives are always correct. One straightforward way of ensuring weight symmetry is simply to use the average (or more simply, the sum) of the weight changes that would have been computed for each of the reciprocal weights separately, and apply this same change to both weights. Thus, the symmetric GeneRec learning rule is

\frac{1}{\epsilon}\Delta w_{ij} = a_i^-(a_j^+ - a_j^-) + a_j^-(a_i^+ - a_i^-) = (a_i^- a_j^+ + a_i^+ a_j^-) - 2 a_i^- a_j^-   (5.3)
Note that using this rule will not result in the weights being updated in the same way as AP backpropagation, even though the error derivatives computed on the hidden units will still be the same. Thus, even the symmetry-preserving version of GeneRec is not identical to AP backpropagation. This issue will be explored in the simulations reported below. If both the midpoint method and symmetry preservation versions of GeneRec are combined, the result is the CHL algorithm:

\frac{1}{\epsilon}\Delta w_{ij} = \frac{1}{2}\left[(a_i^+ + a_i^-)(a_j^+ - a_j^-) + (a_j^+ + a_j^-)(a_i^+ - a_i^-)\right] = (a_i^+ a_j^+) - (a_i^- a_j^-)   (5.4)
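The algebraic identity behind this derivation is easy to check numerically: combining the midpoint modification with the symmetry-preserving modification collapses exactly to the CHL coproduct difference, for any activation values. A quick sketch with arbitrary random phase activations:

```python
import numpy as np

rng = np.random.default_rng(2)
ai_p, ai_m, aj_p, aj_m = rng.uniform(0, 1, 4)  # arbitrary phase activations

# Midpoint + symmetry-preserving GeneRec (left-hand side of equation 5.4):
generec_mid_sym = 0.5 * ((ai_p + ai_m) * (aj_p - aj_m)
                         + (aj_p + aj_m) * (ai_p - ai_m))

# Contrastive Hebbian learning (equation 2.7):
chl = ai_p * aj_p - ai_m * aj_m

print(np.isclose(generec_mid_sym, chl))  # True: the two rules are identical
```

Expanding the left-hand side by hand shows why: the cross terms cancel, leaving exactly a_i^+ a_j^+ - a_i^- a_j^-.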
Note that LeCun and Denker (1991) pointed out that CHL is related to a symmetric version of the delta rule (i.e., GeneRec), but they did not incorporate the midpoint method, and thus were only able to show an approximation that ignored this aspect of the relationship between CHL and GeneRec. The above derivation of CHL is interesting for several reasons. First, it is based on error backpropagation (via GeneRec), and not some kind of approximation to a stochastic system. This eliminates the problems associated with considering the graded activations of units in a deterministic system to be expected values of some underlying probability distribution. For example, to compute the probability of a given activation state, one needs to assume that the units are statistically independent (see Hinton 1989b; Peterson and Anderson 1987). While Movellan (1990) showed that CHL can be derived independent of the Boltzmann distribution and the concomitant mean-field assumptions, his derivation does not apply when there are hidden units in the network. Further, the consequences of the relationship between CHL as derived from GeneRec and standard error backpropagation (i.e., that CHL uses the faster midpoint integration method and imposes a symmetry-preservation constraint) should be apparent in the relative learning properties of these algorithms. Thus, this derivation might explain why CHL networks tend to learn faster than equivalent backprop networks. Finally, another advantage of a derivation based on the BP framework is that it is sufficiently general as to allow CHL-like learning rules to be derived for a variety of different activation functions or other network properties.

6 The Midpoint Method and the GeneRec Approximation to It
As was mentioned above, the use of the average of both the minus and plus phase activations of the sending unit in the GeneRec learning rule corresponds to an approximation of a simple numerical integration technique for differential equations known as the midpoint or second-order accurate Runge-Kutta method (Press et al. 1988). The midpoint method attains second-order accuracy without the explicit computation of second derivatives by evaluating the first derivatives twice and combining the results so as to minimize the integration error. It can be illustrated with the following simple differential equation:

\frac{dy}{dt} = f(y)   (6.1)

The simplest way in which the value of the variable y can be integrated is by using a difference equation approximation to the continuous differential equation:

y_{t+1} = y_t + \epsilon f(y_t) + O(\epsilon^2)   (6.2)
Figure 2: The midpoint method. At each point, a trial step is taken along the derivative at that point, and the derivative is recomputed at the midpoint between the point and the trial step. This derivative is then used to take the actual step to the next point, and so on.
with a step size of \epsilon, and an accuracy to first order in the Taylor series expansion of f(y_t) (and thus an error term of order \epsilon^2). This integration technique is known as the forward Euler method, and is commonly used in neural network gradient descent algorithms such as BP. By comparison with equation 6.2, the midpoint method takes a "trial" step using the forward Euler method, resulting in an estimate of the next function value (denoted by the asterisk):

y_{t+1}^* = y_t + \epsilon f(y_t)   (6.3)
This estimate is then used to compute the actual step, which uses the derivative computed at a point halfway between the current y_t and estimated y_{t+1}^* values (see Fig. 2):

y_{t+1} = y_t + \epsilon f\left(\frac{y_t + y_{t+1}^*}{2}\right)   (6.4)
In terms of a Taylor series expansion of the function f(y_t) at the point y_t, evaluating the derivative at the midpoint as in equation 6.4 cancels out
the first-order error term [O(\epsilon^2)], resulting in a method with second-order accuracy (Press et al. 1988). Intuitively, the midpoint method is able to "anticipate" the curvature of the gradient, and avoid going off too far in the wrong direction. There are a number of ways the midpoint method could be applied to error backpropagation. Perhaps the most "correct" way of doing it would be to run an entire batch of training patterns to compute the trial step weight derivative, and then run another batch of patterns with the weights halfway between their current and the trial step values to get the actual weight changes to be made. However, this would require roughly twice the number of computations per weight update as standard batch-mode backpropagation, and the two passes of batch-mode learning is not particularly biologically plausible. The GeneRec version of the midpoint method as given by equation 5.1 is an approximation to the "correct" version in two respects:

1. The plus-phase activation value is used as an on-line estimate of the activations that would result from a forward Euler step over the weights. This estimate has the advantages of being available without any additional computation, and it can be used with on-line weight updating, solving both of the major problems with the "correct" version. The relationship between the plus-phase activation and a forward Euler step along the error gradient makes sense given that the plus-phase activation is the "target" state, which is therefore in the direction of reducing the error. Appendix A gives a more formal analysis of this relationship. This analysis shows that the plus-phase activation, which does not depend on the learning rate parameter, typically overestimates the size of a trial Euler step. Thus, the use of the plus-phase activation means that the precise midpoint is not actually being computed. Nevertheless, the anticipatory function of this method is still served when the trial step is exaggerated.
Indeed, the simulation results described below indicate that it can actually be advantageous in certain tasks to take a larger trial step.

2. The midpoint method is applied only to the portion of the derivative that distributes the unit's error term among its incoming weights, and not to the computation of the error term itself. Thus, the error term (a_j^+ - a_j^-) from equation 5.1 is the same as in the standard forward Euler integration method, and only the sending activations are evaluated at the midpoint between the current and the trial step: \frac{1}{2}(a_i^- + a_i^+). This selective application of the midpoint method is particularly efficient for the case of on-line backpropagation because a midpoint value of the unit error term, especially on a single-pattern basis, will typically be smaller than the original error value, since the trial step is in the direction of reducing error.
Thus, using the midpoint error value would actually slow learning by reducing the effective learning rate.
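The accuracy advantage of the midpoint method (equations 6.2-6.4) is easy to demonstrate on a toy equation with a known solution. The sketch below integrates dy/dt = -y from y(0) = 1 to t = 1 (exact answer e^{-1}) with both methods; the test equation and step size are illustrative choices.

```python
import math

def f(y):
    return -y  # test equation dy/dt = -y, exact solution y(t) = e^(-t)

def euler(y, eps):
    """Forward Euler step (equation 6.2)."""
    return y + eps * f(y)

def midpoint(y, eps):
    """Midpoint step: trial Euler step (equation 6.3), then the actual step
    using the derivative at the halfway point (equation 6.4)."""
    y_trial = y + eps * f(y)
    return y + eps * f(0.5 * (y + y_trial))

eps, T = 0.1, 1.0
y_e = y_m = 1.0
for _ in range(int(T / eps)):
    y_e = euler(y_e, eps)
    y_m = midpoint(y_m, eps)

exact = math.exp(-T)
print(abs(y_e - exact) > abs(y_m - exact))  # True: midpoint is more accurate
```

With these settings the Euler error is roughly 0.02 while the midpoint error is well under 0.001, reflecting the extra order of accuracy at the cost of one additional derivative evaluation per step.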
To summarize, the advantage of using the approximate midpoint method represented by equation 5.1 is that it is so simple to compute, and it appears to reliably speed up learning while still using on-line learning. While other more sophisticated integration techniques have been developed for batch-mode BP (see Battiti 1992 for a review), they typically require considerable additional computation per step, and are not very biologically plausible.

6.1 The GeneRec Approximate Midpoint Method in Backpropagation. To validate the idea that CHL is equivalent to GeneRec using the approximation to the midpoint method described above, this approximation can be implemented in a standard backpropagation network and the relative learning speed advantage of this method compared for the two different algorithms. If similar kinds of speedups are found in both GeneRec and backpropagation, this would support the derivation of CHL as given in this paper. Such comparisons are described in the following simulation section. There are two versions of the GeneRec approximate midpoint method that are relevant to consider. One is a weight-based method that computes the sending unit's trial step activation (h_i^*) based on the trial step weights, and the other is a simpler approximation that uses the unit's error derivative to estimate the trial step activation. In both cases, the resulting trial step activation state h_i^* is averaged with the current activation value h_i to obtain a midpoint activation value, which is used as the sending activation state for backpropagation weight updating:

\bar{h}_i = \frac{1}{2}(h_i + h_i^*)   (6.5)
This corresponds to the GeneRec version of the midpoint method as given in equation 5.1. Note that in a three-layer network, only the hidden-to-output weights are affected, since there is no trial step activation value for the input units. In the BP weight-based midpoint approximation, the trial step activation is computed as if the weights had been updated by the trial step weight error derivatives as follows:

h_i^* = \sigma(\eta_i^*), \qquad \eta_i^* = \sum_j a_j \left(w_{ij} - \epsilon_{ts} \frac{\partial E}{\partial w_{ij}}\right)   (6.6)
where \epsilon_{ts} is a learning-rate-like constant that determines the size of the trial step taken. Note that, in keeping with the GeneRec approximation, the actual learning rate \epsilon is not included in this equation. Thus, depending on the relative sizes of \epsilon_{ts} and \epsilon, the estimated trial step activation
given by equation 6.6 can overestimate the size of an actual trial step activation. In order to evaluate the effects of this overestimation, a range of \epsilon_{ts} values are explored. The BP unit-error-based method uses the fact that each weight will be changed in proportion to the derivative of the error with respect to the unit's net input, \partial E/\partial \eta_i, to avoid the additional traversal of the weights:

h_i^* = \sigma\left(\eta_i - \epsilon_{ts} N \frac{\partial E}{\partial \eta_i}\right)   (6.7)

where N is the number of receiving weights (fan-in). In this case, the trial step size parameter \epsilon_{ts} also reflects the average activity level over the input layer, since each input weight would actually be changed by an amount proportional to the activity of the sending unit. The comments regarding \epsilon_{ts} above also apply to this case. Note that it is the unit-error-based version of the midpoint method that most closely corresponds to the version used in CHL, since both are based on the error derivative with respect to the unit, not the weights into the unit. As the simulations reported below indicate, the midpoint method can speed up on-line BP learning by nearly a factor of two. Further, the unit-error-based version is quite simple and requires little extra computation to implement. Finally, while the unit-error-based version could be applied directly to Almeida-Pineda backpropagation, the same is not true for the weight-based version, which would require an additional activation settling based on the trial step weights. Thus, to compare these two ways of implementing the approximate midpoint method, the results presented below are for feedforward backpropagation networks.

7 Simulation Experiments

The first set of simulations reported in this section is a comparison of the learning speed between several varieties of GeneRec (including symmetric, midpoint, and CHL) and BP with and without the midpoint integration method. This gives a general sense of the comparative learning properties of the different algorithms, and provides empirical evidence in support of the predicted relationships among the algorithms investigated. In the second set of simulations, a detailed comparison of the weight derivatives computed by the Almeida-Pineda version of backpropagation and GeneRec is performed, showing that they both compute the same error derivatives under certain conditions.

7.1 Learning Speed Comparisons.
While it is notoriously difficult to perform useful comparisons between different learning algorithms, such comparisons could provide some empirical evidence necessary for evaluating the theoretical claims made above in the derivation of the
GeneRec algorithm and its relationship to CHL. Note that the intent of this comparison is not to promote the use of one algorithm over another, which would require a much broader sample of commonly used speedup techniques for backpropagation. The derivation of GeneRec based on AP backpropagation and its relationship with CHL via the approximate midpoint method makes specific predictions about which algorithms will learn faster and more reliably than others, and, to the extent that the following empirical results are consistent with these predictions, this provides support for the above analysis. In particular, it is predicted that GeneRec will be able to solve difficult problems in roughly the same order of epochs as the AP algorithm, and that weight symmetry will play an important role in the ability of GeneRec to solve problems. Further, it is predicted that the midpoint versions of both GeneRec and backpropagation will learn faster than the standard versions. Overall, the results are consistent with these predictions. It is apparent that GeneRec networks can learn difficult tasks, and further that the midpoint integration method appears to speed up learning in both GeneRec and backpropagation networks. This is consistent with the idea that CHL is equivalent to GeneRec using this midpoint method. Finally, adding the symmetry preservation constraint to GeneRec generally increases the number of networks that solve the task, except in the case of the 4-2-4 encoder for reasons that are explained below. This is consistent with the idea that symmetry is important for computing the correct error derivatives. Four different simulation tasks were studied: XOR (with 2 hidden units), a 4-2-4 encoder, the "shifter" task (Galland and Hinton 1991), and the "family trees" task of Hinton (1986) (with 18 units per hidden and encoding layer). All networks used 0-to-1 valued sigmoidal units.
The backpropagation networks used the cross-entropy error function, with an "error tolerance" of 0.05, so that if the output activation was within 0.05 of the target, the unit had no error. In the GeneRec networks, activation values were bounded between 0.05 and 0.95. In both the GeneRec and AP backpropagation networks, initial activation values were set to 0, and a step size (dt) of 0.2 was used to update the activations. Settling was stopped when the maximum change in activation (before multiplying by dt) was less than 0.01. Fifty networks with random initial weights (symmetric for the GeneRec networks) were run for XOR and the 4-2-4 encoder, and 10 for the shifter and family trees problems. The training criterion for XOR and the 4-2-4 encoder was 0.1 total sum-of-squares error, and the criterion for the shifter and family trees problems was that all units had to be on the right side of 0.5 for all patterns. Networks were stopped after 5000 epochs if they had not yet solved the XOR, 4-2-4 encoder, and shifter problems, and 1000 epochs for family trees. A simple one-dimensional grid search was performed over the learning rate parameter to determine the fastest average learning speed for a given algorithm on a given problem. For XOR, the 4-2-4 encoder, and
Generalized Recirculation Algorithm
915
Table 3: Relationship of the Algorithms Tested^a

                              Euler                Midpoint
  Err Method   FF vs. Rec     NonSym    Sym       NonSym    Sym
  BP           FF             BP        -         BP Mid    -
  BP           Rec            AP        -         -         -
  Act Diff     Rec            GR        GR Sym    GR Mid    CHL

^a Relationship of the algorithms tested with respect to the use of local activations vs. explicit backpropagation (Act Diff, BP) to compute error derivatives, feedforward vs. recurrent (FF, Rec), forward Euler vs. the midpoint method (Euler, Midpoint), and weight symmetrization (NonSym, Sym). GR is GeneRec.
the shifter tasks, the grid was in increments of no less than 0.05, while a grid of 0.01 was used for the family trees problem. No momentum or any other modifications to generic backpropagation were used, and weights were updated after every pattern, with patterns presented in a randomly permuted order every epoch. The results presented below are from the fastest networks for which 50% or more reached the training criterion. This restriction matters only for the XOR problem, since the algorithms did not typically get stuck on the other problems. The algorithms compared were as follows (see Table 3):
BP: Standard feedforward error backpropagation using the cross-entropy error function.

AP: Almeida-Pineda backpropagation in a recurrent network using the cross-entropy error function.

BP Mid Wt: Feedforward error backpropagation with the weight-based version of the approximate midpoint method (equation 6.6). Several different values of the trial step size parameter were used to determine the effects of overestimating the trial step, as is the case with GeneRec. The values were 1, 5, 10, and 25 for XOR and the 4-2-4 encoder; 0.5, 1, and 2 for the shifter problem; and 0.05, 0.1, 0.2, and 0.5 for the family trees problem. The large trial step sizes resulted in faster learning in small networks, but progressively smaller step sizes were necessary for the larger problems.

BP Mid Un: Feedforward error backpropagation with the unit-error based version of the midpoint integration method (equation 6.7). The same trial step size parameters as in BP Mid Wt were used.

GR: The basic GeneRec algorithm (equation 4.6).

GR Sym: GeneRec with the symmetry preservation constraint (equation 5.3).

GR Mid: GeneRec with the approximate midpoint method (equation 5.1).
Randall C. O'Reilly
916
Table 4: Results for the XOR Problem^a

Algorithm        ε       N      Epcs    SEM
BP               1.95    37     305     58
AP               1.40    35     164     23
BP Mid Wt 1      1.85    39     268     59
BP Mid Wt 5      0.25    25     326     79
BP Mid Wt 10     0.25    34     218     25
BP Mid Wt 25     0.35    27     215     40
BP Mid Un 1      1.40    40     222     28
BP Mid Un 5      1.05    34     138     38
BP Mid Un 10     0.40    26     222     10
BP Mid Un 25     0.30    31     178     37
GR               0.20    9^b    3795    267
GR Sym           0.60    31     334     7.1
GR Mid           1.75    33     97      4.6
CHL              1.80    28     59      1.8

^a ε is the optimal learning rate, N is the number of networks that successfully solved the problem (out of 50, minimum of 25), Epcs is the mean number of epochs required to reach criterion, and SEM is the standard error of this mean. Algorithms are as described in the text.
^b Note that this was the best performance for the GR networks.
CHL: GeneRec with both the symmetry and approximate midpoint methods, which is equivalent to CHL (equation 2.7).

7.2 XOR and the 4-2-4 Encoder. The results for the XOR problem are shown in Table 4, and those for the 4-2-4 encoder are shown in Table 5. These results are largely consistent with the predictions made above, with the exception of an apparent interaction between the 4-2-4 encoder problem and the use of weight symmetrization in GeneRec. Thus, it is apparent that the plain GeneRec algorithm is not very successful or fast, and that weight symmetrization is necessary to improve the success rate (in the XOR task) and the learning speed (in the 4-2-4 encoder). As will be shown in more detail below, the symmetrization constraint is essential for computing the correct error derivatives in GeneRec. However, the symmetry constraint also effectively limits the range of weight space that can be searched by the learning algorithm (only symmetric weight configurations can be learned), which might affect its
Table 5: Results for the 4-2-4 Encoder Problem^a

Algorithm        ε       N     Epcs   SEM
BP               2.40    50    60     5.1
AP               2.80    50    54     3.6
BP Mid Wt 1      1.70    50    60     4.3
BP Mid Wt 5      1.65    50    48     2.8
BP Mid Wt 10     2.35    50    45     3.6
BP Mid Wt 25     2.25    50    37     3.0
BP Mid Un 1      2.20    50    54     4.2
BP Mid Un 5      2.10    50    42     2.5
BP Mid Un 10     2.10    50    40     2.9
BP Mid Un 25     1.95    50    34     1.8
GR               0.60    45    418    28
GR Sym           1.40    28    88     2.9
GR Mid           2.40    46    60     3.4
CHL              1.20    28    77     1.8

^a ε is the optimal learning rate, N is the number of networks that successfully solved the problem (out of 50, minimum of 25), Epcs is the mean number of epochs required to reach criterion, and SEM is the standard error of this mean. Algorithms are as described in the text.
ability to get out of bad initial weight configurations. This effect may be compounded in an encoder problem, where the input-to-hidden weights also have a tendency to become symmetric with the hidden-to-output weights. Thus, while the symmetry constraint is important for being able to compute the correct error derivatives, it also introduces an additional constraint which can impair learning, sometimes dramatically (as in the case of the 4-2-4 encoder). Note that on larger and more complicated tasks like the shifter and family trees described below, the advantages of computing the correct derivatives begin to outweigh the disadvantages of the additional symmetry constraint. The other main prediction from the analysis is that the approximate midpoint method will result in faster learning, both in BP and GeneRec. This appears to be the case, where the speedup relative to regular backpropagation was nearly 2-fold for the unit-error based version with a trial step size of 25. The general advantage of the unit-error over the weight-based midpoint method in BP is interesting considering that this corresponds to the GeneRec version of the midpoint method. The speedup in GeneRec for both the CHL vs. GR Sym and GR Mid vs. GR comparisons
was substantial in general. Further, it is interesting that the approximate midpoint method alone (without the symmetrization constraint) can enable the GeneRec algorithm to successfully solve problems. Indeed, on both of these tasks, GR Mid performed better than GR Sym. This might be attributable to the ability of the midpoint method to compute better weight derivatives that are less affected by the inaccuracies introduced by the lack of weight symmetry. However, note that while this seems to hold for all of the three-layer networks studied, it breaks down in the family trees task, which requires error derivatives to be passed back through multiple hidden layers. Also, only on the 4-2-4 encoder did GR Mid perform better than CHL, indicating that there is generally an advantage to having the correct error derivatives via weight symmetrization in addition to using the midpoint method. An additional finding is that there appears to be an advantage for the use of a recurrent network over a feedforward one, based on a comparison of the AP vs. BP results. This can be explained by the fact that small weight changes in a recurrent network can lead to more dramatic activation state differences than in a feedforward network. In effect, the recurrent network has to do less work to achieve a given set of activation states than does a feedforward network. This advantage for recurrency, which should be present in GeneRec, is probably partially offset by the additional weight symmetry constraint. Further, recurrency appears to become a liability in networks with multiple hidden layers, based on the family trees results presented below.

7.3 The Shifter Task. The shifter problem is a larger task than XOR and the 4-2-4 encoder, and thus might provide a more realistic barometer of performance on typical tasks.^4 The version of the shifter problem used here had two 4-bit input patterns, one of which was a shifted version of the other.
There were three values of shift: -1, 0, and 1, corresponding to one bit to the left, the same, and one bit to the right (with wraparound). Of the 16 possible binary patterns on 4 bits, 4 were unsuitable because they result in the same pattern when shifted right or left (1111, 1010, 0101, and 0000). Thus, there were 36 training patterns (the 12 remaining patterns shifted in each of 3 directions). The task was to classify the shift direction by activating one of 3 output units. While larger versions of this task (more levels of shift, more bits in the input) were explored, this configuration proved the most difficult (in terms of epochs) for a standard BP network to solve. The results, shown in Table 6, provide clearer support for the predicted relationships than the two previous tasks. In particular, the midpoint-based speedup is comparable between the BP and GeneRec cases, and the role of symmetry in GeneRec is unambiguously important for

^4 Note that other common tasks like digit recognition or other classification tasks were found to be so easily solved by a standard BP network (under 10 epochs) that they did not provide a useful dynamic range to make the desired comparisons.
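The construction of this training set can be sketched as follows (a minimal enumeration consistent with the description above; the function names are hypothetical, not taken from the original simulations):

```python
def shifter_patterns():
    """Enumerate the shifter training set: each suitable 4-bit pattern is
    paired with itself shifted left, not at all, and right (wraparound)."""
    def shift(bits, d):
        # d = +1 shifts one bit to the right, d = -1 one bit to the left
        return tuple(bits[(i - d) % 4] for i in range(4))

    patterns = []
    for p in range(16):
        bits = tuple((p >> i) & 1 for i in range(4))
        # exclude the 4 patterns that look the same when shifted either
        # way (1111, 1010, 0101, 0000)
        if shift(bits, -1) == shift(bits, 1):
            continue
        for d in (-1, 0, 1):
            patterns.append((bits, shift(bits, d), d))
    return patterns
```

Enumerating this way yields the 12 suitable base patterns and the 36 training pairs described in the text.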
Table 6: Results for the Shifter Task^a

Algorithm         ε       N     Epcs    SEM
BP                1.25    10    76.2    6.4
AP                1.35    10    56.8    4.2
BP Mid Wt 0.5     0.40    10    63.6    4.8
BP Mid Wt 1       0.45    10    42.5    2.9
BP Mid Wt 2       0.35    10    47.0    3.5
BP Mid Un 0.5     0.30    10    48.0    1.7
BP Mid Un 1       0.35    10    41.2    3.3
BP Mid Un 2       0.15    10    51.8    3.8
GR                0.10    1     1650    -
GR Sym            0.90    10    105     5.0
GR Mid            0.65    10    84.2    13.4
CHL               0.70    10    42.7    2.2

^a ε is the optimal learning rate, N is the number of networks that successfully solved the problem (out of 10), Epcs is the mean number of epochs required to reach criterion, and SEM is the standard error of this mean. Algorithms are as described in the text.
solving the task, as is evident from the almost complete failure of the nonsymmetric version to learn the problem. However, it is interesting that even in this more complicated problem the use of the approximate midpoint method without the additional symmetrizing constraint enables the GeneRec networks to learn the problem. Nevertheless, the combination of the approximate midpoint method and the symmetrizing constraint (i.e., the CHL algorithm) performs better than either alone. As in the previous tasks, there appears to be an advantage for the use of a recurrent network over a nonrecurrent one, as evidenced by the faster learning of AP compared to BP.
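To make the algorithm comparisons above concrete, the four GeneRec-family update rules can be sketched as follows. This is a sketch only: the algebraic forms are assumed from the descriptions in the text of equations 4.6, 5.3, 5.1, and 2.7 (with i the sending and j the receiving unit, and minus/plus denoting the two phase activations), not quoted from the equations themselves.

```python
def generec(ai_minus, ai_plus, aj_minus, aj_plus, lrate=1.0):
    """Basic GeneRec (equation 4.6): minus-phase sending activation times
    the receiver's plus-minus phase difference."""
    return lrate * ai_minus * (aj_plus - aj_minus)

def generec_sym(ai_minus, ai_plus, aj_minus, aj_plus, lrate=1.0):
    """Symmetry-preserving GeneRec (equation 5.3): average of the two
    reciprocal GeneRec updates, so dw_ij always equals dw_ji."""
    return 0.5 * (generec(ai_minus, ai_plus, aj_minus, aj_plus, lrate) +
                  generec(aj_minus, aj_plus, ai_minus, ai_plus, lrate))

def generec_mid(ai_minus, ai_plus, aj_minus, aj_plus, lrate=1.0):
    """Approximate midpoint GeneRec (equation 5.1): the sending activation
    is averaged over the minus and plus phases."""
    return lrate * 0.5 * (ai_minus + ai_plus) * (aj_plus - aj_minus)

def chl(ai_minus, ai_plus, aj_minus, aj_plus, lrate=1.0):
    """CHL (equation 2.7): plus-phase coproduct minus minus-phase coproduct."""
    return lrate * (ai_plus * aj_plus - ai_minus * aj_minus)
```

A little algebra (or a numerical check) shows that averaging the midpoint rule over the two reciprocal directions yields exactly half of the CHL update, which is the sense in which CHL combines the midpoint and symmetrization methods.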
7.4 The Family Trees Task. As was mentioned in the introduction, the family trees problem of Hinton (1986) is of particular interest because Galland (1993) reported that he was unable to train CHL to solve this problem. While I was unable to get a CHL network to learn the problem with the same number of hidden units as was used in the original backpropagation version of this task (6 "encoding" units per input/output layer, and 12 central hidden units), simply increasing the number of encoding units to 12 was enough to allow CHL to learn the task, although not with 100% reliability. Thus, the learning rate search was performed
Table 7: Results for the Family Trees Problem^a

Algorithm          ε       N     Epcs   SEM
BP                 0.39    10    129    3.0
AP                 0.30    10    181    11
BP Mid Wt 0.05     0.37    10    131    6.4
BP Mid Wt 0.1      0.38    10    130    5.1
BP Mid Wt 0.2      0.21    10    136    6.0
BP Mid Un 0.05     0.24    10    127    6.4
BP Mid Un 0.1      0.23    10    114    6.7
BP Mid Un 0.2      0.19    10    123    8.7
GR                 -       0     -      -
GR Sym             0.20    10    409    23
GR Mid             -       0     -      -
CHL                0.10    10    328    14

^a ε is the optimal learning rate, N is the number of networks that successfully solved the problem (out of 10, minimum of 5), Epcs is the mean number of epochs required to reach criterion, and SEM is the standard error of this mean. Algorithms are as described in the text.
on networks with 18 encoding and 18 hidden units to ensure that networks were capable of learning. As can be seen from the results shown in Table 7, the CHL networks were able to reliably solve this task within a roughly comparable number of epochs as the AP networks. Note that the recurrent networks (GeneRec and AP) appear to be at a disadvantage relative to feedforward BP^5 on this task, probably due to the difficulty of shaping the appropriate attractors over multiple hidden layers. Also, symmetry preservation appears to be critical for GeneRec learning in deep networks, since GeneRec networks without it were unable to solve this task (even with the midpoint method). The comparable performance of AP and CHL supports the derivation of CHL via the GeneRec algorithm as essentially a form of backpropagation, and calls into question the analyses of Galland (1993) regarding

^5 It should be noted that the performance of feedforward BP on this task is much faster than previously reported results. This is most likely due to the use of on-line learning and not using momentum, which enables the network to take advantage of the noise due to the random order of the training patterns to break the symmetry of the error signals generated in the problem and distinguish among the different training patterns.
the limitations of CHL as a deterministic approximation to a Boltzmann machine. It is difficult to determine what is responsible for the failure to learn the family trees problem reported in Galland (1993), since there are several differences in the way those networks were run compared to the ones described above, including the use of an annealing schedule, not using the 0.05, 0.95 activation cutoff, using activations with -1 to +1 range, using batch mode instead of on-line weight updating, and using activation-based as opposed to net-input-based settling. Finally, only the unit-error based midpoint method in backpropagation showed a learning speed advantage in this task. This is consistent with the trend of the previous results. The advantage of the unit-error based midpoint method might be due to the reliance on the derivative of the error with respect to the hidden unit itself, which could be a more reliable indication of the curvature of the derivative than the weight derivatives used in the other method. 7.5 The GeneRec Approximation to AP BP. The analysis presented earlier in the paper shows that GeneRec should compute the same error derivatives as the Almeida-Pineda version of error backpropagation in a recurrent network if the following conditions hold:
1. The difference of the plus and minus phase activation terms in GeneRec, which are updated in separate iterative activation settling phases, can be used to compute a unit's error term instead of the iterative update of the difference itself, which is what Almeida-Pineda uses.

2. The reciprocal weights are symmetric. This enables the activation signals from the output to the hidden units (via the recurrent weights) to reflect the contribution that the hidden units made to the output error (via the forward-going weights).

3. The difference of activations in the plus and minus phases is a reasonable approximation to the difference of net inputs times the derivative of the sigmoidal activation function. Note that this affects only the overall magnitude of the weight derivatives, not their direction.

To evaluate the extent to which these conditions are violated and the effect that this has on learning in GeneRec, two identical networks were run side-by-side on the same sequence of training patterns, with one network using AP (with the cross-entropy error function) to compute the error derivatives, and the other using GeneRec. The standard 4-2-4 encoder problem was used. The extent to which GeneRec error derivatives are the same as those computed by AP was measured by the normalized dot product between the weight derivative vectors computed by the two algorithms. This comparison was made for the input-to-hidden weights (I → H) since they reflect the error derivatives computed by the hidden units. Since the hidden-to-output weights are driven by the error signal
on the output units, which is given by the environment, these derivatives were always identical between the two networks. To control for weight differences that might accumulate over time in the two networks, the weights were copied from the GeneRec network to the AP network after each weight update. Networks were also run without this "yoking" of the weights to determine how different the overall learning trajectory was between the two algorithms given the same initial weight values. The weights were always initialized to be symmetric. As was noted above, the basic GeneRec algorithm does not preserve the symmetry of the weights, which will undoubtedly affect the computation of error gradients. The extent of symmetry was measured by the normalized dot product between the reciprocal hidden and output weights. It is predicted that this symmetry measure will determine in large part the extent to which GeneRec computes the same error derivatives as AP. To test this hypothesis, two methods for preserving the symmetry of the weights during learning were also used. One method was to use the symmetry-preserving learning rule shown in equation 5.3, and the other was a "brute-force" method where reciprocal weights were set to the average of the two values after they were updated. The advantage of this latter method is that, unlike the equation 5.3 rule, it does not change the computed weight changes. The parameters used in the networks were an activation step size (dt) of 0.2, initial activations set to 0, a settling cutoff at 0.01 maximum change in activation, a learning rate of 0.6, and initial weights uniformly random between ±0.5. The main result of this analysis is that the GeneRec algorithm typically computes essentially the same error derivatives as AP except when the weights are not symmetric. This can be seen in Figure 3, which shows two different networks running with weights yoked and using the brute-force symmetrizing method.
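The two measurements just described, the normalized dot product (used both to compare weight-derivative vectors and to measure weight symmetry) and the brute-force averaging of reciprocal weights, might be sketched as follows (a sketch with hypothetical names; the original simulator code is not given in the text):

```python
import math

def norm_dot(u, v):
    """Normalized dot product (cosine) between two vectors."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def brute_force_symmetrize(w_hid_out, w_out_hid):
    """After each update, set each pair of reciprocal weights to their
    average.  Unlike the equation 5.3 rule, this leaves the computed
    weight changes themselves intact."""
    for i in range(len(w_hid_out)):
        for j in range(len(w_hid_out[i])):
            avg = 0.5 * (w_hid_out[i][j] + w_out_hid[j][i])
            w_hid_out[i][j] = w_out_hid[j][i] = avg
```

A normalized dot product of 1 between the GeneRec and AP derivative vectors indicates identical directions of weight change; applied to the reciprocal weight matrices, the same measure serves as the symmetry index discussed below.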
The weight derivatives computed by GeneRec have a normalized dot product with those computed by AP that is nearly always 1, except when the weights got very large in the network that was stuck in a local minimum. This result shows that GeneRec usually computes the appropriate backpropagation error gradient based on the difference of equilibrium activation states in the plus and minus phases, supporting the approximation given in equation 4.5. In contrast, when no weight symmetrizing is enforced, the correspondence between the GeneRec and AP weight derivatives appears to be correlated with the extent to which the weights are symmetric, as can be seen in Figure 4. Indeed, based on the results of many runs (not shown), the ability of the GeneRec network to solve the task appeared to be correlated with the extent to which the weights remain symmetric. Note that even without explicit weight symmetrization or a symmetry preserving learning rule, the weights can become symmetric due to a fortuitous correspondence between weight changes on the reciprocal sets of weights.
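The approximation in question (assumed here, from the third condition listed above, to state that the plus-minus activation difference approximates the net-input difference scaled by the sigmoid's derivative) can be illustrated numerically with made-up net-input values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Hypothetical minus- and plus-phase net inputs for a single hidden unit.
eta_minus, eta_plus = 0.4, 0.6

# Activation difference vs. its first-order, derivative-based approximation.
act_diff = sigmoid(eta_plus) - sigmoid(eta_minus)
approx = sigmoid_deriv(eta_minus) * (eta_plus - eta_minus)
```

For a phase difference of this size the two quantities agree to within a few percent and always share the same sign, consistent with the claim that the approximation affects only the magnitude of the weight derivatives, not their direction.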
Figure 3: (a,b) Correspondence (average normalized dot product) of weight derivatives computed by the GeneRec and Almeida-Pineda algorithms for two random initial weight configurations. The weights were yoked to those computed by GeneRec, and symmetry was preserved by the brute-force method. The correspondence is nearly perfect until late in the training of the stuck network, at which point the network has developed very large weights, which appear to affect the accuracy of the computed weight derivatives.

Using the symmetry preserving rule (equation 5.3) resulted in weight changes that were typically different from those computed by AP, even though the above results show that the error derivatives at the hidden unit were correct. This is simply due to the fact that symmetric GeneRec has an additional symmetry preserving term that is not present in AP. Nevertheless, the symmetric GeneRec algorithm resulted in nonyoked learning trajectories that mirrored those of the AP algorithm remarkably closely. A representative example is shown in Figure 5. It is difficult to be certain about the source of this correspondence, which did not occur in nonyoked networks using the brute-force symmetry preservation method. Finally, there is some question as to whether GeneRec will compute the correct error derivatives in a network with multiple hidden layers, where the differences in the way GeneRec and AP compute the error terms might become more apparent due to the greater influence of recurrent settling in both the minus and plus phases. Also, based on the kinds of approximations made in deriving CHL as a deterministic Boltzmann machine, and the results of simulations, Galland (1993) concluded that the limitations of CHL become more apparent as the number of hidden layers is increased (i.e., in "deep" networks).
Figure 4: (a,b) Correspondence (average normalized dot product) of weight derivatives computed by the GeneRec and Almeida-Pineda algorithms for two random initial weight configurations. The weights were yoked to those computed by GeneRec. No symmetry was imposed on the weights. The correspondence appears to be roughly correlated with the extent to which the weights are symmetric.
To address the performance of GeneRec in a deep network, the same analysis as described above was performed on the family trees network (Hinton 1986) with the brute-force symmetrization and weight yoking. This network has three layers of hidden units. Normalized dot-product measurements of the error derivatives were computed on the weights from the "agent input" to the "agent encoding" hidden layer (I → A), from the "agent encoding" to the central hidden layer (A → H), and from the hidden layer to the "patient encoding" layer (H → P). These weights are 3, 2, and 1 hidden layers (respectively) removed from the output layer. Figure 6a shows that GeneRec still computes largely the same error derivatives as AP backpropagation even in this case. The normalized dot product measures were usually greater than 0.9, and never went below 0.7. The discrepancy between GeneRec and AP tended to increase as training proceeded for the deeper weights (I → A). This shows that as the weights got larger, the differences between GeneRec and AP due to the way that the error is computed over recurrent settling became magnified. One of the primary problems with CHL that was emphasized by Galland (1993) is the jumpy character of the error function over learning, which was argued to not provide much useful guidance in learning. However, Figure 6b shows that the AP algorithm also suffers from a bumpy error surface. The frequency of the AP bumps seems to be a bit
lower, but the amplitude can be higher. This indicates that the bumpiness is due at least in part to the recurrent nature of the network, where small weight changes can lead to very different activation states, and not to a deficiency in the learning algorithm per se.

Figure 5: Learning trajectories and error derivative correspondence for nonyoked AP and GeneRec networks with the same initial weights. (a) Standard GeneRec without any weight symmetrization. (b) GeneRec with the symmetry preserving learning rule. Even though this rule does not result in the networks computing the same weight updates, they follow a remarkably similar learning trajectory. This is not the case for regular GeneRec.

8 Possible Biological Implementation of GeneRec Learning
The preceding analysis and simulations show that the GeneRec family of phase-based, error-driven learning rules can approximate error backpropagation using locally available activation variables. The fact that these variables are available locally makes it more plausible that such a learning rule could be employed by real neurons. Also, the use of activation-based signals (as opposed to error or other variables) increases plausibility because it is relatively straightforward to map unit activation onto neural variables such as time-averaged membrane potential or spiking rate. However, there are three main features of the GeneRec algorithm that could potentially be problematic from a biological perspective: (1) weight symmetry, (2) the origin of plus and minus phase activation states, and (3) the ability of these activation states to influence synaptic modification according to the learning rule. These issues are addressed below in the context of the CHL version of GeneRec (equation 2.7), since
it is the overall best performer, and has a simpler form than the other GeneRec versions. Since the neocortex is the single most important brain area for the majority of cognitive phenomena, it is the focus of this discussion.

Figure 6: (a) Correspondence (average normalized dot product) of weight derivatives computed by the GeneRec and Almeida-Pineda algorithms for a family trees network. The weights were yoked to those computed by GeneRec, and symmetry was preserved by the brute-force method. I → A is the "agent input" to the "agent encoding" hidden layer weights, A → H is the "agent encoding" to the central hidden layer weights, and H → P is the hidden layer to the "patient encoding" layer weights. The correspondence is not as good as that for the three-layer network, but remains largely above 0.9. (b) The learning trajectory for just the AP algorithm, which is not smooth like feedforward BP.

8.1 Weight Symmetry in the Cortex. There are two ways in which the biological plausibility of the weight symmetry requirement in GeneRec (which was shown above to be important for computing the correct error gradient) can be addressed. One is to show that exact symmetry is not critical to the proper functioning of the algorithm, so that only a rough form of symmetry would be required of the biology. The other is to show that at least this rough form of symmetry is actually present in the cortex. Data consistent with these arguments are summarized briefly here. As a first-order point, Hinton (1989b) noted that a symmetry preserving learning algorithm like CHL, when combined with weight decay, will automatically lead to symmetric weights even if they did not start out that way. However, this assumes that all of the units are connected to each
other in the first place. This more difficult case of connection asymmetry was investigated in Galland and Hinton (1991) for the CHL learning algorithm. It was found that the algorithm was still effective even when all of the connectivity was asymmetric (i.e., for each pair of noninput units, only one of the two possible connections between them existed). This robustness can be attributed to a redundancy in the ways in which the error signal information can be obtained (i.e., a given hidden unit could obtain the error signal directly from the output units, or indirectly through connections to other hidden units). Also, note that the absence of any connection at all is very different from the presence of a connection with a nonsymmetric weight value, which is the form of asymmetry that was found to be problematic in the above analysis. In the former case, only a subset of the error gradient information is available, while the latter case can result in specifically wrong gradient information due to the influence of the nonsymmetric weight. Due to the automatic symmetrization property of CHL, the latter case is unlikely to be a problem.

In terms of biological evidence for symmetric connectivity, there is some indication that the cortex is at least roughly symmetrically connected. At the level of identifiable anatomical subregions of cortex, the vast majority of areas (at least within the visual cortex) are symmetrically connected to each other. That is, if area A projects to area B, area A also receives a projection from area B (Felleman and Van Essen 1991). At the level of cortical columns or "stripes" within the prefrontal cortex of the monkey, Levitt et al. (1993) showed that connectivity was symmetric between interconnected stripes. Thus, if a neuron received projections from neurons in a given stripe, it also projected to neurons in that stripe.
The more detailed level of individual neuron symmetric connectivity is difficult to assess empirically, but there is at least no evidence that it does not exist. Further, given that there is evidence for at least rough symmetry, detailed symmetry may not be critical since, as demonstrated by Galland and Hinton (1991), CHL can use asymmetrical connectivity as long as there is some way of obtaining reciprocal information through a subset of symmetric connections or indirectly via other neurons in the same area.

8.2 Phase-Based Activations in the Cortex. The origin of the phase-based activations that are central to the GeneRec algorithm touches at the heart of perhaps the most controversial aspect of error-driven learning in the cortex: "where does the teaching signal come from?" In GeneRec, the teaching signal is just the plus-phase activation state. Thus, unlike standard backpropagation, GeneRec suggests that the teaching signal is just another state of "experience" in the network. One can interpret this state as that of experiencing the actual outcome of some previous conditions. Thus, the minus phase can be thought of as the expectation of the outcome given these conditions. For example, after hearing the first three words of a sentence, an expectation will develop of which word
is likely to come next. The state of the neurons upon generating this expectation is the minus phase. The experience of hearing or reading the actual word that comes next establishes a subsequent locomotive^6 state of activation, which serves as the plus phase. This idea that the brain is constantly generating expectations about subsequent events, and that the discrepancies between these expectations and subsequent outcomes can be used for error-driven learning, has been suggested by McClelland (1994) as a psychological interpretation of the backpropagation learning procedure. It is particularly attractive for the GeneRec version of backpropagation, which uses only activation states, because it requires no additional mechanisms for providing specific teaching signals other than the effects of experience on neural activation states in a manner that is widely believed to be taking place in the cortex anyway. Further, there is evidence from ERP recordings of electrical activity over the scalp during behavioral tasks that cortical activation states reflect expectations and are sensitive to differential outcomes. For example, the widely studied P300 wave, which is a positive-going wave that occurs around 300 msec after stimulus onset, is considered to measure a violation of subjective expectancy that is determined by preceding experience over both the short and long term (Hillyard and Picton 1987). In more formal terms, Sutton et al. (1965) showed that the P300 amplitude is determined by the amount of prior uncertainty that is resolved by the processing of a given event. Thus, the nature of the P300 is consistent with the idea that it represents a plus phase wave of activation following in a relatively short time-frame the development of minus phase expectations.
While the specific properties of the P300 itself might be due to specialized neural mechanisms for monitoring discrepancies between expectations and outcomes, its presence suggests the possibility that neurons in the mammalian neocortex experience two states of activation in relatively rapid succession, one corresponding to expectation and the other corresponding to outcome. Finally, note that for most of the GeneRec variants, it seems that the neuron needs to have both the plus and minus phase activation signals in reasonably close temporal proximity to adjust its synapses based on both lralues. This is consistent with the relatively rapid expectation-outcome interpretation given above. However, CHL is a special case, since it is simply the difference between the coproduct of same-phase activations, which could potentially be computed by performing simple Hebbian associative learning for the plus phase at any point, and at any other point, performing anti-Hebbian learning on the minus phase activations (Hinton and Sejnowski 1986). This leaves open the problems of how the brain would know when to change the sign of the weight change, and how this kind o f global switch could be implemented. Also, people are capable of learning things relatively quickly (within seconds or at least minutes), so ~~~
6. This is just to demonstrate that such expectations are being generated and it is evident when they are violated.
929
Generalized Recirculation Algorithm

Table 8: Directions of Weight Change^a

                                        Plus phase variables
Minus phase variables                   a_i^+ a_j^+ ≈ 0     a_i^+ a_j^+ ≈ 1
a_i^- a_j^- ≈ 1 ([Ca2+]_i elevated)     Δw_ij = - (LTD)     Δw_ij = 0^b
a_i^- a_j^- ≈ 0 ([Ca2+]_i near 0)       Δw_ij = 0           Δw_ij = + (LTP)

^a Directions of weight change according to the CHL rule for four qualitative conditions, consisting of the combinations of two qualitative levels of minus and plus phase activation coproduct values. The minus phase activation coproduct is thought to correspond to [Ca2+]_i. Increases in synaptic efficacy correspond to long-term potentiation (LTP) and decreases to long-term depression (LTD).
^b This cell is not consistent with the biological mechanism, because both [Ca2+]_i and synaptic activity lead to LTP, not the absence of LTP. See text for a discussion of this point.
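The four cells of Table 8 follow directly from the CHL rule (equation 2.7), Δw_ij = ε(a_i^+ a_j^+ - a_i^- a_j^-). A minimal Python sketch (illustrative names and an arbitrary ε, not code from the paper) that enumerates the qualitative conditions:

```python
def chl_dw(a_minus, a_plus, eps=1.0):
    """Contrastive Hebbian learning: difference of phase coproducts.

    a_minus: minus-phase activation coproduct a_i^- * a_j^-
    a_plus:  plus-phase activation coproduct  a_i^+ * a_j^+
    """
    return eps * (a_plus - a_minus)

# The four qualitative cells of Table 8 (coproducts near 0 or 1):
for a_minus in (0.0, 1.0):        # rows: [Ca2+]_i near 0 / elevated
    for a_plus in (0.0, 1.0):     # columns: plus-phase coproduct
        dw = chl_dw(a_minus, a_plus)
        label = "LTP" if dw > 0 else "LTD" if dw < 0 else "no change"
        print(f"minus={a_minus:.0f} plus={a_plus:.0f} -> dw={dw:+.0f} ({label})")
```

Note that the (1, 1) cell comes out as zero under CHL, whereas the biological mechanism proposed below predicts LTP there; adding a Hebbian term proportional to a_i^+ a_j^+ to chl_dw would give the combined rule discussed in the text.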
this phase switching is unlikely to be a function of the difference between REM sleep and waking behavior, as has been suggested for phase-based learning algorithms (Hinton and Sejnowski 1986; Linsker 1992; Crick and Mitchison 1983). While it might be possible to come up with answers to these problems, a temporally local mechanism like that suggested above seems more plausible. 8.3 Synaptic Modification Mechanisms. Having suggested that the minus and plus phase activations follow each other in rapid succession, it remains to be shown how these two activation states could influence synaptic modification in a manner largely consistent with the CHL version of GeneRec (equation 2.7). It turns out that the biological mechanism proposed below accounts for only three out of four different qualitative ranges of the sign of the weight change required by CHL (see Table 8). Specifically, the proposed mechanism predicts weight increase to occur when both the pre- and postsynaptic neurons are active in both the plus and minus phases, whereas CHL predicts that the weight change in this condition should be zero. Thus, the proposed mechanism corresponds to a combination of CHL and a Hebbian-style learning rule, the computational implications of which are explored in O'Reilly (1996), where it is shown that the combination of error-driven and associative learning can be generally beneficial for solving many different kinds of tasks. To briefly summarize the findings of O'Reilly (1996), the Hebbian component can be thought of as imposing additional constraints on learning, much in the way that weight decay is used in conjunction with standard error backpropagation. However, the constraints imposed by Hebbian learning are actually capable of producing useful representations on their own (unlike weight decay, which would simply reduce all weights to zero if left to its own devices). Thus, the combination of CHL and Hebbian learning results in networks that, unlike those with weight decay,
Randall C. O’Reilly
learn faster (especially in deep networks like the family trees task), and generalize better (due to the effects of the additional constraints) than plain CHL networks. However, for the purposes of the present paper, the crucial aspect of the following mechanism is that it provides the error correction term, which occurs when the synaptic coproduct a_i a_j was larger in the minus phase than in the plus phase. This is the defining aspect of the error-driven learning performed by CHL, since the other qualitative ranges of the CHL learning rule are similar to standard Hebbian learning, as is evident from Table 8. For GeneRec-style learning to occur at a cellular and synaptic level, the neuron needs to be able to retain some trace of the minus phase activation state through the time when the neuron experiences its plus phase activation state. Reasoning from the ERP data described above, this time period might be around 300 msec or so. A likely candidate for the minus phase trace is intracellular Ca2+ ([Ca2+]_i), which enters the postsynaptic area via NMDA channels if both pre- and postsynaptic neurons are active. To implement a GeneRec-style learning rule, this minus phase [Ca2+]_i trace needs to interact with the subsequent plus phase activity to determine if the synapse is potentiated (LTP) or depressed (LTD). In what follows, the term synaptic activity will be used to denote the activation coproduct term a_i a_j, which is effectively what determines the amount of [Ca2+] that enters through the NMDA channel (Collingridge and Bliss 1987). There are two basic categories of mechanism which can provide the crucial error-correcting modulation of the sign of synaptic modification required by CHL. One such mechanism involves an interaction between membrane potential or synaptic activity and [Ca2+]_i, while another depends only on the level of [Ca2+]_i.
Further, there are many ways in which these signals and their timing can affect various second-messenger systems in the cell to provide the necessary modulation. In favor of something like the first mechanism, there is evidence that the mere presence of postsynaptic [Ca2+]_i is insufficient to cause LTP (Kullmann et al. 1992; Bashir et al. 1993), but it is unclear exactly what additional factor is necessary (Bear and Malenka 1994). One hypothesis is that LTP depends on the activation of metabotropic glutamate receptors, which are activated by presynaptic activity and can trigger various mechanisms in the postsynaptic compartment (Bashir et al. 1993). On the other hand, a proposed mechanism that depends only on the level of postsynaptic [Ca2+]_i (Lisman 1989) has received some recent empirical support (reviewed in Lisman 1994; Bear and Malenka 1994). This proposal stipulates that increased but moderate concentrations of postsynaptic [Ca2+]_i lead to LTD, while higher concentrations lead to LTP. Artola and Singer (1993) argue that this mechanism is consistent with the ABS learning rule (Hancock et al. 1991; Artola et al. 1990; Bienenstock et al. 1982), which stipulates that there are two thresholds for synaptic modification, θ- and θ+. A level of [Ca2+]_i that is higher than the high threshold θ+ leads to
LTP, while a level that is lower than this high threshold but above the lower θ- threshold leads to LTD. Either of the above mechanisms would be capable of producing the pattern of synaptic modification shown in Table 8 in the context of a proposed mechanism defined by the following properties:
1. Some minimal level of [Ca2+]_i is necessary for any form of synaptic modification (LTP or LTD).

2. [Ca2+]_i changes relatively slowly, and persists for at least 300+ msec. This allows [Ca2+]_i to represent prior minus phase activity, even if the synapse is not subsequently active in the plus phase.

3. Synaptic modification occurs based on the postsynaptic state after plus phase activity. This can happen locally if synaptic modification occurs after around 300+ msec since the entry of Ca2+ into the postsynaptic area (and the plus phase activity states last for at least this long). Alternatively, there could be a relatively global signal corresponding to the plus phase that triggers synaptic modification (e.g., as provided by dopaminergic or cholinergic modulation triggered by systems sensitive to the experience of outcomes following expectations).

4. If [Ca2+]_i was present initially (in the minus phase) due to synaptic activity, but the synaptic activity then diminished or ceased (in the plus phase), LTD should occur. This would be expected from the mechanisms described above, either because of an explicit interaction between synaptic activity at the time of modification in the plus phase and the trace of [Ca2+]_i from the minus phase, or because the minus phase [Ca2+]_i will have decayed into the LTD range by the time modification occurs in the plus phase.

5. If synaptic activity is taking place in the plus phase state, sufficient [Ca2+]_i is present and LTP occurs. Note that this means that any time the plus-phase activation coproduct (a_i^+ a_j^+) is reasonably large, regardless of whether there was any prior minus phase activity, the weights will be increased. This leads to a combined CHL and Hebbian learning rule, as discussed above.

There is direct evidence in support of several aspects of the proposed mechanism, some of which was discussed above, and indirect evidence in support of most of the remainder.
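Properties 1-5 can be read as a decision rule operating on a slowly decaying [Ca2+]_i trace. The sketch below is a deliberately simplified caricature: the function name, the threshold values (ca_min, theta_minus, theta_plus), and the max-combination of the two signals are invented for illustration and are not quantitative claims from the paper.

```python
def plasticity(ca_trace, plus_coproduct,
               ca_min=0.1, theta_minus=0.1, theta_plus=0.5):
    """Sign of synaptic change (+1 LTP, -1 LTD, 0 none) from a
    minus-phase [Ca2+]_i trace and plus-phase synaptic activity.
    Thresholds are illustrative placeholders, not measured values."""
    # Property 5: plus-phase activity itself drives [Ca2+]_i upward.
    ca = max(ca_trace, plus_coproduct)
    if ca < ca_min:
        return 0     # property 1: no modification without Ca2+
    if plus_coproduct >= theta_plus:
        return +1    # LTP (includes the Hebbian (1,1) cell of Table 8)
    if ca > theta_minus:
        return -1    # property 4: Ca2+ trace without plus-phase activity -> LTD
    return 0

# minus-phase-only activity falls into the LTD range (property 4):
assert plasticity(ca_trace=0.4, plus_coproduct=0.0) == -1
# strong plus-phase activity is sufficient for LTP (property 5):
assert plasticity(ca_trace=0.0, plus_coproduct=0.8) == +1
```

The point of the sketch is only that a single decaying trace plus a plus-phase readout reproduces the qualitative sign pattern of Table 8, with the (1, 1) cell coming out as LTP rather than zero.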
Since the empirical literature on LTP and LTD is vast, only a brief summary will be given here (see Artola and Singer 1993; Bear and Malenka 1994; Linden 1994; Malenka and Nicoll 1993, for recent reviews). It should be noted that most of these findings have been described both in the hippocampus and neocortex, and appear to be quite general (Artola and Singer 1993; Linden 1994). Also, note that the NMDA receptor itself is not subject to potentiation, so that the current value of the synaptic weight does not have to be included in learning rules, which is in accordance with GeneRec.
With respect to point 1, the importance of [Ca2+]_i for LTP has been known for a while (e.g., Collingridge and Bliss 1987), and it is now clear that it is critical for LTD as well (Brocher et al. 1992; Mulkey and Malenka 1992; Hirsh and Crepel 1992). In support of point 2, the time course of [Ca2+]_i concentration has been measured in several studies (e.g., Jaffe et al. 1992; Perkel et al. 1993), and it appears to be relatively long-lasting (on the order of 1 or more seconds), though it is not clear that these results reflect what would happen under less invasive conditions. As for point 3, Malenka et al. (1992) found that a significant time period (up to 1-2 sec) of enhanced postsynaptic [Ca2+]_i was necessary for LTP induction. Also, typical LTP and LTD induction regimes involve constant stimulation at a given frequency for time periods longer than a second. However, the precise time course of synaptic potentiation needs to be studied in greater detail to evaluate this issue fully. With respect to the existence of a global learning signal, the ERP data described earlier and the role of neuromodulatory systems like dopamine suggest that such monitoring systems might exist in the brain. For example, Schultz et al. (1993) describe the important role that dopamine plays in learning and responding to salient environmental stimuli. However, these modulatory effects are probably not of an all-or-nothing nature, and, given that LTP and LTD can be induced by the direct electrical stimulation of individual neurons, it is not likely that learning is completely dependent on a global signal. To summarize, the proposed synaptic modification mechanism is consistent with several findings, but also requires further mechanisms. As such, the proposal outlined above constitutes a set of predictions regarding additional factors that should determine the sign and magnitude of synaptic modification. 9 Conclusions
The analysis and simulation results presented in this paper support the idea that the GeneRec family of learning algorithms is performing variations of error backpropagation in a recurrent network using locally available activation variables. However, there is no single GeneRec algorithm that is exactly equivalent to the Almeida-Pineda algorithm for backpropagation in recurrent networks since GeneRec requires symmetric weights yet it is not itself symmetry preserving. The idea that the CHL algorithm is equivalent to a symmetry-preserving version of GeneRec using the midpoint integration method is supported by the pattern of learning speed results for the different versions of GeneRec, and by the learning speed increases obtained when using the approximate midpoint integration method in backpropagation networks. Further, it was shown that CHL (and symmetric GeneRec without the midpoint method) can reliably learn the family trees problem, calling
into question the idea that CHL is a fundamentally flawed learning algorithm for deterministic networks, as was argued by Galland (1993). Thus, the weight of the evidence suggests that CHL should be viewed as a variation of recurrent backpropagation, not as a poor approximation to the Boltzmann machine learning algorithm. However, as a consequence of the differences between GeneRec and AP backpropagation (mainly the symmetry constraint), one can expect GeneRec to have somewhat different characteristics compared to standard backpropagation algorithms, and these differences have implications for psychological or computational models (O'Reilly 1996). Thus, the present analysis does not imply that just because there exists a biologically plausible form of backpropagation, all forms of backpropagation are now biologically plausible. Finally, while CHL gave the best performance of the GeneRec networks in three out of the four tasks studied in this paper, the symmetry preservation constraint ended up being a liability in the 4-2-4 encoder task. Thus, the GeneRec-based derivation of CHL can have practical consequences in the selection of an appropriate algorithm for a given task. Also, this derivation allows one to derive CHL-like algorithms for different activation functions, and other network parameters. Perhaps the most important contribution of this work is that it provides a unified computational approach to understanding how error-driven learning might occur in the brain. Given that the GeneRec learning rules are quite possibly the simplest and most local way of performing a very general and powerful form of learning, it seems plausible that the brain would be using something like them. The specific biological mechanism proposed in this paper, which is consistent with several empirical findings, provides a starting point for exploring this hypothesis. Appendix A: Plus Phase Approximation to Trial Step Activations
The relationship between the plus-phase activation of a GeneRec hidden unit j (h_j^+) and that which would result if an Euler weight update step were taken to reduce the error (denoted h_j^*) can be more formally established. This is done by simply recomputing the activation of the hidden unit based on the weights after they have been updated from the current error derivatives. Using the basic GeneRec algorithm with the difference of net-input terms instead of activation terms (this makes the computation easier), the trial step (starred) weights would be as follows:
Note that the original value of o_k^- is used here, whereas in the exact computation of the midpoint method in a recurrent network, the output activation value would change when the weights are changed. However,
it is impossible to express in closed form what this value would be, since it would result from a settling process in the recurrent network, so the original value is used as an approximation to the actual value. This is the only sense in which the following analysis is approximate. The trial step weights above can then be used to compute the net input that the unit would receive after such weight changes (denoted η_j^*) as follows (using the fact that η_j = Σ_i s_i w_ij + Σ_k o_k^- w_kj):
To simplify, let
which gives
Thus, the plus-phase net-input (and therefore activation) is equivalent to a forward Euler step if the learning rate ε is set so as to meet the conditions in equation A.5:
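The displayed equations of this appendix did not survive reproduction here. The following sketch reconstructs the derivation from the surrounding text and the standard net-input form of GeneRec; treat the exact forms (and the identification of ε = 1/u with equations A.5-A.6) as an inference rather than the original typesetting. Here s_i are the input activations, o_k^- the minus-phase output activations, η_j^± the phase net inputs to hidden unit j, and u the sum of squared presynaptic activations:

```latex
% Trial-step (starred) weights under the net-input form of GeneRec:
w^{*}_{ij} = w_{ij} + \epsilon\, s_i \left(\eta_j^{+} - \eta_j^{-}\right),
\qquad
w^{*}_{kj} = w_{kj} + \epsilon\, o_k^{-} \left(\eta_j^{+} - \eta_j^{-}\right)

% Net input after the trial step, reusing the original o_k^-:
\eta_j^{*} = \sum_i s_i w^{*}_{ij} + \sum_k o_k^{-} w^{*}_{kj}
           = \eta_j^{-} + \epsilon \left(\eta_j^{+} - \eta_j^{-}\right)
             \Bigl( \sum_i s_i^2 + \sum_k \bigl(o_k^{-}\bigr)^2 \Bigr)

% Abbreviating the sum of squared presynaptic activations:
u = \sum_i s_i^2 + \sum_k \bigl(o_k^{-}\bigr)^2
\quad\Longrightarrow\quad
\eta_j^{*} = \eta_j^{-} + \epsilon\, u \left(\eta_j^{+} - \eta_j^{-}\right)

% The trial step reproduces the plus-phase net input exactly when:
\epsilon = \frac{1}{u}
```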
Using a fixed learning rate that is smaller than that given by equation A.6 would result in a starred (trial step) activation value that is in the same direction as the plus-phase activation value (since the εu term is then bounded between zero and one) but not quite as different from the minus phase value. Acknowledgments
I would like to thank the following people for their useful comments on earlier drafts of this manuscript: Peter Dayan, Jay McClelland, Javier Movellan, Yuko Munakata, and Rich Zemel.
References
Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. 1985. A learning algorithm for Boltzmann machines. Cogn. Sci. 9, 147-169.
Almeida, L. B. 1987. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proceedings of the IEEE First International Conference on Neural Networks, San Diego, CA, M. Caudill and C. Butler, eds., pp. 609-618.
Artola, A., and Singer, W. 1993. Long-term depression of excitatory synaptic transmission and its relationship to long-term potentiation. Trends Neurosci. 16, 480.
Artola, A., Brocher, S., and Singer, W. 1990. Different voltage-dependent thresholds for inducing long-term depression and long-term potentiation in slices of rat visual cortex. Nature (London) 347, 69-72.
Bashir, Z., Bortolotto, Z. A., and Davies, C. H. 1993. Induction of LTP in the hippocampus needs synaptic activation of glutamate metabotropic receptors. Nature (London) 363, 347-350.
Battiti, T. 1992. First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Comp. 4(2), 141-166.
Bear, M. F., and Malenka, R. C. 1994. Synaptic plasticity: LTP and LTD. Curr. Opin. Neurobiol. 4, 389-399.
Bienenstock, E. L., Cooper, L. N., and Munro, P. W. 1982. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2(2), 32-48.
Brocher, S., Artola, A., and Singer, W. 1992. Intracellular injection of Ca2+ chelators blocks induction of long-term depression in rat visual cortex. Proc. Natl. Acad. Sci. U.S.A. 89, 123.
Collingridge, G. L., and Bliss, T. V. P. 1987. NMDA receptors: their role in long-term potentiation. Trends Neurosci. 10, 288-293.
Crick, F. H. C. 1989. The recent excitement about neural networks. Nature (London) 337, 129-132.
Crick, F. H. C., and Mitchison, G. 1983. The function of dream sleep. Nature (London) 304, 111-114.
Felleman, D. J., and Van Essen, D. C. 1991. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex 1, 1-47.
Galland, C. C. 1993. The limitations of deterministic Boltzmann machine learning.
Network 4, 355-379.
Galland, C. C., and Hinton, G. E. 1990. Discovering high order features with mean field modules. In Advances in Neural Information Processing Systems, 2, D. S. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Galland, C. C., and Hinton, G. E. 1991. Deterministic Boltzmann learning in networks with asymmetric connectivity. In Connectionist Models: Proceedings of the 1990 Summer School, D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton, eds., pp. 3-9. Morgan Kaufmann, San Mateo, CA.
Hancock, P. J. B., Smith, L. S., and Phillips, W. A. 1991. A biologically supported error-correcting learning rule. Neural Comp. 3, 201-212.
Hillyard, S. A., and Picton, T. W. 1987. Electrophysiology of cognition. In Handbook of Physiology, Section 1: Neurophysiology, Volume V: Higher Functions of the Brain, F. Plum, ed., pp. 519-584. American Physiological Society.
Hinton, G. E. 1986. Learning distributed representations of concepts. Proceedings of the 8th Conference of the Cognitive Science Society, pp. 1-12. Lawrence Erlbaum, Hillsdale, NJ.
Hinton, G. E. 1989a. Connectionist learning procedures. Artificial Intelligence 40, 185-234.
Hinton, G. E. 1989b. Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Comp. 1, 143-150.
Hinton, G. E., and McClelland, J. L. 1988. Learning representations by recirculation. In Neural Information Processing Systems, 1987, D. Z. Anderson, ed., pp. 358-366. American Institute of Physics, New York.
Hinton, G. E., and Sejnowski, T. J. 1986. Learning and relearning in Boltzmann machines, Chap. 7. In Parallel Distributed Processing. Volume 1: Foundations, D. E. Rumelhart, J. L. McClelland, and PDP Research Group, eds., pp. 282-317. MIT Press, Cambridge, MA.
Hirsh, J. C., and Crepel, F. 1992. Postsynaptic Ca2+ is necessary for the induction of LTP and LTD of monosynaptic EPSPs in prefrontal neurons: An in vitro study in the rat. Synapse 10, 173-175.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Jaffe, D. B., Johnston, D., Lasser-Ross, N., Lisman, J. E., Miyakawa, H., and Ross, W. N. 1992. The spread of Na+ spikes determines the pattern of dendritic Ca2+ entry into hippocampal neurons. Nature (London) 357, 244-246.
Kullmann, D. M., Perkel, D. J., Manabe, T., and Nicoll, R. A. 1992. Ca2+ entry via postsynaptic voltage-sensitive Ca2+ channels can transiently potentiate excitatory synaptic transmission in the hippocampus. Neuron 9, 1175-1183.
LeCun, Y., and Denker, J. S. 1991. A new learning rule for recurrent networks. Proceedings of the Conference on Neural Networks for Computing, Snowbird, UT.
Levitt, J. B., Lewis, D. A., Yoshioka, T., and Lund, J. S. 1993. Topography of pyramidal neuron intrinsic connections in macaque monkey prefrontal cortex (areas 9 & 46). J. Comp. Neurol. 338, 360-376.
Linden, D. J. 1994. Long-term synaptic depression in the mammalian brain. Neuron 12, 457-472.
Linsker, R. 1992. Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Comp. 4, 691-702.
Lisman, J. 1994. The CaM kinase II hypothesis for the storage of synaptic memory. Trends Neurosci. 17, 406.
Lisman, J. E. 1989. A mechanism for the Hebb and the anti-Hebb processes underlying learning and memory. Proc. Natl. Acad. Sci. U.S.A. 86, 9574-9578.
Malenka, R. C., and Nicoll, R. A. 1993. NMDA receptor-dependent synaptic plasticity: Multiple forms and mechanisms. Trends Neurosci. 16, 521-527.
Malenka, R. C., Lancaster, B., and Zucker, R. S. 1992. Temporal limits on the rise in postsynaptic calcium required for the induction of long-term potentiation. Neuron 9, 121-128.
Mazzoni, P., Andersen, R. A., and Jordan, M. I. 1991. A more biologically plausible learning rule for neural networks. Proc. Natl. Acad. Sci. U.S.A. 88, 4433-4437.
McClelland, J. L. 1991. The interaction of nature and nurture in development:
A parallel distributed processing perspective. In Current Advances in Psychological Science: Ongoing Research, P. Bertelson, P. Eelen, and G. d'Ydewalle, eds., pp. 57-88. Lawrence Erlbaum, Hillsdale, NJ.
Movellan, J. R. 1990. Contrastive Hebbian learning in the continuous Hopfield model. In Proceedings of the 1989 Connectionist Models Summer School, D. S. Touretzky, G. E. Hinton, and T. J. Sejnowski, eds., pp. 10-17. Morgan Kaufmann, San Mateo, CA.
Mulkey, R. M., and Malenka, R. C. 1992. Mechanisms underlying induction of homosynaptic long-term depression in area CA1 of the hippocampus. Neuron 9, 967-975.
O'Reilly, R. C. 1996. The Leabra model of neural interactions and learning in the neocortex. Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA.
Perkel, D. J., Petrozzino, J. J., Nicoll, R. A., and Connor, J. A. 1993. The role of Ca2+ entry via synaptically activated NMDA receptors in the induction of long-term potentiation. Neuron 11, 817-823.
Peterson, C. 1991. Mean field theory neural networks for feature recognition, content addressable memory, and optimization. Connection Sci. 3, 3-33.
Peterson, C., and Anderson, J. R. 1987. A mean field theory learning algorithm for neural networks. Complex Syst. 1, 995-1019.
Peterson, C., and Hartman, E. 1989. Explorations of the mean field theory learning algorithm. Neural Networks 2, 475-494.
Pineda, F. J. 1987a. Generalization of backpropagation to recurrent and higher order neural networks. In Proceedings of IEEE Conference on Neural Information Processing Systems, Denver, CO, D. Z. Anderson, ed., pp. 602-611. IEEE, New York.
Pineda, F. J. 1987b. Generalization of backpropagation to recurrent neural networks. Phys. Rev. Lett. 59, 2229-2232.
Pineda, F. J. 1988. Dynamics and architecture for neural computation. J. Complexity 4, 216-245.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1988. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986a. Learning internal representations by error propagation, Chap. 8. In Parallel Distributed Processing. Volume 1: Foundations, D. E. Rumelhart, J. L. McClelland, and PDP Research Group, eds., pp. 318-362. MIT Press, Cambridge, MA.
Rumelhart, D. E., McClelland, J. L., and PDP Research Group, eds. 1986b. Parallel Distributed Processing. Volume 1: Foundations. MIT Press, Cambridge, MA.
Schultz, W., Apicella, P., and Ljungberg, T. 1993. Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. J. Neurosci. 13, 900-913.
Sutton, S., Braren, M., Zubin, J., and John, E. R. 1965. Evoked-potential correlates of stimulus uncertainty. Science 150, 1187-1188.
Tesauro, G. 1990. Neural models of classical conditioning: A theoretical viewpoint. In Connectionist Modelling and Brain Function, S. J. Hanson and C. R. Olson, eds. MIT Press, Cambridge, MA.
Zipser, D., and Andersen, R. A. 1988. A back propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature (London) 331, 679-684.
Zipser, D., and Rumelhart, D. E. 1990. Neurobiological significance of new learning models. In Computational Neuroscience, E. Schwartz, ed., pp. 192-200. MIT Press, Cambridge, MA.
Received August 7, 1995; accepted January 8, 1996
Communicated by John Platt
Effects of Nonlinear Synapses on the Performance of Multilayer Neural Networks
G. Dundar, F.-C. Hsu, and K. Rose
Electrical, Computer, and Systems Engineering Department, Center for Integrated Electronics and Electronics Manufacturing, Rensselaer Polytechnic Institute, Troy, NY 12180 USA
The problems arising from the use of nonlinear multipliers in multilayer neural network synapse structures are discussed. The errors arising from the neglect of nonlinearities are shown and the effect of training in eliminating these errors is discussed. A method for predicting the final errors resulting from nonlinearities is described. Our approximate results are compared with the results from circuit simulations of an actual multiplier circuit. 1 Introduction
Analog implementations of neural networks have many desirable properties such as small size or high speed. However, there are also many problems in designing analog neural network circuitry about which relatively little work has been done. Among these problems are quantization (Xie and Jabri 1991, 1992; Dundar and Rose 1995), component variations due to fabrication (Dolenko and Card 1993), circuit nonidealities in neurons (Frye et al. 1991), noise (Frye et al. 1991), and circuit nonidealities in synapses. Many multipliers have been developed for use in neural networks as synapses. However, most of these have sacrificed speed for area or area for accuracy. The more linear or more accurate multipliers have a high number of transistors (Qin and Geiger 1987; Mead 1989; Rossetto et al. 1989). Increasing the size of the multiplier is not a good solution because synapses are the most common elements in neural networks and largely determine the size of the network. In this paper we will demonstrate in Section 2 the extent to which a quadratic synapse function degrades neural network performance when the network is trained with ideal, linear synapses and the resulting weights are downloaded to a network with quadratic synapse functions. Two typical applications are considered: function approximation and pattern classification. Section 3 shows that much better results are obtained if training is done with quadratic synapse functions. We also indicate Neural Computation 8, 939-949 (1996) © 1996 Massachusetts Institute of Technology
940
G. Dundar, F-C. Hsu, and K. Rose
how the backpropagation algorithm can be modified to allow training with nonlinear synapses. Section 4 develops a theory that predicts the dependence of error on the amount of nonlinearity. The theory is based on treating the error introduced by the nonlinearity as equivalent to quantization noise. In Section 5 we discuss MOSFET implementations of multiplier circuits and show that our simplified analysis is a reasonable approximation of the behavior of a Gilbert multiplier. Section 6 draws conclusions.
2 The Effects of Synapse Nonlinearity on Neural Networks
In this study, we describe a nonlinear synapse by the function y = x(w + aw²), where y is the output of the synapse, x is the input, w is the weight, and a is what we call the nonlinearity coefficient. The response of a synapse circuit is typically nonlinear in both the input and the weight. If the nonlinearity is in the input, we can use the function y = (x + axⁿ)w, where n is the degree of the nonlinearity. Mathematically, from the perspective of y, both functions have comparable effects. However, there is an important practical distinction. The amplitude of x can be restricted to keep the multiplier output linear in x, and this can be maintained by the squashing effect of the neuron function. w, on the other hand, needs to be allowed a wider range for effective training. When a is zero, there is no nonlinearity and the network behaves ideally. We have chosen this function for several reasons. One of these is that the nonlinearity gets larger as w or x gets larger. This is often the case in practical designs, as we will see in Section 5. Second, it is possible to generate both convex and concave transfer functions by changing the sign of a. Also, one can easily adjust the amount of nonlinearity present in the function by changing the magnitude of a. One can make similar arguments for cubic, n = 3, nonlinearities, as we will see in Section 4. Quadratic or cubic nonlinearities are reasonable first approximations to circuit nonlinearities, as we will see in Section 5. Our first experiment is representative of training a neural network on a commercial software package for the ideal case (a = 0) and then downloading the weights to a neural network chip (where a is nonzero). To determine the effects of synapse nonlinearity on overall performance, we have run a network with simple quadratically nonlinear synapses on two typical examples.
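In code, the two synapse models defined above read as follows (a minimal sketch; the function names are ours, for illustration only):

```python
# Sketches of the two nonlinear synapse models (names are ours).

def synapse_weight_nl(x, w, a):
    """Nonlinearity in the weight: y = x * (w + a * w^2)."""
    return x * (w + a * w**2)

def synapse_input_nl(x, w, a, n=2):
    """Nonlinearity in the input: y = (x + a * x^n) * w."""
    return (x + a * x**n) * w

# a = 0 recovers the ideal linear synapse y = x * w in both cases.
```
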
One of these is a sine generator, which represents a function approximation application, and the other is a 4-bit A/D converter, which represents a pattern classification application. The sine generator has 1 input, 10 neurons in the hidden layer, and 1 output neuron. The A/D converter has 1 input, 15 neurons in the hidden layer, and 4 output neurons. Both networks have sigmoid neurons and were trained using standard backpropagation.
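A forward pass through such a network can be sketched as below, assuming the quadratic input nonlinearity y = w(x + ax²) on every synapse and a constant bias input appended to each layer; the weight values here are random stand-ins, not trained weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, a):
    """Forward pass where every synapse computes w * (x + a * x**2).
    `weights` is a list of (fan_in + 1, fan_out) arrays; the extra
    row multiplies a constant bias input of 1."""
    h = np.atleast_1d(np.asarray(x, dtype=float))
    for W in weights:
        h = np.append(h, 1.0)            # bias input
        h = sigmoid((h + a * h**2) @ W)  # nonlinear synapses, sigmoid neuron
    return h

rng = np.random.default_rng(0)
# random stand-in weights for the 1-10-1 sine-generator topology
sine_net = [rng.normal(size=(2, 10)), rng.normal(size=(11, 1))]
out_linear = forward(0.3, sine_net, a=0.0)  # ideal multipliers
out_quad = forward(0.3, sine_net, a=0.1)    # quadratic synapses
```

With a = 0 the same routine reproduces the ideal linear-synapse network, which makes it easy to compare the two regimes on identical weights.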
Nonlinear Synapses in Neural Networks
941
[Plot: A/D converter with nonlinear synapses, y = w(x + ax²); misclassification (%) versus nonlinearity coefficient a, from -1.0 to 0.5.]
Figure 1: A/D conversion with nonlinear synapses: forward propagation (ideal circuit training) and nonlinear synapse training simulations.
For both examples, the performance showed a very marked dependence on a. Even for a value of a as small as 0.1, the misclassification ratio for the A/D converter was more than 25% (Fig. 1). This is completely unacceptable, since even 5% misclassification can determine failure or success in many cases. A similar curve was observed for the sine generator (Fig. 2). For the sine generator, the error is the rms error between an ideal sine wave and the sine wave obtained from the simulations. This rms error is obtained by sampling the ideal sine wave and the simulation results at a large number of points.
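The rms figure of merit just described can be sketched as follows; the `approx` function below is a hypothetical stand-in for a simulated network output, not data from the paper:

```python
import numpy as np

def rms_error(f_ideal, f_net, xs):
    """RMS error between an ideal target and a network's output,
    sampled at a large number of points."""
    diff = f_ideal(xs) - f_net(xs)
    return float(np.sqrt(np.mean(diff**2)))

xs = np.linspace(0.0, 1.0, 1000)
ideal = lambda x: np.sin(2.0 * np.pi * x)
# hypothetical stand-in for the network output with nonlinear synapses
approx = lambda x: np.sin(2.0 * np.pi * x) + 0.05 * np.cos(7.0 * x)
err = rms_error(ideal, approx, xs)  # on the order of the 0.05 perturbation
```
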
3 Training Networks with Synapse Nonlinearity
A second experiment was run, representative of training a network on custom software with nonlinear synapses modeled accurately by the software. (The analog neural network simulator we developed, ANNS, was specifically designed to facilitate this so that we could examine the effects of particular circuit realizations on neural network behavior.) In this case, the errors were reduced drastically and were
[Plot: sine generator with nonlinear synapses, y = w(x + ax²); rms error versus nonlinearity coefficient a, from -1.0 to 0.5, for forward propagation and for training with nonlinear synapses.]
Figure 2: Sine generator with nonlinear synapses: forward propagation (ideal circuit training) and nonlinear synapse training simulations.
more or less constant over a large range of a values. However, there were problems in training networks with large negative values of a, as shown in Figures 1 and 2. This is probably due to the fact that high positive synapse outputs are not obtainable with small weights in this case. The modification of the backpropagation algorithm to allow training with nonlinear synapses is not tricky. It can be obtained from the original derivation of the backpropagation algorithm, which involves the minimization of the square error, by substituting the actual synapse function for simple multiplication. This substitution gives the result that one must use the nonlinear synapse when forward propagating the signal and the derivative of the synapse function when backpropagating the error. In Lont and Guggenbühl (1992), the authors have also shown that their nonlinear synapse functions converge with a similar modification of the backpropagation algorithm. This modification is independent of the form of the synapse function and can be applied irrespective of the order and type of nonlinearity.
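The modified update can be sketched for a single hidden layer with the input-nonlinearity synapse y = w(x + ax²). This is our illustration of the rule stated above (forward pass with the actual synapse function, backward pass with its derivatives), not the authors' code, and the demonstration numbers are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, t, W1, W2, a, lr=0.1):
    """One gradient step for a one-hidden-layer net whose synapses compute
    w * (x + a * x**2).  Forward: the actual synapse function.
    Backward: its derivatives, dy/dw = x + a*x^2 and dy/dx = w*(1 + 2*a*x)."""
    # forward pass with the nonlinear synapse
    h = sigmoid((x + a * x**2) @ W1)
    y = (h + a * h**2) @ W2
    # backward pass for the squared error 0.5*(y - t)^2
    dy = y - t
    gW2 = np.outer(h + a * h**2, dy)      # dE/dW2: derivative w.r.t. the weight
    dh = (W2 @ dy) * (1.0 + 2.0 * a * h)  # error through the synapse input
    ds = dh * h * (1.0 - h)               # through the sigmoid neuron
    gW1 = np.outer(x + a * x**2, ds)
    return W1 - lr * gW1, W2 - lr * gW2

# tiny demonstration on one input/target pair (hypothetical numbers)
x, t = np.array([0.5]), np.array([0.3])
W1 = np.array([[0.5, -0.3, 0.2]])
W2 = np.array([[0.4], [0.1], [-0.2]])

def mse(W1, W2, a=0.1):
    h = sigmoid((x + a * x**2) @ W1)
    return 0.5 * float((((h + a * h**2) @ W2 - t)[0]) ** 2)

loss_before = mse(W1, W2)
for _ in range(100):
    W1, W2 = train_step(x, t, W1, W2, a=0.1)
loss_after = mse(W1, W2)  # much smaller than loss_before
```

Setting a = 0 everywhere recovers ordinary backpropagation, which is consistent with the claim that the modification is independent of the form of the synapse function.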
Nonlinear Synapses in Neural Networks
943
4 Predicting the Effects of Synapse Nonlinearity on System Performance
Having simulated the effects of synapse nonlinearity on the behavior of networks, we would like to be able to estimate them theoretically. Our approach to predicting the effects of nonlinear synapse functions is to view the deviation from linear synapse behavior as noise. Therefore, we define a "signal," which is the average value of the output of the linear synapse, and an "error," which is how much the output deviates from the output of a standard linear synapse. For a quadratic nonlinearity in x, the "signal" is xw and the "error" is ax²w. By regarding the errors due to the nonlinearity as equivalent to quantization noise, we can use previously derived results to analyze this situation. The effects of quantization of weights on neural network behavior have been studied by Xie and Jabri (1991, 1992). Their work has been extended by Dundar and Rose (1995). For a quantized, limited-resolution network, the "signal" is given by xw and the "noise" by xΔw, where Δw is the noise in the weights due to quantization. Equating the error terms for nonlinearity and quantization gives
ax²w = xΔw
(4.1)
or
axw = Δw
(4.2)
For an arbitrary network, w and Δw are taken to be random, uniformly distributed variables with zero mean. The signal amplitude, x, is taken to be uniformly distributed between -1 and 1 and independent of w. To be useful, equation 4.2 should define equivalent distributions of values with the same mean, 0, and variance. Δw ranges from -Δ/2 to Δ/2, where Δ is the quantization level; it has a mean of 0 and a variance Δ²/12. w ranges from -W_max to W_max and has a mean of 0 and a variance W_max²/3. 2W_max = (2^N - 1)Δ ≈ 2^N Δ, where N is the number of bits of quantization. To this approximation, equating variances gives a relation between N and a.
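The variance computation can be spelled out; the intermediate lines below are reconstructed from the distributions just defined:

```latex
% x ~ U(-1,1), w ~ U(-W_max, W_max), Delta w ~ U(-Delta/2, Delta/2),
% with Delta \approx 2 W_max / 2^N:
\mathrm{Var}(\Delta w) = \frac{\Delta^2}{12}
  \approx \frac{(2 W_{\max}/2^N)^2}{12} = \frac{W_{\max}^2}{3 \cdot 4^N},
\qquad
\mathrm{Var}(a x w) = a^2 \, E[x^2] \, E[w^2]
  = a^2 \cdot \frac{1}{3} \cdot \frac{W_{\max}^2}{3}
  = \frac{a^2 W_{\max}^2}{9}.
% Equating the two variances and solving for 2^N:
\frac{a^2 W_{\max}^2}{9} = \frac{W_{\max}^2}{3 \cdot 4^N}
\quad\Longrightarrow\quad 4^N = \frac{3}{a^2}
\quad\Longrightarrow\quad 2^N = \frac{\sqrt{3}}{a}.
```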
2^N = √3/a
(4.3)
Equation 4.3 is our principal result because it relates the nonlinearity coefficient, a, to the number of bits of quantization, N. Using statistical methods, we have derived expressions that will calculate both the mean and the standard deviation of the signal itself and the noise at any point in a network with quantized synapse weights, once the quantization level and the properties of the network are known. One can predict the effects of the synapse nonlinearity by straightforward substitution of the equivalent number of bits into the expressions in Dundar and Rose
G. Diindar, F-C. Hsu, and K. Rose
944
[Plot: nonlinear synapses, theoretical and experimental results for the A/D converter; error versus nonlinearity coefficient.]
Figure 3: A/D conversion with nonlinear synapses: forward propagation (ideal circuit training) simulation and calculations.
(1995). The expressions are too long and complicated to reproduce here, but they predict an exponential dependence of SNR on the number of bits of quantization. Predicted and actual results are graphed in Figures 3 and 4. Figure 3 was generated using results from the A/D converter, while Figure 4 was generated using results from the sine generator. The success of the prediction is evident from these graphs. The prediction deviates from the actual results only at very high error values. However, this is to be expected. Our model for noise breaks down for high values of noise, since it assumes that the noise is small compared to the signal. This is not a drawback, however, since predicting a misclassification ratio of over 80% accurately has no practical significance. The important prediction range is the range where the errors are small. Some work has been done simulating these neural networks with lookup tables generated from SPICE simulations; the results indicated slight (less than 10%) improvements to the values in the above graphs. We will consider this point in much greater detail in Section 5, where we consider the behavior of actual multiplier circuits. A similar analysis can be carried out for a cubic nonlinearity. Here we would equate
bx³w = xΔw
(4.4)
Nonlinear Synapses in Neural Networks
945
[Plot: nonlinear synapses, theoretical and experimental results for the sine generator; rms error versus nonlinearity coefficient.]
Figure 4: Sine generator with nonlinear synapses: forward propagation (ideal circuit training) simulation and calculations.
Equating variances gives
2^N ≈ √5/b
(4.5)
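Both equivalences (equations 4.3 and 4.5) can be checked with a quick Monte Carlo experiment; the values of W_max, a, and b below are arbitrary choices made only for the check:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
W_max, a, b = 5.0, 0.1, 0.1  # arbitrary values for the check
x = rng.uniform(-1.0, 1.0, n)
w = rng.uniform(-W_max, W_max, n)

# quadratic case (equation 4.3): 2^N = sqrt(3)/a, Delta = 2*W_max/2^N
delta_q = 2.0 * W_max / (np.sqrt(3.0) / a)
dw_q = rng.uniform(-delta_q / 2, delta_q / 2, n)
ratio_q = np.var(a * x * w) / np.var(dw_q)     # close to 1

# cubic case (equation 4.5): 2^N = sqrt(5)/b, error term b*x^2*w vs. Delta w
delta_c = 2.0 * W_max / (np.sqrt(5.0) / b)
dw_c = rng.uniform(-delta_c / 2, delta_c / 2, n)
ratio_c = np.var(b * x**2 * w) / np.var(dw_c)  # close to 1
```

With a million samples, both variance ratios come out within about a percent of unity, consistent with the equivalent-bits argument.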
5 Nonlinearities in Multiplier Circuits
The standard multiplier used in many applications in all areas of electrical engineering is the famous Gilbert multiplier (Gilbert 1968). However, this multiplier has not enjoyed as much success in its MOSFET version as in its BJT version. Kub et al. (1990) have shown that the MOSFET Gilbert multiplier has a total harmonic distortion of greater than 5% over a small input range of -2 to 2 V, even with a supply voltage as high as 10 V. Furthermore, the multiplier does not operate symmetrically. The characteristics of these multipliers can be found in Mead (1989), Gilbert (1968), Kub et al. (1990), and Holler et al. (1989). Tsividis et al. (1986) and Tsividis and Satyanarayana (1987) have developed synapse circuits that are much smaller but that also exhibit nonlinearities. In this section we will focus on the behavior of the standard Gilbert multiplier. A schematic diagram of a MOSFET implementation of the Gilbert multiplier is shown in Figure 5. This multiplier was laid out in the MOSIS
G. Dundar, F-C. Hsu, and K. Rose
946
Figure 5: MOSFET Gilbert multiplier circuit diagram.
2 μm SCMOS n-well technology using MAGIC, which was also used to extract the SPICE file for the circuit. W/L = 3 for all transistors. Typically, the differential output currents are connected to an operational amplifier; in our simulation they are both connected to 5 V and V_ss is set at -5 V to ensure a high gain. The difference in output currents is approximately proportional to the product ΔV_x ΔV_w, where ΔV_x = V_x+ - V_x- and ΔV_w = V_w+ - V_w-. The circuit was simulated using SPICE, and the resulting I-V characteristics are shown in Figure 6. To simplify the situation we have taken V_x- = 0 and swept V_x+ from -3 to +3 V. For positive weights, V_w+ is set to 0 V and V_w- is set to negative voltages, while for negative weights, V_w- is set to 0 V and V_w+ is set to negative voltages. Figure 6 shows substantial nonlinearities for both x ≡ ΔV_x and w ≡ ΔV_w when their magnitudes are greater than 1 V. We have found that these characteristics are reasonably fit, over the ranges -1 < x < 1 and 0 < w < 5, by a quadratic nonlinearity in w with
[Plot: output current versus V_x+ from -3.0 V to 3.0 V for several weight voltages.]
Figure 6: I-V characteristics for the Gilbert multiplier.
a = 0.25. The deviations of the characteristics from a quadratic form range from ±1 to ±25%, with an rms error of 15%. The rms deviations for w = 1, 2, 3, 4, and 5 are 5.7, 13, 11, 24, and 5.1%, respectively. Thus, the deviation from a quadratic nonlinearity does not increase monotonically with w. Several experiments were run training a 1-5-1 neural network for the sine generator. In the first experiment the weights were found by backpropagation with linear synapses (ideal multiplication) and then run in forward propagation with the same linear synapses. It should be noted that in addition to the single input, bias inputs were required for the hidden layer and the output layer. Thus, a total of 2 × 5 weights were required for the hidden layer and 5 + 1 weights were required for the output neuron. The range of weights, w, was from -17.9 to +14.4. The neural network had an rms error of 0.010 for nine evenly spaced training inputs and an rms error of 0.040 for test inputs taken halfway between the training inputs. In the second experiment the weight inputs, w, were found by backpropagation with quadratic synapses and then run in forward propagation with the quadratic synapses. The range of weights was from -10.2 to 9.0, corresponding to w from -6.6 to 6. The neural network had an rms error of 0.0098 for the training inputs and 0.052 for the test inputs. From these experiments we see that the test errors for trained quadratic synapses are comparable to those for linear synapses. In the third experiment the weights, w, derived from backpropagation with linear synapses were run in forward propagation with quadratic synapses. The neural network had an rms error of 0.32 for the training
inputs and 0.46 for the test inputs. This confirms our previous conclusion that weights derived from linear synapses or ideal multiplication are a poor choice to download to nonlinear synapses. We also made a more accurate approximation to the nonlinear characteristics of Figure 6 using nested sigmoids. That is, we derived a close fit to the nonlinear characteristics of the synapse by using ANNS to train a 1-2-1 neural network with linear synapses. The rms error was 0.01 for training inputs and 0.088 for test inputs. (A fourth-degree polynomial fit had an rms error of 0.03 for the fitting inputs, which were the same as the training inputs, but a worse rms error of 0.14 for the test inputs.) An advantage of this approach is that the resulting 1-5-1 neural network with nested sigmoid synapse functions could be run in forward propagation using Maple with the weight inputs, w, derived from backpropagation with quadratic synapses. The result was rms errors of 0.05 for the training inputs and 0.06 for the test inputs. The errors with test inputs are very close to those in our second experiment, when values of w derived from quadratic synapses were run in forward propagation with quadratic synapses. Thus, values of w derived from the quadratic approximation work well with the actual nonlinear multiplier characteristics.
6 Conclusions
We have shown that weights derived for synapses that are ideal multipliers, as is customarily the case in neural network simulators, do poorly when downloaded to nonlinear synapses, which are more representative of actual analog multiplier circuits. We have developed a theory that allows us to predict this behavior for quadratic and cubic nonlinearities. This way, one can impose constraints on system inputs to maintain low error rates. Further, we have indicated that weight inputs, w, for nonlinear synapses can be derived by backpropagation and have demonstrated, for quadratic synapses, that these give low errors. We have examined a commonly used MOS multiplier, the Gilbert multiplier, in some detail and have shown that its characteristics are reasonably described by a quadratic nonlinearity in w. Values of w obtained by backpropagation using quadratic synapses give comparable output errors for more accurate models of the synapse characteristics. Thus, an ideal multiplier is not necessary for neural networks as long as the synapse function is known. This will allow us to design smaller, nonlinear synapse circuits that require fewer transistors.
References
Dolenko, B. K., and Card, H. C. 1993. Neural learning in analogue hardware: Effects of component variation from fabrication and from noise. Elect. Lett. 23, 693-694.
Dundar, G., and Rose, K. 1995. The effects of quantization on multilayer neural networks. IEEE Trans. Neural Networks 6, 1446-1451.
Frye, R. C., Rietman, E. A., and Wang, C. C. 1991. Back-propagation learning and non-idealities in analog neural network hardware. IEEE Trans. Neural Networks 2, 110-117.
Gilbert, B. 1968. A precise four-quadrant multiplier with subnanosecond response. IEEE J. Solid-State Circuits 3, 365-373.
Holler, M., et al. 1989. An electrically trainable artificial neural network (ETANN) with 1024 "floating gate" synapses. Proc. IEEE/INNS Int. Joint Conf. Neural Networks, II-191-II-196.
Kub, F. J., et al. 1990. Programmable analog vector-matrix multipliers. IEEE J. Solid-State Circuits 25, 207-214.
Lont, J. B., and Guggenbühl, W. 1992. Analog CMOS implementation of a multilayer perceptron with nonlinear synapses. IEEE Trans. Neural Networks 3, 457-465.
Mead, C. 1989. Analog VLSI and Neural Systems. Addison-Wesley, Reading, MA.
Qin, C. C., and Geiger, R. L. 1987. A ±5V analog multiplier. IEEE J. Solid-State Circuits 22, 1143-1146.
Rossetto, O., et al. 1989. Analog VLSI synaptic matrices. IEEE Micro Mag., 56-63.
Tsividis, Y., and Satyanarayana, S. 1987. Analogue circuits for variable-synapse electronic neural networks. Elect. Lett. 23, 1313-1314.
Tsividis, Y., et al. 1986. Continuous-time MOSFET-C filters in VLSI. IEEE Trans. Circuits and Sys. CAS-33 (2), 125-140.
Xie, Y., and Jabri, M. A. 1991. Analysis of effects of quantisation in multilayer neural networks using statistical method. Elect. Lett. 27, 1196-1198.
Xie, Y., and Jabri, M. A. 1992. Analysis of the effects of quantisation in multilayer neural networks using a statistical method. IEEE Trans. Neural Networks 3, 334-338.
Received December 14, 1993; accepted November 6, 1995
Communicated by Alain Destexhe
Modeling Slowly Bursting Neurons via Calcium Store and Voltage-Independent Calcium Current
Teresa Ree Chay
Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA 15260 USA
Recent experiments indicate that the calcium store (e.g., endoplasmic reticulum) is involved in electrical bursting and [Ca²⁺]_i oscillation in bursting neuronal cells. In this paper, we formulate a mathematical model for bursting neurons, which includes Ca²⁺ in the intracellular Ca²⁺ stores and a voltage-independent calcium channel (VICC). This VICC is activated by a depletion of the Ca²⁺ concentration in the store, [Ca²⁺]_CS. In this model, [Ca²⁺]_CS oscillates slowly, and this slow dynamic in turn gives rise to electrical bursting. The newly formulated model is thus radically different from existing models of bursting excitable cells, whose mechanism owes its origin to the ion channels in the plasma membrane and the [Ca²⁺]_i dynamics. In addition, this model is capable of providing answers to some puzzling phenomena that the previous models could not (e.g., why cAMP, glucagon, and caffeine have the ability to change the burst periodicity). Using mag-fura-2 fluorescent dyes, it would be interesting to verify the predictions of the model that (1) [Ca²⁺]_CS oscillates in bursting neurons such as Aplysia neurons and (2) the neurotransmitters and hormones that affect the adenylate cyclase pathway can influence this oscillation.
1 Introduction
The action potentials of some Tritonia and Aplysia neurons appear in bursts consisting of fast spikes in regular sequences separated by silent periods in which the cell hyperpolarizes (Strumwasser 1968). The time scale of a burst is on the order of tens of seconds, while the spikes have a millisecond time scale. The bursts will continue with the same duration and frequency as long as the external environment is maintained at steady state. The bursting reflects movements of ions that pass through the channels embedded in the plasma membrane of these cells. The ionic movements are caused by the cellular events that take place in the cytoplasm and intracellular calcium stores. The neurotransmitters and hormones that affect the adenylate cyclase pathway can influence the bursting (Levitan and Levitan 1988; Scholz et al. 1989), which is a convincing indication that the bursting has its origin in cellular processes. In
Neural Computation 8, 951-978 (1996)
© 1996 Massachusetts Institute of Technology
952
Teresa Ree Chay
ganglion neurons, the Ca²⁺-induced Ca²⁺-release (CICR) channel in the Ca²⁺ store controls the rhythmic membrane activity under the influence of caffeine (Kuba and Nishi 1976). In the past, a number of workers have proposed mathematical models to explain bursting activity and its response to external environmental changes. Some bursting models implicate as the cause of bursting: (1) a Ca²⁺-sensitive K⁺ channel (Plant 1981; Chay 1983, 1985a, 1986), (2) a Ca²⁺ channel that is inactivated by intracellular Ca²⁺ ion (Chay 1987; Chay and Kang 1987; Rinzel and Lee 1987; Chay and Cook 1988; Canavier et al. 1991), or (3) both (Chay 1985b). In this class of models, [Ca²⁺]_i changes slowly, and this slow [Ca²⁺]_i change is what causes the slow underlying wave required for the bursting. Such models, however, are not in accord with recent experiments, which show that [Ca²⁺]_i responds fast (within tens of milliseconds) upon depolarization of the membrane potential in various cell types (Bokvist et al. 1995; Friel and Tsien 1994; Lipscombe et al. 1988; Llano et al. 1991; Ozaki et al. 1991; Wier et al. 1995; Valdeolmillos et al. 1989). To account for the fast [Ca²⁺]_i response that accompanies depolarization, several mathematical models have been proposed. These models incorporate (1) a Ca²⁺ channel that contains a time- and voltage-dependent gating variable, f, as in Hodgkin-Huxley's h-gate in the Na⁺ current (Chay 1990a; Chay and Lee 1990), (2) a gating process that describes the binding and unbinding of the Ca²⁺ ion at the highly charged receptor site of the Ca²⁺-sensitive K⁺ channel (Chay 1983), (3) a gating process that describes the binding and unbinding of the Ca²⁺ ion at the highly charged receptor site of the Ca²⁺ channel (Chay 1990b; Chay and Fan 1993), and (4) use-dependent blocking of the Ca²⁺ channel by a Ca²⁺-dependent protein (e.g., calmodulin) when the channel is in an open configuration (Chay 1993a).
The essence of these models is that while [Ca²⁺]_i is dynamically fast, the activation process of I_K,Ca or the inactivation gating process of I_Ca is slow (i.e., it takes place on the order of tens of seconds). The models give rise to [Ca²⁺]_i oscillation, which varies concomitantly with electrical bursting. However, the channel gating process is in general a fast process, not tens of seconds as required to generate the bursting. Thus, it seems unlikely that the underlying slow wave seen in cells such as Aplysia and Helix neurons owes its origin to the gating process of the pacemaker current. This class of models (i.e., the models based on the channel gating process) may be more appropriate for describing the bursting mechanisms involved in those neurons whose burst periodicity is on the order of a few hundred milliseconds, e.g., CA3 hippocampal pyramidal neurons, thalamocortical relay neurons, and thalamic reticular neurons. Indeed, several interesting models have been proposed for these cell types, which utilize the gating processes of a pacemaker current (Destexhe et al. 1994; McCormick and Huguenard 1992; Traub et al. 1991). Recent experiments indicate that in excitable as well as nonexcitable
Modeling Bursting Neurons via Calcium Store
953
cells, depletion of the luminal calcium concentration, [Ca²⁺]_CS, in a Ca²⁺ store (e.g., endoplasmic reticulum) activates a voltage-independent inward current (Parekh et al. 1993; Randriamampita and Tsien 1993; Putney 1993). Some experiments indicate that this current (known as I_CRAC, the calcium release-activated Ca²⁺ current) is carried by calcium ions through a voltage-insensitive channel (Hoth and Penner 1992). Depletion of luminal Ca²⁺ causes a release of a diffusible messenger molecule known as CIF (Ca²⁺ influx factor), and this molecule triggers influx of external calcium through the I_CRAC channel (Randriamampita and Tsien 1993). Some models implicate the GTP-bound G-protein as a slow dynamic variable (e.g., Cuthbertson and Chay 1991). Others implicate [Ca²⁺]_CS as a slow dynamic variable that gives rise to the [Ca²⁺]_i spikes (Kuba and Takeshita 1981; Meyer and Stryer 1988; Goldbeter et al. 1990; Somogyi and Stucki 1991). The slow dynamics of [Ca²⁺]_CS agree closely with the time scale required to generate slow bursting (i.e., tens of seconds). The emptying of calcium stores and its influence on electrical behavior and [Ca²⁺]_i in neurons were first studied by Kuba and Takeshita (1981), and their work was further refined by Friel and Tsien (1992). Chay (1991, 1993a) studied the effect of the calcium stores on electrical bursting activity and [Ca²⁺]_i oscillation in pancreatic β-cells. The present work is concerned with the effect of the calcium stores on bursting excitable cells, including the calcium release-activated Ca²⁺ current. By formulating a mathematical model, I am particularly interested in answering the following questions: What is the mechanism involved in the slow wave of neurons such as Tritonia and Aplysia neurons? How does the calcium released from the intracellular calcium store affect the shape of [Ca²⁺]_i? How do the agonists that are involved in the adenylate cyclase pathway affect the bursting?
How does the mode of oscillation change when the key channel property is altered? To seek answers to the above questions, I reformulate my earlier bursting neuronal model (Chay 1983, 1985a, 1990a,b) by incorporating the voltage-insensitive Ca²⁺ channel (VICC) and the [Ca²⁺]_CS dynamics. In the reformulated model, the luminal calcium concentration, [Ca²⁺]_CS, oscillates slowly, and this slow dynamic in turn gives rise to electrical bursting and [Ca²⁺]_i oscillation via the VICC. Although the upstroke of [Ca²⁺]_i is fast, its falling phase is slow because of a steady release of luminal calcium from the store. To demonstrate the essential role that the VICC plays in bursting, I first formulate a minimal model (Model 1 in Appendix I) that contains four types of currents: a voltage-dependent calcium current, a delayed rectifying potassium current, a voltage-independent calcium current that is activated by depletion of luminal Ca²⁺, and a leak current. This minimal model is based on my 1985 model (Chay 1985), where I replace the Ca²⁺-sensitive K⁺ current, I_K,Ca, by the voltage-independent calcium current (I_VICC). In this minimal model, I_VICC is solely responsible for the genesis of the slow wave. Later, I refine this minimal model by including
a voltage-dependent Ca²⁺ current (I_slow) that is inactivated by intracellular Ca²⁺ ion (see Model 2 in Appendix II). In the latter model, a combined effect of I_VICC and I_slow is responsible for the slow wave. The abstract of this work has appeared elsewhere (Chay 1991, 1995a). A synopsis of the work in which I_VICC replaces I_NS (the cationic nonselective inward current) has appeared in the NOLTA '95 Proceedings (Chay 1995b).
2 The Model
The plasma membrane contains a voltage-dependent Ca²⁺ current (VDCC), which opens when the membrane is depolarized, permitting extracellular Ca²⁺ ion to come into the cell. The membrane also contains a voltage-independent Ca²⁺ channel (VICC), which is activated when [Ca²⁺]_CS becomes low. Since the I_CRAC found in nonexcitable cells may not have exactly the same properties as that in excitable cells, we name the depletion-activated Ca²⁺ current in our model I_VICC. Figure 1 describes the interaction among these channels (VDCC and VICC), the receptors embedded in the plasma membrane, and the intracellular calcium store (CS). In this figure, g is the GTP-bound G-protein, Rec is a receptor, CCC is a calcium channel cluster that contains the VDCC, and CICR is a channel in the CS, which releases the luminal Ca²⁺ by the calcium-induced calcium-release mechanism (Fabiato 1983). Binding of the agonist (e.g., hormone and neurotransmitter) to the receptor raises levels of cyclic adenosine monophosphate (cAMP). These messengers in turn enhance the release of luminal calcium from the CS by activating the CICR channel. The CICR channel can also be influenced by such agents as caffeine. In addition to these Ca²⁺ channels, the plasma membrane contains a time-dependent delayed rectifying K⁺ channel. Note that [Ca²⁺]_i in this paper refers to the Ca²⁺ concentration localized in a microdomain (see Fig. 1). The change in membrane potential with time can be described by a parallel circuit of the Hodgkin-Huxley type (Hodgkin and Huxley 1952):

C_m dV/dt = -Σ I_ionic
(2.1)
where C_m is the membrane capacitance and I_ionic is the total ionic current. In the minimal model that we treat first (Model 1 in Appendix I), I_ionic consists of four components: a voltage-dependent Ca²⁺ current (I_VDCC), a voltage-independent calcium current (I_VICC), a delayed-rectifying time-dependent K⁺ current (I_K), and a background leak current (I_L). Appendix I gives the explicit forms of these four currents and their parametric values. As shown in this appendix, the VDCC has a voltage-dependent fast-activating m-gate and a somewhat slower voltage-dependent inactivating h-gate. The delayed rectifying K⁺ channel contains the n-gate, which opens when the membrane is depolarized.

Figure 1: Working hypothesis that explains how a voltage-dependent Ca²⁺ channel (VDCC), a voltage-independent Ca²⁺ channel (VICC), the two types of pumps, and the agonist-receptor complex are involved in controlling compartmentalized intracellular Ca²⁺ in the microdomain underneath the plasma membrane. Here, g, Rec, CCC, CICR, and CS stand for GTP-bound G-protein, receptor, calcium channel cluster, calcium-induced Ca²⁺-release channel, and Ca²⁺ store, respectively.

In the more refined model (Appendix II), we treat two hypotheses: (1) the VDCC consists of two types of current, N- and L-type currents (hypothesis 1), and (2) the VDCC consists of only an L-type current; in addition, the plasma membrane contains Na⁺ channels, which are distributed evenly throughout the membrane (hypothesis 2). The L-type channel contains a d-gate, which opens slowly upon depolarization, and an f-gate, which closes when [Ca²⁺]_i becomes high (Kramer and Zucker 1985). The N-type current in hypothesis 1 follows the kinetic mechanisms described by Nowycky et al. (1985), i.e., it contains a fast-activating voltage-dependent m-gate and a somewhat slower voltage-dependent inactivating h-gate. The Na⁺ channel follows the same mechanism as that of the N-type. We name the L-type current I_slow and the N-type or Na⁺ current I_fast. The leak current in Model 1 is a true leak, while that in Model 2 is a background voltage-independent inward-rectifying K⁺ current, I_Kb. The gating variables, h, d, and n, are dynamic variables in that their expressions can be obtained by solving the following first-order differential equation (Hodgkin and Huxley 1952):

dy/dt = (y_∞ - y)/τ_y
(2.2)
where y represents h, d, and n. In the above equation, y_inf is the steady-state value of y and is assumed to take the Boltzmann form

y_inf = 1 / {1 + exp[(V_y - V)/S_y]}

where V_y is a half-maximal potential, and S_y is the slope of y_inf at V = V_y. In equation 2.2, tau_y is the relaxation time constant, which can be expressed by

tau_y = 1 / (lambda_y {exp[a_y(V - V_y)/S_y] + exp[-(1 - a_y)(V - V_y)/S_y]})
where a_y is a value ranging from zero to 1. Note that if a_y = 0.5, tau_y forms a symmetric bell-shaped curve. While the kinetic mechanism of the h-, d-, and n-gates follows equation 2.2, the opening of the f-gate follows a simple Michaelis-Menten form as shown in Appendix II. The gating variable, p, of the VICC has not yet been characterized fully (Abstract, The 1995 Gordon Research Conference on Calcium Signalling, Henniker, NH, USA, 1995). What is known about this current is that its activity reaches a maximum when [Ca2+]_CS is low and a minimum when [Ca2+]_CS is high. For simplicity, p is assumed to take a Michaelis-Menten form:

p = K_VICC / (K_VICC + [Ca2+]_CS)
where K_VICC is the dissociation constant for the Ca2+ ion from the receptor sites in the VICC. A rise of [Ca2+]_i is due to the following two sources: (1) influx of extracellular Ca2+ ion through the VDCC and (2) a release of luminal Ca2+ by the calcium-releasing channel (CICR in Fig. 1). The fall of [Ca2+]_i is also due to two sources: efflux of Ca2+ ion via the Ca2+-ATPase in the plasma membrane and sequestration of intracellular Ca2+ into the Ca2+ store. Accordingly, the change of [Ca2+]_i with time can be described by Chay (1993):

d[Ca2+]_i/dt = rho_cy {-phi1 I_VDCC - phi2 I_VICC + J_cicr - k_p [Ca2+]_i - k_csp [Ca2+]_i}    (2.3)
where J_cicr describes a release of luminal Ca2+ through the calcium-release channel, and rho_cy measures the free calcium content in the cytosol (see its explicit expression given in Appendix III). The first and second terms in the curly bracket describe the situation where external Ca2+ enters the microdomain through the VDCC and VICC, respectively, where phi1 and phi2 are the conversion factors that convert the electrical gradient to the chemical gradient (and include the surface-to-volume ratio). Different symbols for the phi's were used here since I_VICC may directly communicate with the store. In such a case phi2 = 0. The fourth term includes the Ca2+-ATPase pump activity embedded in the plasma membrane as well as diffusion of [Ca2+]_i from the microdomain, where k_p is the sum of the rate constant for the pump and the diffusion rate. The final term is the Ca2+-ATPase pump activity embedded in the membrane of the calcium store, where k_csp is the rate of the sequestration of [Ca2+]_i. Pump activity in Model 1 is assumed to depend on [Ca2+]_i (see Appendix I for the expression of k_csp). For Model 2, K_csp is assumed to be much less than [Ca2+]_i so as to eliminate the free parameter K_csp. In any case these models are not very sensitive to the expression used for k_csp. In this model, a rise of [Ca2+]_CS is due to influx of intracellular Ca2+ by the action of the Ca2+-ATPase embedded in the membrane of the store. On the other hand, a decrease of [Ca2+]_CS owes its origin to the Ca2+-releasing channel in the calcium store by the CICR mechanism. Thus, the change of [Ca2+]_CS with time can be described by the following equation:

d[Ca2+]_CS/dt = rho_cs {k_csp [Ca2+]_i - J_cicr - phi_cs I_VICC}    (2.4)
where rho_cs measures the free calcium content in the calcium store (see Appendix III). The third term inside the curly bracket describes the case where the VICC channel communicates directly with the calcium store, permitting external Ca2+ to enter the store. If there is no direct communication, phi_cs is set to zero. Model 1 assumes that I_VICC has no direct communication with the store. Thus, phi1 = phi2 and phi_cs = 0. On the other hand, Model 2 assumes a "direct" pathway between the extracellular medium and the intracellular store. This makes phi2 = 0 and phi_cs different from zero. The latter assumption (i.e., the direct pathway) can be easily removed without affecting the results. Luminal Ca2+ is released to the microdomain from the calcium store via the CICR mechanism, and we propose that J_cicr takes the following mathematical expression:

J_cicr = k_cicr ([Ca2+]_i / (K_cs + [Ca2+]_i)) ([Ca2+]_CS - [Ca2+]_i)    (2.5)
where k_cicr is the release rate of [Ca2+]_CS, and K_cs is the dissociation constant for [Ca2+]_i. We note that Kuba and Takeshita (1981) and Somogyi and Stucki (1991) used a higher power for the [Ca2+]_i so as to generate
the [Ca2+]_CS oscillation. Although their expression can also be used for our J_cicr, we used equation 2.5, which has a lesser sensitivity for [Ca2+]_i. To summarize, Model 1 contains five dynamic variables, V, h, n, [Ca2+]_i, and [Ca2+]_CS, and Model 2 contains one additional variable, d. The rates of the opening of the h-, d-, and n-gates are determined by the relaxation time constants, tau_h, tau_d, and tau_n, respectively. How fast [Ca2+]_i and [Ca2+]_CS change with time is determined by rho_cy and rho_cs, respectively. In particular, rho_cs controls the periodicity of bursting, and rho_cy controls the [Ca2+]_i spike amplitude. The amplitude of the electrical spike can be influenced by tau_h. It may be informative to mention that the number of differential equations can be reduced (without changing the results) by using a rapid equilibrium assumption for the [Ca2+]_i dynamics (see Appendix III for the derivation).

3 Results
Figure 2 is the result obtained using Model 1 of Appendix I. Here, the lower traces show the time course of membrane potential (solid) and that of [Ca2+]_CS (dash). The upper trace, on the other hand, shows the time course of [Ca2+]_i. The electrical burst has an appearance very similar to that observed in the Helix neurons. Note that [Ca2+]_CS oscillates slowly, and this slow oscillation causes the burst of activity in V and [Ca2+]_i. In conjunction with the [Ca2+]_CS oscillation, the membrane potential oscillates between -47 mV (the silent phase) and -43 mV (the plateau phase). On the top of the plateau, fast electrical spikes appear. Note that in this model, variations of up to 14% in [Ca2+]_CS (between 21.5 and 25 uM) cause the bursting behavior. Although electrical bursts simulated from the present model are very similar to those generated from my earlier model (Chay 1985a), the shape of [Ca2+]_i is quite different. Note that [Ca2+]_i increases rapidly during the upstroke of V and decreases slowly during the silent phase. During the plateau, fast Ca2+ spikes are generated with an amplitude of about 0.2 uM. Note also that [Ca2+]_CS increases slowly during the active phase, reaching a maximum just prior to the termination of the plateau, and decreases during the silent period, reaching a minimum at the beginning of fast depolarization. This behavior resembles that of [Ca2+]_i in my 1985a model. In this model, I_VICC acts as if it is a hyperpolarization-activated inward current. This is due to the gating variable p, which reaches a maximum during the silent period (since [Ca2+]_CS becomes low) and a minimum during the active period (since [Ca2+]_CS becomes high). In sinoatrial nodal cells, this type of current (known as I_h) plays a central role in the genesis of a pacemaker current (Yanagihara et al. 1990). It should be mentioned, however, that the I_h in the SA nodal cell is a voltage-sensitive current, while our I_VICC is voltage-insensitive.
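The store feedback just described can be made concrete with a small numerical fragment. The sketch below is illustrative code of mine, not part of the original paper (the function names are my own); it evaluates the Michaelis-Menten gate p and the CICR release flux of equation 2.5 at the Model 1 store levels quoted above:

```python
# Small numerical illustration (my own code, not from the paper) of the
# store feedback discussed above: the VICC gate p (Michaelis-Menten in
# [Ca2+]_CS) and the CICR release flux of equation 2.5.  K_VICC = 7 uM,
# K_cs = 1 uM, and k_cicr = 0.3 /s are the Model 1 values of Appendix I.

K_VICC = 7.0      # uM, Ca2+ dissociation constant at the VICC
K_cs = 1.0        # uM, [Ca2+]_i dissociation constant at the CICR channel
k_cicr = 0.3      # 1/s, CICR release rate

def p_gate(ca_cs):
    """VICC activity: maximal when the store is empty, minimal when full."""
    return K_VICC / (K_VICC + ca_cs)

def j_cicr(ca_i, ca_cs):
    """Calcium-induced Ca2+ release from the store (equation 2.5)."""
    return k_cicr * ca_i / (K_cs + ca_i) * (ca_cs - ca_i)

# The silent-phase store level (21.5 uM) gives a larger p -- hence a
# larger I_VICC -- than the end-of-plateau level (25 uM), which is what
# lets I_VICC behave like a hyperpolarization-activated pacemaker current.
print(p_gate(21.5) > p_gate(25.0))    # True
print(j_cicr(0.4, 21.5) > 0.0)        # release is positive while store > cytosol
```

Because p falls as the store refills, I_VICC is strongest during the silent phase, which is exactly the pacemaker-like behavior described in the text.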
Figure 2: Electrical activity (the solid line in the lower trace), the intracellular Ca2+ concentration [Ca2+]_i (the upper trace), and oscillation of the store Ca2+ concentration [Ca2+]_CS (dashes in the lower trace), based on Model 1 presented in Appendix I. The parametric values used for the simulation are listed in Appendix I. In this model, the upstroke of the electrical spike is due to the fast activating m-gate, and the down-stroke is due to the combined effect of the h- and n-gates, which open slower than the m-gate. During the plateau, intracellular Ca2+ is sequestered into the calcium store, which raises the level of [Ca2+]_CS. The termination of the plateau is due to a decrease in I_VICC, which results when [Ca2+]_CS reaches a maximum. During the silent phase, [Ca2+]_CS decreases slowly, which in turn increases I_VICC. When I_VICC becomes sufficiently large, a new burst can be initiated. In this model, (1) the burst periodicity is controlled by rho_cs, (2) the electrical spike height is controlled by both tau_h and tau_n, and (3) the plateau phase is controlled by k_p, i.e., the larger k_p is, the longer the plateau fraction becomes.

Figure 3 shows the time course of membrane potential V (solid), [Ca2+]_CS (dashes), and [Ca2+]_i (the upper dashed trace) predicted by Model 2 of Appendix II. Frame A in this figure is based on hypothesis 1, that the voltage-dependent calcium channel consists of both fast and slow channels. Frame B is based on hypothesis 2, that the VDCC consists of only a slow channel and the fast channel is a Na+ channel. In the case of hypothesis 1, I_VDCC in equation 2.3 is replaced by I_fast + I_slow and k_p is set at 0.45 sec-1. In the case of hypothesis 2, I_VDCC is replaced by I_slow, and k_p is changed from 0.45 to 0.36 sec-1. This change of k_p was made so that the
number of spikes remains the same for both hypotheses. Note that both hypotheses give rise to electrical bursting, which has an appearance very similar to that observed in the neuronal bursting of Strumwasser (1968) and others (Junge and Stephens 1973; Gorman et al. 1982). Note also that the shape of electrical bursting is different from that of Model 1 in that the spikes undershoot the slow-wave potential, while in Model 1 the spikes remain above the plateau potential (see Fig. 2). The amplitude of the [Ca2+]_i spike in Frame A is much larger than that in Frame B, and the interval between bursts is much shorter for Frame A. The enlarged [Ca2+]_i amplitude in Frame A reveals the role that I_fast plays in raising the [Ca2+]_i level temporarily in the microdomain. The shortening of the burst interval in Frame A is due to both the difference in pump rate and the fact that more calcium enters. One cycle of [Ca2+]_CS oscillation that occurs in conjunction with electrical bursting and [Ca2+]_i oscillation can be explained as follows. During the silent phase, [Ca2+]_CS decreases slowly. The decrease of [Ca2+]_CS is slow since I_VICC allows external Ca2+ to enter the calcium store. During the same period, [Ca2+]_i also decreases slowly. This decrease of [Ca2+]_i is slow since luminal Ca2+ is released steadily from the calcium store, which does not allow [Ca2+]_i to decrease fast. As [Ca2+]_i decreases, the bound Ca2+ ion is slowly released from the receptor site of I_slow. Once [Ca2+]_i reaches the lowest point of 0.51 uM, I_slow (together with I_VICC) gains sufficient strength to bring the potential to the threshold level of -45 mV. This opens the m-gate of I_fast. The fast electrical upstroke is initiated, and a sudden rise of [Ca2+]_i follows. A fraction of cytosolic Ca2+ is pumped back into the calcium store during this period of high [Ca2+]_i. Between one spike and the next spike, [Ca2+]_CS decreases slightly.
This decrease is due to the opening of the Ca2+-releasing channel by the CICR mechanism, which releases luminal Ca2+ back to the microdomain (see Fig. 1). During the plateau, [Ca2+]_CS rises incrementally. When [Ca2+]_CS and [Ca2+]_i reach certain levels, both I_VICC and I_slow inactivate, and the plateau terminates. The cycle repeats. Note that in Model 2, [Ca2+]_CS changes between 191 and 196 uM, which is much less variation than that in Model 1. This 2.5% variation is not sufficient to elicit bursting alone. On the other hand, [Ca2+]_i changes over 30%, which in turn induces oscillation in I_slow. Thus, it is the combined effect of I_slow and I_VICC that causes bursting in Model 2. As to why the magnitudes of [Ca2+]_CS in the two models differ by more than 10-fold, it should be noted that K_VICC for Model 2 is 10 times greater than that for Model 1 (i.e., 70 vs. 7 uM). The use of two different values for K_VICC is to demonstrate that the essential features of our results are unchanged by K_VICC, and the only significant change is that the greater value of K_VICC gives rise to a higher level of [Ca2+]_CS. Using mag-fura-2 AM fluorescent dye (Chatton et al. 1995), it will soon be possible to find the luminal Ca2+ content, and from this content the right value of K_VICC can be extracted.
[Figure 3: time courses of V (solid), [Ca2+]_CS (dashes), and [Ca2+]_i (upper trace) from Model 2; Frame A: hypothesis 1, Frame B: hypothesis 2.]
Figure 4 reveals how the five currents behave during the bursting shown in Figure 3A. The spike activity is generated by I_fast (see the bottom trace), which contains a fast activating m- and a somewhat slower inactivating h-gate. The outward delayed-rectifying I_K controls the electrical plateau level, i.e., decreasing g_K raises the plateau to such a level that I_fast no longer inactivates (which leads to the depolarized state), and increasing g_K lowers the plateau to such a level that I_slow no longer activates (which leads to the repolarized state). The two currents, I_slow and I_K, are delicately balanced to control the plateau phase (see the first and second traces). The inward rectifying K+ current I_kb together with I_VICC controls the silent phase (see the third and fourth traces). That is, the membrane potential does not fall below -62 mV because of activation of I_VICC during the silent phase. The effect of rho_cy is studied in Figure 5 using Model 2. Note that rho_cy has little effect on the amplitude of the electrical spike and the levels of the silent and plateau phases of electrical bursting. It also exerts little influence on the amplitude of the [Ca2+]_CS oscillation (result not shown). The amplitude of the [Ca2+]_i spikes, however, decreases drastically with decreasing rho_cy. When rho_cy is equal to 1, [Ca2+]_i bursts in parallel to electrical bursting. The amplitude of a [Ca2+]_i spike is as high as 0.5 uM (see the top trace). When rho_cy decreases to 0.1 (see the middle trace), [Ca2+]_i oscillates with a much smaller spike amplitude. During each successive spiking, the amplitude increases gradually. When rho_cy is further reduced to 0.01 (see the bottom trace), [Ca2+]_i gradually increases during the active phase until it reaches 0.63 uM. Compare the bottom trace with the top trace. The electrical burst in the bottom trace has a shorter spike interval and a longer burst interval. Clustering of spikes causes a rise in [Ca2+]_i, and this in turn induces the burst interval to become longer. Model 1 also gives a result similar to Figure 5 when rho_cy is varied (result not shown). Figure 6 shows the effect of rho_cs on electrical bursting and [Ca2+]_CS oscillation. This figure was simulated using Model 2.
Note that an n-fold increase in rho_cs results in almost an n-fold increase in the frequency of bursting. With decreasing rho_cs, the number of spikes increases and the burst-to-burst interval also increases. Note also that the amplitude of [Ca2+]_CS increases slightly with increasing rho_cs. This simulation thus shows that the calcium store is actively involved in bursting and that the burst periodicity is controlled by rho_cs. Model 1 also gives a result similar to Figure 6 (not shown here). We note that rho_cs takes a value no greater than 1, but we used a value greater than 1 (the top trace) to demonstrate the effect of rho_cs on the bursting. The following question arises: How do agents such as caffeine and ryanodine (which can alter the property of the CICR channel) affect bursting? We examine their effect by varying k_cicr, and the result is shown in Figure 7. When k_cicr is low (i.e., 1.5 sec-1), the membrane potential is at the resting level of -70 mV, [Ca2+]_i at 0.3 uM, and [Ca2+]_CS at 337 uM. When k_cicr is raised to 2.5 sec-1, V is depolarized to -62.7 mV, [Ca2+]_i rises to 0.45 uM, and [Ca2+]_CS falls precipitously to 210 uM (see the top trace). Between k_cicr = 2.7 and 2.9 sec-1, bursting arises (the second and third traces), which eventually leads to continuous spiking when k_cicr is
Figure 4: The five currents that participate during bursting in Frame A in Figure 3. The unit of the current is uA/cm2. This figure and the following four figures (i.e., Figs. 4-8) are simulated based on hypothesis 1 of Model 2 (i.e., Frame A of Fig. 3).
Figure 5: To show how rho_cy affects the [Ca2+]_i spike. From the top trace to the bottom, the rho_cy values used are 1.0, 0.1, and 0.01. Here the solid line is the membrane potential, and the dashes the intracellular calcium concentration. Note that the pattern of [Ca2+]_i oscillations is very different for different rho_cy values.
raised to 3.0 sec-1 (see the bottom trace). Between the third and fourth traces exists bursting-chaos, which leads to an inverse period-doubling sequence (the result not shown). Note in the second and third traces that as k_cicr increases, the frequency of bursting increases and the level of [Ca2+]_CS decreases. Why k_cicr modulates the frequency of bursting can be readily understood by examining the dashed curve (i.e., [Ca2+]_CS). At low k_cicr values (i.e., low agonist concentration), luminal Ca2+ is released from the store very slowly. Because of the slow release, [Ca2+]_CS stays at a very
Figure 6: To show how rho_cs controls the period of a burst. From the top trace to the bottom, the rho_cs values used are 2.0, 1.0, and 0.5. Note that the burst period increases proportionally to rho_cs. Here the solid line is the membrane potential, and the dashes show the luminal calcium concentration.

Figure 7: The effect of varying the release rate of the CICR channel, as modeled by k_cicr. Initially the cell is at rest with k_cicr = 1.5 sec-1. At t = 0, k_cicr is increased to 2.5, 2.7, 2.9, and 3.0 from the top to the bottom traces. The increase in k_cicr gives rise to repetitive spiking from the resting state via bursting.

high level. Since [Ca2+]_CS is so high, it requires a longer time before I_VICC activates. At high k_cicr values (i.e., high agonist concentration), emptying takes place much faster, and this in turn results in a high frequency of oscillation. It is informative to learn the role of I_VICC in the electrical activity of bursting neurons. This information is provided by varying g_VICC, and the result obtained using Model 2 is shown in Figure 8. When I_VICC carries a weak current (i.e., g_VICC = 3.3 uS/cm2), the cell is in a resting state of -70 mV (result not shown). When I_VICC gains sufficient strength (i.e., g_VICC = 3.4 uS/cm2), the cell bursts with a long silent phase between bursts (see the top trace). When g_VICC is increased to 4.0 (second trace), (1) the interval between bursts decreases drastically, (2) the spikes are more widely separated (although the number of spikes remains the same), (3) the silent and plateau levels become closer, and (4) the [Ca2+]_CS level rises. When g_VICC is increased further, the spike interval becomes more widely separated, which leads to chaotic bursting. A further increase in g_VICC initiates an inverse period-doubling sequence. In the fourth trace, doublets in the period-doubling sequence are shown. After passing through this chaotic regime, the system enters a repetitive spiking regime (the bottom trace). A series similar to this was found
Figure 8: The role of I_VICC in controlling the bursting structure. From the top to the bottom trace, the g_VICC values used for the simulations are 3.4, 4.0, 4.1, 4.15, and 4.2 uS/cm2. Note the appearance of chaos in the third trace, which leads to an inverse period-doubling sequence.
(1) in my 1985a model, where g_K,Ca is used as the bifurcation parameter, and (2) in my 1990 model, where g_slow is used as the bifurcation parameter (Chay et al. 1995). We note that the various modes of oscillation shown in this figure have been observed by Hayashi and Ishizuka (1992) in the Onchidium pacemaker neuron, when a biased dc current is used as a bifurcation parameter. How bursting transforms to spiking via chaos can be seen more clearly by constructing a bifurcation diagram. Figure 9 is constructed using Model 1, where g_VICC is used as the controlling parameter. The points in this figure were obtained by recording all the [Ca2+]_CS values (for a given g_VICC)
whenever the upstroke of V passes the line V = -40 mV. In recording these points, the first several tens of cycles were thrown away so as to include only those in the limit cycle. In this figure, the ordinate displays [Ca2+]_CS, and the abscissa displays g_VICC. The four frames at the bottom are enlarged portions of the chaotic regimes in the diagram: the regions that lie between (1) the three spikes and the four spikes, (2) the four spikes and the five spikes, (3) the five spikes and the six spikes, and (4) the six spikes and the repetitive spiking. As in my earlier model (Chay 1985a), the route to chaos starts with a period-adding sequence (i.e., one, two, three, four, five, and six spikes) and ends with an inverse period-doubling route. The four chaotic regimes shown at the bottom are not simply regions of utter chaos; there are several regular periodic states embedded in each of these chaotic regimes. The right-most chaotic regime (where the six-spike bursting transforms to repetitive spiking) contains a very complex structure. As one moves from the left to the right in this region, the route from order into chaos follows the Feigenbaum diagram of the period-doubling scenario (Feigenbaum 1983). Out of repetitive spiking two branches bifurcate (period-2), out of these branches two branches bifurcate again (period-4), and then two branches bifurcate out of each of these again (period-8). We can follow the bifurcating tree up to period-16; afterward spiking-chaos sets in. A crisis transition (Grebogi and Ott 1983) sets in around g_VICC = 1.64, where spiking-chaos suddenly transforms to bursting-chaos. Even within the spiking-chaotic regime, there arises a crisis transition, demonstrating the fractal nature of chaos (see the inset of Frame D). In the bursting-chaotic regime, we see a variety of beautiful structures. There exist six bands resulting from points not being uniformly distributed over this chaotic regime.
The system ends with regular bursting after passing through an inverse period-doubling route (where each of the six branches inversely bifurcates). A square-wave bursting can arise from Model 1 when the lambda_h value of 16 sec-1 is raised to 19 sec-1, and this is shown in Figure 10. This burst resembles a square-wave burst seen in pancreatic beta-cells in the presence of glucose (Dean and Mathews 1970). Consistent with the model prediction, spectrophotometric measurements of [Ca2+]_i reveal a fast increase in [Ca2+]_i that accompanies the upstroke of V and a slowly decaying [Ca2+]_i during the silent phase (Valdeolmillos et al. 1989), reaching a minimum just before the upstroke. The observed [Ca2+]_i spikes have a much smaller amplitude than those seen in this figure. However, one should remember that the experimental [Ca2+]_i is that observed in a whole islet, which consists of beta-cells as well as other cell types. The simulated [Ca2+]_i, on the other hand, is localized within a single cell (see Fig. 1). In addition, a presently available spectrophotometer is not sensitive enough to detect a fast Ca2+ spike that occurs on the millisecond time scale. Considering these facts, the shape of the simulated [Ca2+]_i mimics the experiment quite well.
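The Figure 9 recipe — integrate, detect each upstroke crossing of V = -40 mV, discard the transient cycles, and record [Ca2+]_CS at every remaining crossing — can be sketched generically. The code below is my own illustrative fragment, not the author's program; the sinusoidal "trajectory" is only a stand-in for an actual simulated time series:

```python
import math

# Generic sketch (mine, not the author's code) of the Figure 9 recipe:
# record [Ca2+]_CS at every upward crossing of V = -40 mV and throw away
# the first several tens of cycles so only the limit cycle remains.

def bifurcation_points(V, ca_cs, v_section=-40.0, discard=30):
    """V and ca_cs are equal-length sampled time series; returns the
    ca_cs values recorded at each upstroke crossing of v_section,
    with the first `discard` crossings dropped as transients."""
    recorded = []
    for k in range(1, len(V)):
        if V[k - 1] < v_section <= V[k]:      # upward crossing only
            recorded.append(ca_cs[k])
    return recorded[discard:]

# Stand-in trajectory: a "membrane potential" crossing -40 mV once per
# second and a slow store oscillation around 22 uM (any simulated series
# from the model would work here).
t = [0.01 * k for k in range(20000)]                      # 200 s, dt = 10 ms
V = [-45.0 + 10.0 * math.sin(2.0 * math.pi * ti) for ti in t]
ca = [22.0 + 0.5 * math.sin(2.0 * math.pi * ti / 7.0) for ti in t]
pts = bifurcation_points(V, ca)
print(len(pts))        # 200 crossings minus the 30 discarded transients
```

Repeating this for each value of g_VICC and plotting the recorded points against g_VICC yields a diagram of the kind shown in Figure 9.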
Figure 9: The role of I_VICC in the bifurcation structure using Model 1. The top panel: the bifurcation diagram, which plots [Ca2+]_CS versus g_VICC, the maximal conductance of the VICC. The bottom panel shows the four chaotic regimes on an expanded scale. Frame A reveals the bifurcation structure in the transition zone between the three-spike and four-spike bursting, Frame B reveals that between the four-spike and five-spike bursting, and Frame C reveals that between the five- and six-spike bursting. Frame D shows the bifurcation structure when the six-spike bursting transforms to the repetitive spiking. The inset in the top panel reveals a period-doubling cascade when the six-spike bursting moves to the right, and the inset in Frame D in the bottom panel reveals a crisis transition that exists in the spiking-chaotic regime.
Figure 10: A square-wave bursting often seen in excitable cells such as the pancreatic beta-cell. The parameters used for the simulations are the same as those listed in Appendix I except that lambda_h = 19 sec-1 is used for the simulation.

4 Discussion
Intracellular Ca2+ ions are essential for initiating and maintaining cellular events, such as secretion of insulin in pancreatic beta-cells. According to the two models presented in this paper, the slow depolarization of V is brought about by the VICC, which is activated by depletion of the calcium store. When the membrane is sufficiently depolarized, the VDCC activates, which permits external calcium to enter the cell. These events can be seen very clearly in the upper trace of Figure 10. Here, the rapid rise of [Ca2+]_i during the upstroke of V is mainly due to Ca2+ ion entering through the VDCC, and a slow exponential drop of [Ca2+]_i (instead of a drastic drop) during the silent period is due to (1) a slow release of luminal Ca2+ via the CICR channel and (2) extracellular Ca2+ entering through the VICC (which activates slowly during the silent period). In this model, I_VICC becomes a maximum during the silent period and a minimum during the plateau. Thus, it acts as if it is a hyperpolarization-activated inward current, such that it is activated during the hyperpolarized phase and inactivated during the depolarized phase. In the past, the C-operated model (where [Ca2+]_i varies slowly) has been widely used by mathematical modelers. The fundamental assumption in these models is that [Ca2+]_i is a slow dynamic variable, which gives rise to bursting (as demonstrated in the bottom trace of Fig. 5).
This does not agree with recent experiments, which show that [Ca2+]_i is a fast variable. To account for the fast change of [Ca2+]_i, the gate-operated model (where the gating process is a slow dynamic variable) was proposed. The fundamental assumption in the latter model is that a gating process takes place on the time scale of tens of seconds. There is some doubt about the validity of this assumption, since experimentally known gating processes take place on the time scale of milliseconds. In the model presented here, it is not [Ca2+]_i but [Ca2+]_CS that is a slow dynamic variable, and there is ample experimental as well as theoretical evidence to support this hypothesis (Kuba and Takeshita 1981; Meyer and Stryer 1988; Goldbeter et al. 1990; Somogyi and Stucki 1991). The present model predicts that although [Ca2+]_i is a fast dynamic variable, there is a slow component embedded in the process. This slow component owes its origin to the slow release of luminal calcium from the calcium stores. In Model 1, I_VICC is solely responsible for the bursting via the slowly varying [Ca2+]_CS, whereas in Model 2 the combined effect of I_VICC and I_slow is responsible for bursting. Levitan and Levitan (1988) have reported that a high concentration of serotonin (5-HT) depolarizes the silent phase of the burst cycle and enhances the burst activity. Increasing the concentration of 5-HT eventually leads to repetitive spiking. They suggested that this is due to cyclic AMP (produced by activation of adenylate cyclase by serotonin), which enhances the "subthreshold" I_Ca. Our results shown in Figure 8 (that enhancing I_VICC can give rise to repetitive spiking via depolarization of the silent phase) suggest that cAMP may enhance I_VICC via phosphorylation of this channel. Another hypothesis, as suggested in Figure 7, is that cAMP may affect the calcium-releasing channel in the calcium store.
It should be emphasized that both the C- and gate-operated models may be good models for describing the mechanisms involved in bursting neurons whose burst period is on the order of a few hundred milliseconds, e.g., thalamocortical (TC) relay neurons, thalamic reticular (RE) neurons, and CA3 hippocampal pyramidal neurons. In the case of TC neurons, the bursting can be described by a low-threshold calcium current I_T, a hyperpolarization-activated cationic nonselective current (I_h), and a few other currents (McCormick and Huguenard 1992). In a recent series of papers (Destexhe et al. 1994 and references therein), TC and RE neurons are modeled by I_T together with two Ca2+-activated currents, I_K,Ca and I_CAN (a nonspecific cation current activated by intracellular calcium). In the case of CA3 hippocampal pyramidal neurons, the bursting can be described by the gating behavior of the K(AHP) channel, a channel responsible for a long-duration Ca2+-dependent afterhyperpolarization K+ current (Traub et al. 1991). If I_Ca in my 1985a model is assigned as I_T and the gating processes in that model were made 500 times faster, bursting similar to that observed in TC and RE neurons can be generated. If I_K,Ca in my 1986 models is assigned as I_K(AHP) and the [Ca2+]_i dynamic
is made faster, bursting similar to that obtained by Traub et al. (1991) can also be generated. In summary, I have shown the importance of the role that luminal calcium plays in electrical bursting and [Ca2+]_i oscillation. This was achieved by formulating two models: The first model contains a Ca2+ channel that activates and inactivates rapidly upon depolarization and a voltage-independent Ca2+ channel that activates when [Ca2+]_CS becomes low. The second model contains (in addition to all the features in Model 1) a voltage-activated Ca2+ channel, I_slow, that is inactivated by [Ca2+]_i. Using Model 1, I showed how the slow cyclic variation of [Ca2+]_CS generates the bursting observed in neurons such as Helix. Using Model 2, I showed that a slow decay of [Ca2+]_i affects the bursting by activating I_slow. This slow decay is due to a steady release of luminal Ca2+ through the CICR channel, which does not allow [Ca2+]_i to decrease drastically between bursts. Using mag-fura-2 fluorescent dyes, Chatton et al. (1995) found that [Ca2+]_CS oscillates in hepatocytes. The same technique can provide the luminal free Ca2+ content in bursting neurons, and it will be interesting to find whether [Ca2+]_CS does oscillate on the time scale of bursting neurons as predicted in this work. By simulating Figure 7, I have suggested how neurotransmitters and hormones control the burst periodicity of excitable cells. This simulation may give answers to why some agonists (e.g., cAMP, glucagon, and somatostatin) have the ability to change the burst periodicity. In addition, I showed how [Ca2+]_i, [Ca2+]_CS, the intracellular calcium store, and the ionic channels interact to give rise to electrical bursting, spiking, and chaos. The model's prediction that the route to chaos is a period-adding sequence has been demonstrated by Hayashi and Ishizuka (1992) in the Onchidium pacemaker neuron, as these workers used a depolarizing biased current as the bifurcation parameter.
It will be interesting to use a drug that can alter the conductance of the VICC and demonstrate the existence of such a sequence, as well as the importance this channel plays in the bursting and chaotic modes.

Appendix I
Model 1: This model is updated from my three-variable model (Chay 1985a), where the calcium-sensitive K+ current is replaced by I_VICC. The model contains the following four currents:

I_fast = g_fast m∞³ h (V − V_Ca)
I_VICC = g_VICC p (V − V_Ca)
I_K = g_K n⁴ (V − V_K)
I_L = g_L (V − V_L)
where V_Ca, V_K, V_L are the reversal potentials for the respective channels. Since the activation of I_fast is very fast, we replaced m by its steady-state value m∞.

Modeling Bursting Neurons via Calcium Store

Although V_K and V_L may be assumed to be constant, V_Ca is not, because of the changes of [Ca2+]i. Thus, we expressed it by a Nernst potential of the following form:

V_Ca = (RT/2F) ln([Ca2+]ext / [Ca2+]i)

where [Ca2+]ext is the external calcium concentration. In this model, we assume that the opening of the VICC brings external Ca2+ into the microdomain and that the calcium pump in the calcium store is activated by the intracellular calcium ion:

k_csp = k̄_csp [Ca2+]i² / (K_csp² + [Ca2+]i²)
where k̄_csp is the rate constant independent of [Ca2+]i, and K_csp is the dissociation constant of the Ca2+ ion from the receptor site of the pump. The basic parameter values in the model are as follows: C_m = 1 µF cm⁻², g_fast = 600 µS cm⁻², g_VICC = 1.5 µS cm⁻², g_K = 500 µS cm⁻², g_L = 6 µS cm⁻², V_K = −75 mV, V_L = −60 mV, V_m = −25 mV, S_m = 9 mV, V_h = −47 mV, S_h = −7 mV, V_n = −18 mV, S_n = 14 mV, λ_h = 16 sec⁻¹, λ_n = 25 sec⁻¹, σ_h = 0.5, σ_n = 0, K_VICC = 7 µM, K_csp = 0.5 µM, K_cs = 1.0 µM, k_cs = 5.0 sec⁻¹, k̄_csp = 30 sec⁻¹, k_out = 0.3 sec⁻¹, φ₁ = φ₂ = 0.02, φ_cs = 0, ρ_cv = 0.2, ρ_cs = 1, [Ca2+]ext = 2.5 mM, and T = 37°C.
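As a quick numerical check, the Nernst expression for V_Ca can be evaluated at Model 1's conditions ([Ca2+]ext = 2.5 mM, T = 37°C). A minimal sketch; the cytosolic level of 0.5 µM used below is an illustrative assumption, not a value quoted in the text:

```python
import math

def nernst_ca(ca_ext_mM, ca_in_uM, temp_C):
    """Nernst potential (in mV) for a divalent ion (z = 2)."""
    R = 8.314      # gas constant, J / (mol K)
    F = 96485.0    # Faraday constant, C / mol
    T = temp_C + 273.15
    ca_ext = ca_ext_mM * 1e-3   # convert to molar
    ca_in = ca_in_uM * 1e-6
    return 1000.0 * (R * T) / (2 * F) * math.log(ca_ext / ca_in)

# Model 1 conditions: [Ca2+]ext = 2.5 mM, T = 37 C; 0.5 uM is an assumed
# resting cytosolic level
v_ca = nernst_ca(2.5, 0.5, 37.0)
```

With these numbers V_Ca is on the order of 110 mV, and it falls as [Ca2+]i rises during a burst, shrinking the driving force on the Ca2+ currents.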
Appendix II

Model 2: This model is updated from my 1990 model (Chay 1990b), where I_L of the 1990 model is replaced by I_VICC and I_kb, the inward rectifying time-independent K+ current. Thus, the resting potential in Model 2 is controlled by the two currents, I_VICC and I_kb. There are altogether five currents in this model, and they take the following mathematical expressions:

I_fast = g_fast m∞³ h (V − V_Ca)
I_slow = g_slow d f (V − V_Ca)
I_VICC = g_VICC p (V − V_Ca)
I_K = g_K n⁴ (V − V_K)
I_kb = g_kb {1 − exp[(V − V_kb)/S_kb]} / {1 + exp[(V − V_kb)/S_kb]}

where the gating variable f is inactivated by [Ca2+]i. It is assumed that the inactivation gating process of I_slow takes place very fast, and this assumption leads to the following form for f:

f = K_Ca / (K_Ca + [Ca2+]i)
Teresa Ree Chay
where K_Ca is the dissociation constant of Ca2+ from the receptor site. If the calcium channel cluster (CCC) contains both L- and N-type channels (hypothesis 1), I_VKC in equation 2.2 is equated to I_fast + I_slow. If the CCC contains only L-type channels (hypothesis 2), I_VKC is equated to I_slow. If I_fast is mainly carried by the Na+ current, its reversal potential is replaced by V_Na (= 60 mV). This substitution affects the results little. As in Model 1, V_Ca is not constant but takes a Nernst form:
V_Ca = (RT/2F) ln([Ca2+]ext / [Ca2+]i)
The assumption that the VICC communicates directly with the intracellular store leads to the following form for V_VICC:

V_VICC = (RT/2F) ln([Ca2+]ext / [Ca2+]cs)
This assumption is an oversimplification of the currently accepted model (Randriamampita and Tsien 1993), where the communication between the store and the Ca2+ influx current is made via the CIF (Ca2+ influx factor). However, we note that the assumption of the direct communication can be removed without changing our results much (e.g., see Model 1). The basic parameter values in the model are as follows: C_m = 1 µF cm⁻², g_fast = 700 µS cm⁻², g_slow = 20 µS cm⁻², g_VICC = 4 µS cm⁻², g_K = 250 µS cm⁻², g_kb = 400 nA cm⁻², V_K = −80 mV, V_kb = −40 mV, V_m = −12 mV, S_m = 5 mV, V_h = −40 mV, S_h = −6 mV, V_d = −30 mV, S_d = 10 mV, V_p = 15 mV, S_p = 15 mV, λ_d = 2 sec⁻¹, λ_h = 10 sec⁻¹, λ_n = 10 sec⁻¹, σ_h = σ_d = 0.5, σ_n = 0, K_Ca = 0.5 µM, K_VICC = 70 µM, K_cs = 3.0 µM, k_cs = 4.5 sec⁻¹, k̄_csp = 150 sec⁻¹, k_out = 2.8 sec⁻¹, φ₁ = 0.01, φ₂ = 0, φ_cs = 0.01, ρ_cv = 1, ρ_cs = 1, [Ca2+]ext = 2.5 mM, and T = 20°C.
Appendix III

Equation 2.3 in the text can be eliminated by introducing a rapid equilibrium assumption for this process, i.e., d[Ca2+]i/dt = 0. This leads to an algebraic equation for [Ca2+]i, and one can use an iteration method to find an accurate value of [Ca2+]i. In the same equation, ρ_cv measures the content of free calcium in the microdomain, and it can be expressed by the following equation (see my 1990b paper):

ρ_cv = d[Ca2+]i / d[Ca2+]T = 1 / {1 + [B] K_B / (K_B + [Ca2+]i)²}
where [Ca2+]T is the total calcium concentration in the microdomain, [B] is the Ca2+-binding buffer concentration, and K_B is the dissociation constant. In deriving the second equality, we assumed that the buffer can bind only one Ca2+ ion. Thus, ρ_cv takes a value that is always less than unity. In a similar manner, ρ_cs in equation 2.4 measures the content of free luminal calcium in the calcium store, where [B] is the Ca2+ buffer concentration in the store.
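The single-site buffering relation above is easy to verify numerically: writing [Ca2+]T = [Ca2+] + [B][Ca2+]/(K_B + [Ca2+]) and differentiating gives the ρ factor directly. A small sketch with illustrative values:

```python
def rho_free(ca, b_total, K_B):
    """rho = d[Ca]_free / d[Ca]_total for a one-site buffer.
    Since [Ca]_total = ca + b_total*ca/(K_B + ca),
    d[Ca]_total/d[Ca]_free = 1 + b_total*K_B/(K_B + ca)**2."""
    return 1.0 / (1.0 + b_total * K_B / (K_B + ca) ** 2)
```

Consistent with the statement above, ρ stays below one and approaches one only when the buffer saturates ([Ca2+] much larger than K_B).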
Acknowledgments This work was supported by National Science Foundation MCB-9411244 and in part by a grant from the Pittsburgh Supercomputing Center through the NIH Division of Research Resources cooperative agreement U41 RR0415.
References

Bokvist, K., Eliasson, L., Ammala, C., Renstrom, E., and Rorsman, P. 1995. Co-localization of L-type Ca2+ channels and insulin-containing secretory granules and its significance for the initiation of exocytosis in mouse pancreatic B-cells. EMBO J. 14, 50-57.
Canavier, C. C., Clark, J. W., and Byrne, J. H. 1991. Simulation of the bursting activity of neuron R15 in Aplysia: Role of ionic currents, calcium balance and modulatory transmitters. J. Neurophysiol. 66, 2107-2124.
Chatton, J.-Y., Liu, H., and Stucki, J. W. 1995. Simultaneous measurement of Ca2+ in the intracellular stores and the cytosol of hepatocytes during hormone-induced Ca2+ oscillations. FEBS Lett. 368, 165-168.
Chay, T. R. 1983. Eyring rate theory in excitable membranes: Application to neuronal oscillations. J. Phys. Chem. 87, 2935-2940.
Chay, T. R. 1985a. Chaos in a three-variable excitable cell model. Physica 16D, 233-242.
Chay, T. R. 1985b. Glucose response to bursting-spiking pancreatic β-cells by a barrier kinetic model. Biol. Cybern. 52, 339-349.
Chay, T. R. 1986. On the effect of intracellular calcium-sensitive K+ channel in the bursting pancreatic β-cell. Biophys. J. 50, 765-777.
Chay, T. R. 1987. The effect of inactivation of calcium channels by intracellular Ca2+ ions in the bursting pancreatic β-cells. Cell Biophys. 11, 77-90.
Chay, T. R. 1990a. Bursting excitable cell models by inactivation of Ca2+ currents. J. Theor. Biol. 142, 305-315.
Chay, T. R. 1990b. Electrical bursting and intracellular Ca2+ oscillations in excitable cell models. Biol. Cybern. 63, 15-23.
Chay, T. R. 1991. Intracellular Ca2+ oscillation and electrical bursting by the membrane ion channels and cellular endoplasmic reticulum in insulin secreting pancreatic β-cells. Biophys. J. 59, 14a.
Chay, T. R. 1993a. The mechanism of intracellular Ca2+ oscillation and electrical bursting in pancreatic β-cells. Adv. Biophys. 29, 75-103.
Chay, T. R. 1993b. Modelling for nonlinear dynamical processes in biology. In Patterns, Information and Chaos in Neuronal Systems, B. J. West, ed., pp. 73-122. World Scientific Publishing, River Edge, NJ.
Chay, T. R. 1995a. Electrical bursting and Ca2+ oscillations in Aplysia neurons: The role of two types of Ca2+ channels and cellular Ca2+ stores. Neuroscience 21, 23.
Chay, T. R. 1995b. Bursting, spiking, and chaos in an excitable cell model: The role of an intracellular calcium store. Proc. NOLTA '95 2, 1049-1052.
Chay, T. R., and Cook, D. L. 1988. Endogenous bursting patterns in excitable cells. Math. Biosci. 90, 139-153.
Chay, T. R., and Fan, Y. S. 1993. Evolution of periodic states and chaos in two types of neuronal model. In Chaos in Biology and Medicine, W. L. Ditto, ed., Proc. SPIE 2036, 100-114.
Chay, T. R., and Kang, H. S. 1987. Multiple oscillatory states and chaos in the endogenous activity of excitable cells. In Chaos in Biological Systems, H. Degn, A. V. Holden, and L. F. Olsen, eds., pp. 173-181. Plenum, New York.
Chay, T. R., and Lee, Y. S. 1990. Bursting, beating, and chaos by two functionally distinct inward current inactivations in excitable cells. Ann. N.Y. Acad. Sci. 591, 328-350.
Chay, T. R., Lee, Y. S., and Fan, Y. S. 1995. Bifurcations, chaos, and universality in biology. Int. J. Bifurc. Chaos 5, 595-635.
Cuthbertson, K. S. R., and Chay, T. R. 1991. Modelling receptor-controlled intracellular calcium oscillators. Cell Calcium 12, 97-109.
Dean, P. M., and Mathews, E. K. 1970. Glucose-induced electrical activity in pancreatic islet cells. J. Physiol. 210, 255-264.
Destexhe, A., Contreras, D., Sejnowski, T. J., and Steriade, M. 1994. A model of spindle rhythmicity in the isolated thalamic reticular nucleus. J. Neurophysiol. 72, 803-818.
Fabiato, A. 1983. Calcium-induced release of calcium from the cardiac sarcoplasmic reticulum. Am. J. Physiol. 245, C1-C14.
Feigenbaum, M. J. 1983. Universality in nonlinear systems. Physica 7D, 16-39.
Friel, D. D., and Tsien, R. W. 1992. Phase-dependent contribution from Ca2+ entry and Ca2+ release to caffeine-induced [Ca2+]i oscillations in bullfrog sympathetic neurons. Neuron 8, 1109-1125.
Friel, D. D., and Tsien, R. W. 1994. An FCCP-sensitive Ca2+ store in bullfrog sympathetic neurons and its participation in stimulus-evoked changes in [Ca2+]i. J. Neurosci. 14(7), 4007-4024.
Goldbeter, A., Dupont, G., and Berridge, M. J. 1990. Minimal model for signal-induced Ca2+ oscillations and for their frequency encoding through protein phosphorylation. Proc. Natl. Acad. Sci. U.S.A. 87, 1461-1465.
Gorman, A. L. F., Hermann, A., and Thomas, M. V. 1982. Ionic requirements for membrane oscillations and their dependence on the calcium concentration in a molluscan pace-maker neurone. J. Physiol. 327, 185-217.
Grebogi, C., and Ott, E. 1983. Crises, sudden changes in chaotic attractors, and transient chaos. Physica 7D, 181-200.
Hayashi, H., and Ishizuka, S. 1992. Chaotic nature of bursting discharges in the Onchidium pacemaker neuron. J. Theor. Biol. 156, 269-291.
Hodgkin, A. L., and Huxley, A. F. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London) 117, 500-544.
Hoth, M., and Penner, R. 1992. Depletion of intracellular calcium stores activates a calcium current in mast cells. Nature (London) 355, 353.
Junge, D., and Stephens, C. L. 1973. Cyclic variation of potassium conductance in a burst-generating neurone in Aplysia. J. Physiol. 235, 155-181.
Kramer, R. H., and Zucker, R. S. 1985. Calcium-induced inactivation of calcium current causes the interburst hyperpolarization of Aplysia bursting pacemaker neurones. J. Physiol. 363, 131-160.
Kuba, K., and Nishi, S. 1976. Rhythmic hyperpolarizations and depolarization of sympathetic ganglion cells induced by caffeine. J. Neurophysiol. 39, 547-563.
Kuba, K., and Takeshita, S. 1981. Simulation of intracellular Ca2+ oscillation in a sympathetic neurone. J. Theor. Biol. 93, 1009-1031.
Levitan, E. S., and Levitan, I. B. 1988. Serotonin acting via cyclic AMP enhances both the hyperpolarizing and depolarizing phases of bursting pacemaker activity in Aplysia neuron R15. J. Neurosci. 8(4), 1152-1161.
Lipscombe, D., Madison, D. V., Poenie, M., Reuter, H., Tsien, R. Y., and Tsien, R. W. 1988. Spatial distribution of calcium channels and cytosolic calcium transients in growth cones and cell bodies of sympathetic neurons. Proc. Natl. Acad. Sci. U.S.A. 85, 2398-2402.
Llano, I., Dreessen, J., Kano, M., and Konnerth, A. 1991. Intradendritic release of calcium induced by glutamate in cerebellar Purkinje cells. Neuron 7, 577-583.
McCormick, D. A., and Huguenard, J. R. 1992. A model of the electrophysiological properties of thalamocortical relay neurons. J. Neurophysiol. 68, 1384-1400.
Meyer, T., and Stryer, L. 1988. Molecular model for receptor-stimulated calcium spiking. Proc. Natl. Acad. Sci. U.S.A. 85, 5051-5055.
Nowycky, M. C., Fox, A. P., and Tsien, R. W. 1985.
Three types of neuronal calcium channel with different calcium agonist sensitivity. Nature (London) 316, 440-443.
Ozaki, H., Stevens, R. J., Blondfield, D. P., Publicover, N. G., and Sanders, K. M. 1991. Simultaneous measurement of membrane potential, cytosolic Ca2+, and tension in intact smooth muscle. Am. J. Physiol. 260 (Cell Physiol. 29), C917-C925.
Parekh, A. B., Terlau, H., and Stuhmer, W. 1993. Depletion of InsP3 stores activates a Ca2+ and K+ current by means of a phosphatase and a diffusible messenger. Nature (London) 364, 814-818.
Plant, R. E. 1981. Bifurcation and resonance in a model for bursting nerve cells. J. Math. Biol. 11, 15-32.
Putney, J. W. 1993. Excitement about calcium signaling in inexcitable cells. Science 262, 676-678.
Randriamampita, C., and Tsien, R. Y. 1993. Emptying of intracellular Ca2+ stores releases a novel small messenger that stimulates Ca2+ influx. Nature (London) 364, 809-814.
Rinzel, J., and Lee, Y. S. 1987. Dissection of a model for neuronal parabolic bursting. J. Math. Biol. 25, 653-675.
Scholz, K. P., Cleary, L. J., and Byrne, J. 1989. Inositol 1,4,5-trisphosphate alters bursting pacemaker activity in Aplysia neurons: Voltage-clamp analysis of effects on calcium current. J. Neurophysiol. 60, 86-104.
Somogyi, R., and Stucki, J. W. 1991. Hormone-induced calcium oscillations in liver cells can be explained by a simple one pool model. J. Biol. Chem. 266, 11068-11077.
Strumwasser, F. 1968. Membrane and intracellular mechanism governing endogenous activity in neurons. In Physiological and Biochemical Aspects of Nervous Integration, F. D. Carlson, ed., pp. 329-341. Prentice Hall, Englewood Cliffs, NJ.
Traub, R. D., Wong, R. K. S., Miles, R., and Michelson, H. 1991. A model of a CA3 hippocampal pyramidal neuron incorporating voltage-clamp data on intrinsic conductances. J. Neurophysiol. 66, 635-650.
Valdeolmillos, M., Santos, R. M., Contreras, D., Soria, B., and Rosario, L. M. 1989. Glucose-induced oscillations of intracellular Ca2+ concentration resembling bursting electrical activity in single mouse islets of Langerhans. FEBS Lett. 259, 19-23.
Wier, W. G., Lopez-Lopez, J. R., Shacklock, P. S., and Balke, C. W. 1995. Calcium signalling in cardiac muscle cells. In Calcium Waves, Gradients and Oscillations, Ciba Foundation Symposium 188, G. R. Bock and K. Ackrill, eds., pp. 146-164. John Wiley, New York.
Yanagihara, K., Noma, A., and Irisawa, H. 1980. Reconstruction of sino-atrial node pacemaker potential based on the voltage clamp experiments. Jpn. J. Physiol. 30, 841-857.
Received August 28, 1995; accepted November 21, 1995.
Communicated by Laurence Abbott
Type I Membranes, Phase Resetting Curves, and Synchrony

Bard Ermentrout
Department of Mathematics, University of Pittsburgh, Pittsburgh, PA 15260 USA
Type I membrane oscillators such as the Connor model (Connor et al. 1977) and the Morris-Lecar model (Morris and Lecar 1981) admit very low frequency oscillations near the critical applied current. Hansel et al. (1995) have numerically shown that synchrony is difficult to achieve with these models and that the phase resetting curve is strictly positive. We use singular perturbation methods and averaging to show that this is a general property of Type I membrane models. We show in a limited sense that so-called Type II resetting occurs with models that obtain rhythmicity via a Hopf bifurcation. We also show the differences between synapses that act rapidly and those that act slowly and derive a canonical form for the phase interactions.

1 Introduction
The behavior of coupled neural oscillators has been the subject of a great deal of recent interest. In general, this behavior is quite difficult to analyze. Most of the results to date are primarily based on simulations of specific models. One of the main questions that is asked is whether two identical oscillators will synchronize if they are coupled or whether they will undergo other types of behavior. In a recent paper, van Vreeswijk et al. (1995) show that the timing of synapses is very important in determining whether, say, excitatory synaptic interactions will lead to synchronous behavior. Hansel et al. (1995) contrast the behavior of weakly coupled neural oscillators for different membrane models. They find substantial differences between standard Hodgkin-Huxley models and Connor models (Connor et al. 1977), which have the additional A current. One easily measurable property of a neural oscillator (either an experimental system or a simulated one) is its phase resetting curve. The phase resetting curve or PRC is found by perturbing the oscillation with a brief depolarizing stimulus at different times in its cycle and measuring the resulting phase-shift from the unperturbed system. By making the perturbation infinitesimally small (in duration and amplitude), it is possible (at least for the simulated system) to derive what is called in Kuramoto (1984) and Hansel et al. (1993, 1995) the response function of the neural oscillator. Thus, Hansel and collaborators show that the response function or infinitesimal PRC for the Connor model is very different from that of

Neural Computation 8, 979-1001 (1996) © 1996 Massachusetts Institute of Technology
the Hodgkin-Huxley model. In particular, they show that perturbations of the Connor model can never delay the onset of a spike, only advance it. That is, the phase resetting curve for the Connor model is nonnegative. In contrast, the PRC of the Hodgkin-Huxley model has both negative and positive regions. They refer to models with strictly positive phase resetting curves as "Type I" and those for which the phase resetting curve has a negative regime as "Type II." The differences in the response of the oscillators to brief stimuli have profound consequences for coupled cells. In particular, Hansel et al. (1995) show that excitatory synapses cannot lead to synchronization for the Connor model. Their result is quite general, in that they explore the consequences of Type I phase resetting on coupling without reference to a particular model. In particular, they show that unless the synapses are very fast, synchrony for excitatory coupling is not possible. Their results are similar to those obtained by van Vreeswijk et al. for integrate and fire cells and the Hodgkin-Huxley model. The goal of this paper is to demonstrate that the differences between Hodgkin-Huxley type oscillators and the Connor model can be accounted for by looking at the mechanism by which the membranes go from resting to repetitive firing as current is injected. We use a singular perturbation method to derive the response of the membrane to weak inputs such as brief pulses of current and synaptic drive. From these calculations, we derive a "canonical" form for both the phase resetting curve and the phase interaction function for coupled membrane oscillators that are similar to the Connor model. We use this to compute the stability of synchrony and antiphase activity as a function of the temporal properties of the synapses. In Rinzel and Ermentrout (1989) we review the classification of excitable membranes by Hodgkin (1948) in terms of their dynamics as a current is injected.
There are two main types of excitable axons: Type I and Type II. (Henceforth, to avoid confusion between the classification of membrane excitability and phase resetting curves, we will always say "Type I PRC" when referring to phase resetting curves and "Type I membrane" or "Type I excitability" when referring to dynamics of the membrane.) Type I membranes are characterized mainly by the appearance of oscillations with arbitrarily low frequency as current is injected, whereas for Type II membranes, the onset of repetitive firing is at a nonzero frequency. The Connor model is an example of Type I excitability. In Figure 1a, we show the frequency as a function of injected current for the Connor model and, for contrast, the current-frequency response for the Hodgkin-Huxley model (a Type II membrane) is shown in Figure 1b. The repetitive activity first occurs at a nonzero frequency for the Hodgkin-Huxley model (that is, the minimum firing rate is greater than zero). The minimal firing rate of the Connor model is zero. The difference between these two models arises in the mechanism by which repetitive firing ensues. In "Type II" membranes, like the
Hodgkin-Huxley, the following occurs: For low currents, there is a single equilibrium state and it is asymptotically stable. As the current increases, this state loses stability via a (subcritical) Hopf bifurcation and repetitive firing ensues. By contrast, in "Type I" membranes such as the Connor model, there are three equilibria for currents below the critical current: a low voltage one that is stable (E), a high voltage one that is unstable, and an intermediate voltage equilibrium that is an unstable saddle point (S). The saddle point plays a pivotal role in the onset of repetitive firing. It has one positive eigenvalue and the remaining eigenvalues have negative real parts. There is a pair of trajectories that leave the saddle point (the "unstable manifold") and enter the stable fixed point. Together these two trajectories form a loop in the phase-space. This is illustrated schematically in Figure 2. In the phase-space for the equations (six dimensional for the Connor model) there is a closed loop that contains two fixed points: the stable rest point and the unstable saddle point. As the current increases, these two fixed points merge and disappear, leaving a stable periodic solution: the repetitive firing. To shed light on the phase resetting function and the behavior of coupled neurons, we will concentrate on parameter values near the critical current for which the two rest states coalesce. The reason for this restriction is that we can explicitly work out the complete nonlinear behavior near criticality. Numerical calculations near criticality are quite difficult for the Connor model due to its high dimension. Furthermore, near the critical current, the Connor model has 5 rather than 3 fixed points (only one of which is stable) and this appears to complicate the application of the present analysis to that model.
A simpler model that behaves in much the same way as either the Hodgkin-Huxley equations or the Connor model (depending on the chosen parameters) is the Morris-Lecar model (Morris and Lecar 1981; Rinzel and Ermentrout 1989). The dimensionless equations are

dV/dt = I − gL(V − VL) − gK w (V − VK) − gCa m∞(V)(V − VCa)
dw/dt = λ(V)[w∞(V) − w]

where

m∞(V) = 0.5{1 + tanh[(V − V1)/V2]}
w∞(V) = 0.5{1 + tanh[(V − V3)/V4]}
λ(V) = (1/3) cosh[(V − V3)/(2V4)]

(The values of the parameters are found in the Appendix.) We will use this simpler model to illustrate the asymptotics. In the "Type I" excitability regime, as I varies from a low to a high value there is a saddle-node bifurcation on the circle leading to sustained slow oscillations.
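The Type I regime can be illustrated by direct integration. Since the paper's Appendix is not reproduced here, the sketch below uses the dimensional Type I parameter set from Rinzel and Ermentrout (1989) as an assumed stand-in (gCa = 4, gK = 8, gL = 2 mS/cm², VCa = 120, VK = −84, VL = −60 mV, V1 = −1.2, V2 = 18, V3 = 12, V4 = 17.4 mV, φ = 1/15, C = 20 µF/cm²); with these values the rest state disappears near I ≈ 40 µA/cm²:

```python
import math

def morris_lecar(I, T=3000.0, dt=0.05):
    """Euler-integrate the Morris-Lecar model (Type I parameter set of
    Rinzel & Ermentrout 1989, an assumed stand-in for the paper's
    Appendix values) and return the voltage trace in mV."""
    C, gL, gCa, gK = 20.0, 2.0, 4.0, 8.0
    VL, VCa, VK = -60.0, 120.0, -84.0
    V1, V2, V3, V4, phi = -1.2, 18.0, 12.0, 17.4, 1.0 / 15.0
    V, w = -60.0, 0.0
    trace = []
    for _ in range(int(T / dt)):
        minf = 0.5 * (1 + math.tanh((V - V1) / V2))
        winf = 0.5 * (1 + math.tanh((V - V3) / V4))
        lam = phi * math.cosh((V - V3) / (2 * V4))
        dV = (I - gL * (V - VL) - gCa * minf * (V - VCa) - gK * w * (V - VK)) / C
        dw = lam * (winf - w)
        V += dt * dV
        w += dt * dw
        trace.append(V)
    return trace

quiet = morris_lecar(30.0)   # below the critical current: settles to rest
firing = morris_lecar(50.0)  # above it: slow repetitive spiking
```

Below the saddle-node the trace settles to the stable node; just above it, the model fires repetitively at low frequency, the Type I signature discussed above.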
Figure 1: Frequency as a function of injected current for two different membrane models: (a) Connor and (b) Hodgkin-Huxley. (Note that a formula for the frequency as derived from asymptotics in Section 2 is also shown for the Connor model.)
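The square-root frequency law shown in Figure 1a can be checked numerically on the reduced saddle-node equation derived in Section 2, dz/dτ = η + qz², whose blow-up (spike) period should be π/√(ηq). The values of η below are illustrative stand-ins for the distance I − I* above the critical current:

```python
import math

def blowup_period(eta, q=1.0, dt=1e-5, z_max=1e6):
    """Integrate dz/dtau = eta + q*z**2 from z = 0 until z exceeds z_max.
    By symmetry of the trajectory (which runs from -infinity to +infinity),
    twice the escape time approximates the period pi / sqrt(eta*q)."""
    z, t = 0.0, 0.0
    while z < z_max:
        z += dt * (eta + q * z * z)
        t += dt
    return 2.0 * t

periods = {eta: blowup_period(eta) for eta in (0.25, 1.0, 4.0)}
```

Quadrupling η, i.e., moving four times farther above the critical current, halves the period: the square-root law fitted to the Connor data in Figure 1a.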
As we mentioned at the outset of this paper, one of our goals is to characterize the response of neural oscillators to stimuli. In particular, like Hansel et al. (1995), we will examine the phase resetting curve for different types of membrane oscillators. In Figure 3a and b, we show the infinitesimal PRC or response function for the Connor model and for the
Figure 2: Saddle-node bifurcation on an invariant circle as the applied current I varies. (a) For I < I* there is a unique asymptotically stable fixed point and a saddle point with a one-dimensional unstable manifold whose branches form a closed loop. (b) At criticality, I = I*, the node and the saddle coalesce at the point X̄ forming a simple loop, X0(t). (Here, X is the vector of coordinates for the single membrane oscillator.) (c) For I > I* all that remains is a stable limit cycle.
Hodgkin-Huxley model. As noted by Hansel et al. (1995), they are quite different. The Connor model's is strictly positive and as a consequence brief stimuli can only advance the oscillator. There is a large negative region for the Hodgkin-Huxley model that implies that it is possible to delay the firing of an action potential by an appropriately timed stimulus. Figure 3c shows the same functions for the Morris-Lecar equations in the "Type I" and "Type II" excitability regimes. This suggests that the differences lie not in time constants of various currents, but rather in the mechanisms leading to repetitive firing. The PRC of Type I membranes and their behavior with weak coupling is in a sense universal. That is, near the critical currents all of these models have a similar nature if the coupling is small. Ultimately,
Figure 3: Response functions for membrane oscillators: (a) Connor model. (b) Hodgkin-Huxley model. Continued next page.
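The shape of a Type I response function like that in Figure 3a can be reproduced by direct perturbation of the simplest Type I caricature, the quadratic integrate-and-fire model V' = I + V², an illustrative stand-in rather than any of the models plotted here. Anticipating the calculation of Section 2, the spike-time advances should be nonnegative and proportional to 1 − cos(2πt/T):

```python
import math

def qif_spike_time(I=1.0, v0=-100.0, v_th=100.0, dt=1e-4, kick_at=None, kick=0.0):
    """Time for V' = I + V**2 to run from reset v0 to threshold v_th,
    with an optional instantaneous voltage kick applied at time kick_at."""
    v, t = v0, 0.0
    while v < v_th:
        if kick_at is not None and kick_at <= t < kick_at + dt:
            v += kick   # brief depolarizing stimulus
        v += dt * (I + v * v)
        t += dt
    return t

T0 = qif_spike_time()                       # unperturbed period
phases = [0.1, 0.25, 0.5, 0.75, 0.9]
advances = [T0 - qif_spike_time(kick_at=p * T0, kick=0.01) for p in phases]
```

Every advance is nonnegative (a Type I PRC), largest mid-cycle and vanishing at the spike, i.e., the K(1 + cosθ) shape with the spike at θ = ±π.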
we have the rather pleasant result that ”Type I membranes have Type I phase resetting curves.” Thus, the remainder of this paper is devoted to showing why this is true and what the consequences are for synaptic coupling of such oscillators.
Figure 3: (c) Morris-Lecar model in two different regimes. [Note: In (c) Z(t) and t have been scaled to the same ranges.]

2 The Solution and Response Function Near the Bifurcation
The membrane potential for a typical synaptically coupled membrane model satisfies

C dV/dt = −I_ionic + I + ḡ_syn s(t)(V_syn − V)    (2.1)
where I_ionic denotes the ionic currents, I is the applied current, and ḡ_syn is the maximal synaptic conductance. The function s(t) is the fraction of open channels due to the firing of a presynaptic neuron. There are numerous ways to model the synaptic conductance. It can satisfy a system of differential equations based on the presynaptic potential (as in Destexhe et al. 1994) or more simply be an "alpha function," s(t) = α²t e^(−αt). In the former case, if V_pre(t) is the potential of the presynaptic cell, then s(t) satisfies
ds/dt = α k(V_pre)(1 − s) − βs

where α, β are constants and k is a saturating threshold function, such as k(v) = 1/{1 + exp[−(v − v_th)/v_s]}. For v_s small, this is like a Heaviside step function. Recall that for Type I excitable membranes, the onset of repetitive firing occurs when a saddle point and a stable node coalesce (cf. Fig. 2).
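Before continuing, the two synaptic time courses just described can be sketched numerically: the alpha function s(t) = α²t e^(−αt), which peaks at t = 1/α, and a forward-Euler step of the kinetic model ds/dt = αk(V_pre)(1 − s) − βs. The rate constants and sigmoid parameters below are illustrative assumptions:

```python
import math

def alpha_fn(t, a):
    """Alpha-function synaptic time course: s(t) = a**2 * t * exp(-a*t)."""
    return a * a * t * math.exp(-a * t)

def gate_step(s, v_pre, a, b, dt, v_th=0.0, v_s=1.0):
    """One Euler step of ds/dt = a*k(v_pre)*(1-s) - b*s, with the
    saturating threshold k(v) = 1/(1 + exp(-(v - v_th)/v_s))."""
    k = 1.0 / (1.0 + math.exp(-(v_pre - v_th) / v_s))
    return s + dt * (a * k * (1.0 - s) - b * s)

# with the presynaptic cell held depolarized (k ~ 1), s saturates at a/(a+b)
s = 0.0
for _ in range(20000):
    s = gate_step(s, 50.0, a=2.0, b=1.0, dt=0.01)
```

The alpha function peaks at 1/α with value α/e, while the kinetic gate saturates at αk/(αk + β); a small decay rate β prolongs the synaptic conductance, the slow-synapse case contrasted later in the paper.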
At this critical current, there is a trajectory leaving the saddle-node point from one side and entering it from the other. Let X ≡ (V, m, h, ...) denote the vector of variables for the single membrane oscillator. Let I* be the critical value for the current. Let X̄ be the saddle-node point and let X0(t) denote the closed "infinite period" trajectory containing X̄. That is, X0(t) → X̄ as t → ±∞.
2.1 The Periodic Solution. Before we add synaptic currents, it is easier to first examine a constant current scaled by a small positive parameter, ε². The equation for the voltage is

C dV/dt = −I_ionic + I* + ε²i

If i > 0 then the phase space will be as in Figure 2c (since I = I* + ε²i). If i < 0 then there will be a saddle-node pair as in Figure 2a. More generally, consider the equations:

dX/dt = F0(X) + ε²F1(X)    (2.3)
We assume that when ε = 0 there is a saddle-node bifurcation at the point X = X̄. We write the Taylor expansion for F0(X) around this point:

F0(X) = A(X − X̄) + Q(X − X̄, X − X̄) + ...    (2.4)

where A is the Jacobian matrix of F0 evaluated at X̄ and Q is the quadratic term of the Taylor series. By assumption (X = X̄ is a saddle-node point) A is a singular matrix. That is, it has a zero eigenvalue. We will assume (generically) that this zero is simple. Let e be the unit eigenvector of A having zero eigenvalue and let f be the eigenvector of Aᵀ satisfying fᵀe = 1. Let X(t) = X̄ + εze + ..., where z is a scalar quantity that depends on time. In Ermentrout and Kopell (1986) we show that the dynamics of equation 2.3 are governed by those of z, which are

dz/dτ = η + qz²    (2.5)

where

η = f·F1(X̄)

and

q = f·Q(e, e)

Both of these quantities are easily computed for membrane models. In particular, η is directly proportional to i while q depends on the details of the membrane model used. If η and q have opposite signs then equation 2.5 has a pair of fixed points, one stable and one unstable, corresponding to fixed points of equations 2.3 or 2.1. On the other hand, if
η and q have the same sign, say, positive, then the lowest order solution to equation 2.5 is

z(τ) = √(η/q) tan[√(ηq)(τ − c)]

where c is an arbitrary constant. Notice that z is "periodic" but that it "blows up" since the tangent function has singularities at odd multiples of π/2. In Ermentrout and Kopell (1986), we show that "blowing up" of this reduced system corresponds to producing a spike for the full system, equation 2.3. Thus, the period of the full equations is

T = π / (ε√(ηq))

For the membrane model, since I = I* + ε²i, the period of the membrane oscillator is

T = C / √(I − I*)

where C is a constant that depends on the details of the model. This short calculation shows a general property of Type I membranes. The frequency is proportional to the square root of the difference between the applied current and the critical current. [Note: In the paper by Connor et al. (1977), the authors attempt to fit the frequency with a straight line. It is much better fitted with a square-root function, e.g., 17.55√(I − I*). See Fig. 1a.] The "blow-up" of a solution is an indication that equation 2.3 is singularly perturbed as ε tends to zero. Thus, in general, one attempts to "match" the solution of equations 2.3 to 2.5 when ε → 0. This matching is fairly straightforward so we do not perform the details here. For later use, the solution in terms of normal time is given by equation 2.6.
For the membrane model, η = ci where c is a model-dependent constant. Note that as ε√(ηq)t → ±π/2 the solution, equation 2.6, is well defined as we have subtracted the singularity of the tangent function away. (Recall from calculus that

lim_{x→π/2⁻} [tan(x) − 1/(π/2 − x)] = 0

so X(t) is defined and periodic for ε√(ηq)t ∈ [−π/2, π/2].)

2.2 A Convenient Change of Variables and the Response Function. Equation 2.6 gives the full behavior of a single membrane oscillator when perturbed into the regime of repetitive firing. Similarly, the "firing time"
of the membrane oscillator is found from the solution to equation 2.5. To study the effects of coupling and the phase resetting curve for the system, it is convenient to change to a phase equation. In the same manner as Ermentrout and Kopell (1986) we rescale time as τ = εt and let

z = tan(θ/2)

Then equation 2.5 becomes

dθ/dτ = q(1 − cosθ) + η(1 + cosθ)    (2.7)

This is a form of "phase equation," which describes the single oscillator in terms of an angle variable that lies between −π and π. Each time θ = π the full model fires a spike. The constant η is proportional to the applied current as well as any synaptic current. In particular, suppose a constant positive current is applied, η = ci, where c, i are both positive. (These constants depend on the details of the model.) Then equation 2.7 is easy to solve by quadrature:

θ(τ) = 2 arctan{√(η/q) tan[√(ηq)(τ − ζ)]}

where ζ is an arbitrary phase shift. The phase monotonically increases and covers one period in π/√(ηq) slow time units. (Recall, the slow time is τ = εt.) To compute the phase resetting curve, one just briefly increments the bias current, i. Hansel et al. (1995) compute the infinitesimal phase resetting curve that arises when the amplitude of the perturbation tends to zero. For the present model, since it is a scalar differential equation, it is easy to compute the infinitesimal PRC (see Appendix B for the calculation):

Δ(θ) = K(1 + cosθ)    (2.8)

where K is a constant related to the amplitude of the stimulus. The actual PRC is any phase shift of equation 2.8; here we have shifted it so that the spike occurs at θ = π. Summarizing, we have shown that every Type I membrane oscillator near the critical current has a nonnegative PRC that is close to equation 2.8 in shape. In particular, the infinitesimal phase resetting curve is nonnegative. In Figure 4a and b we depict the infinitesimal PRC for the Morris-Lecar model in the "Type I" excitability regime near criticality as well as that for the Connor model. Plotted along with these two curves is the formula (equation 2.8) with the appropriate choice of K and the appropriate phase shift. The PRC computed from equation 2.8 is nearly indistinguishable from that of the Morris-Lecar model. The Connor model is
Type I Membranes
989
qualitatively similar but not as close as the Morris-Lecar model. Similar calculations with other models, such as a variant of Hodgkin-Huxley described by Wang (X. J. Wang, personal communication), show a very close fit near criticality. We suspect that the reason that the asymptotics do not agree as well with the Connor model as they do with the Morris-Lecar lies in the numerical difficulties of computing the response function near criticality. One can decrement the time step for integration of the equations, and in some cases that decreases the error in the numerical computation, but in others it increases it. Finally, it is probable that the higher order corrections for the expansion of the response function for the Connor model are much larger than those of the Morris-Lecar. The point of this calculation is that we now have a precise expression for the response function Z(t) for any membrane model near the saddle-node bifurcation, namely, equation 2.8. We also have an expression for the membrane potential. We now proceed to the behavior of these oscillators when they are synaptically coupled.

3 Coupled Type I Membranes
The main purpose for computing the response function for an oscillator is so that we can study the behavior of such models when they are coupled together. Thus, we wish to explore the behavior of two "Type I" membrane oscillators coupled via synaptic conductances. The synaptic current is just another current in our reduction of the full system to the phase model. Thus, the behavior of each oscillator can be expressed in terms of its "phase" (cf. equation 2.7) and will obey

dθ_k/dτ = q(1 − cos θ_k) + (1 + cos θ_k)(η + I_syn,k)    (3.1)

where the synaptic current is

I_syn,k = g s_k'(τ)[V_syn − V(τ/ε)]    (3.2)

V(t) is the postsynaptic potential and from equation 2.6 satisfies

V(τ/ε) ≈ V̄ + ε V₁ tan(θ(τ)/2)    (3.3)

where V₁ is the V component of the eigenvector e. Near criticality, V(t) spends most of its time near the rest state, V̄. Therefore V(τ/ε) − V_syn ≈ V̄ − V_syn = −V_eff, where V_eff is positive for excitatory coupling and negative for inhibitory coupling. Thus, a pair of weakly synaptically coupled neurons near criticality satisfies

dθ_k/dτ = q(1 − cos θ_k) + (1 + cos θ_k)[η + g s_k'(τ) V_eff],   k = 1, 2,  k' = 2, 1    (3.4)
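Equation 3.4 is straightforward to explore numerically. The following sketch (all parameter values here — q, η, g, V_eff, and the synaptic rate a — are hypothetical choices, not taken from the text) integrates a pair of such phase models, kicking an alpha-function synapse each time the partner's phase crosses π:

```python
import numpy as np

def simulate(theta0, q=1.0, eta=0.25, g=0.5, veff=1.0, a=2.0, T=60.0, dt=1e-3):
    """Euler-integrate two equation-3.4 phase oscillators; each spike
    (theta crossing pi) triggers an alpha-function synapse on the partner."""
    th = np.array(theta0, float)
    s = np.zeros(2)       # synaptic gates s_k(tau)
    sp = np.zeros(2)      # their derivatives: s'' + 2a s' + a^2 s = 0
    spikes = [[], []]
    for step in range(int(T / dt)):
        syn = g * s[::-1] * veff   # cell k is driven by gate k'
        dth = q * (1 - np.cos(th)) + (1 + np.cos(th)) * (eta + syn)
        th_new = th + dt * dth
        for i in range(2):
            if th[i] < np.pi <= th_new[i]:   # spike of cell i
                spikes[i].append(step * dt)
                sp[i] += a**2                # impulse response is an alpha function
        th = np.where(th_new >= np.pi, th_new - 2 * np.pi, th_new)
        s, sp = s + dt * sp, sp + dt * (-2 * a * sp - a**2 * s)
    return spikes

# Uncoupled (g = 0) the interspike interval should be the period pi/sqrt(q*eta)
isi = np.diff(simulate([0.0, 0.0], g=0.0)[0])
```

With q = 1 and η = 0.25 the uncoupled interspike interval comes out near π/√(qη) ≈ 6.28, matching the quadrature solution of equation 2.7.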
Bard Ermentrout
990
Figure 4: Comparison of infinitesimal response functions with the formula (equation 2.8) for (a) the Morris-Lecar model when I = 0.0695 and (b) the Connor model when I = 8.5.

Note that this represents a new type of phase reduction model for weakly coupled nonlinear neural oscillators. The phases do not occur as differences and so the analysis is far more difficult. Furthermore, unlike phase difference models, these phase models can have equilibria corresponding to stable equilibria for the full equations (e.g., by choosing η < 0). In a sense, this is a generalization of the product coupling
models proposed by Winfree (1967) and analyzed by Ermentrout and Kopell (1991), which have the form:

dθ_k/dτ = ω_k + R(θ_k) P(θ_k')

Here R(θ) plays the role of the response function and P(θ) is the pulsatile synapse. That is, if the synapses are instantaneous functions of the presynaptic voltage (or phase) then this simple product model obtains. All that remains is to describe the dynamics of the synaptic conductances, s_k(τ). There are several ways in which we can do this. If we assume that the conductances obey some type of ordinary differential equation such as equation 2.2 that is determined by the presynaptic potential, then we must convert these to equations that depend on the phase variables, θ_k. An easier approach is to assume that the synaptic conductances are "alpha functions" of the form:

s(τ) = α²(τ − τ*) exp[−α(τ − τ*)],   τ > τ*

or the more general form

s(τ) = [αβ/(β − α)] {exp[−α(τ − τ*)] − exp[−β(τ − τ*)]},   τ > τ*

Here τ* is the time of the presynaptic spike. Thus, we must now relate the time of a spike to the phase, θ_k. Recall from Section 2 that "firing" is equivalent to the phase variable θ crossing π. Thus, an obvious strategy is to let τ* denote the time at which the presynaptic phase crosses π. However, we will generalize this to take into account certain facts about the timing of synapses. In particular, if the spike of an action potential is wide (as is the case in some relaxation-like models) then the time at which the threshold for a synaptic event is crossed and the time at which the presynaptic cell reaches its maximum can be quite different. Since the "maximum" potential corresponds to θ = π, we will define τ* to be the time at which θ = π − δ. Here δ is a parameter to account for the possibility that the synaptic conductance begins before the presynaptic voltage reaches its maximal value. We have now reduced a pair of synaptically coupled Type I membranes to a pair of phase models coupled through a synaptic conductance (equation 3.4). This type of coupling is difficult to analyze but has a number of modeling advantages over phase models that are derived by strict averaging. In particular, a phase difference model presumes spontaneous oscillation of all the coupled cells in the absence of coupling. In equation 3.4 if η/q < 0 then each uncoupled cell is incapable of spontaneously oscillating and instead has a rest state, cos θ̄ = [1 + η/q]/[1 − η/q]. If coupled excitatorily and if the synapses persist for sufficiently long, the coupled system can produce rhythmic output in addition to remaining at rest. This is an example of coupling-induced bistability. If both cells are started near rest, then they will return to rest. If one of them is excited
Figure 5: Two solutions to equation 3.4 showing return to rest (large cross) and spontaneous out-of-phase oscillation. Parameters are η = −0.125, q = 1, g = 3, δ = 0.15, V_eff = 1, α = 10.

past threshold, then that can cause the other cell to fire. If the synapse of cell 2 persists long enough for cell 1 to return from its refractory period (approximately the time it takes for cell 1 to go from −π back to close to its rest state) then this will cause cell 1 to fire again and start the process over again. Figure 5 shows a picture on the θ₁–θ₂ torus of solutions to equation 3.4 in which (1) both cells are started at the same value above threshold and (2) cell 2 starts at rest and cell 1 is above threshold. In the latter case, a spontaneous out-of-phase oscillation develops whereas, in the former case, both cells return to rest. Now suppose that both cells spontaneously oscillate, so that η/q > 0, and that both are identical. In this case, we can make a change of coordinates that reduces equation 3.4 to the equation

dθ_k/dτ = 1 + g s_k'(τ)(1 + cos θ_k) V_eff    (3.7)
We can now compute a firing map function in the manner of van Vreeswijk et al. (1995) for this coupling. That is, we want to find periodic phase-locked solutions to equation 3.7. We will derive a function that indicates the possible phase-locked solutions and their stability. This is possible because the synaptic conductances are determined solely by the times of the spikes of the presynaptic cell and not by the value of the presynaptic phase at any other time. Let P denote the period of the phase-locked solution. Let S(τ) denote the synaptic time course during one cycle of the periodic solution. Suppose that cell 1 fires at τ = 0 and cell 2 fires at τ = φP where φ ∈ [0,1) is the relative phase. Then, cell 1 satisfies:

dθ₁/dτ = 1 + g S(τ − φP)(1 + cos θ₁) V_eff    (3.8)

with θ₁(0) = π. Note that S(τ) is a known function due to the fact that once a presynaptic cell fires the time course of the synaptic response is completely determined. Call the solution to equation 3.8 θ(τ; φ). Since the period is defined as the value of τ at which θ₁ traverses 2π, we must have θ(P; φ) = 3π. (Recall that θ₁ starts at π.) This gives us an equation for P as a function of φ, P = Q(φ). A similar argument used for cell 2 shows that its period must satisfy P = Q(−φ). Since the periods for a phase-locked solution must be the same, we must have G(φ) = Q(φ) − Q(−φ) = 0, which determines the possible phase-locked solutions. Two immediate solutions to G(φ) = 0 are φ = 0, synchrony, and φ = 1/2, which is the antiphase solution. A necessary and sufficient condition for stability of a phase-locked solution is that Q′(φ) > 0 or, equivalently, since G is twice the odd part of Q, that G′(φ) > 0. To see this, suppose that cell 2 fires a bit later than it would in the locked state. Then the period of the first cell must lengthen in order to allow the second cell to catch up. That is, Q′(φ) > 0. For the integrate-and-fire model analyzed by van Vreeswijk et al. the function Q(φ) can be explicitly determined.
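A crude numerical version of this construction might look as follows. This is a sketch only: g, V_eff, and the synaptic rate a are hypothetical values, S is truncated to the three most recent presynaptic alpha-pulses rather than the full periodic time course, and Q(φ) is found by iterating the period to self-consistency, with Q(1 − φ) playing the role of Q(−φ):

```python
import numpy as np

def S(u, a=2.0):
    # alpha-function synaptic time course following a spike at u = 0
    return a * a * u * np.exp(-a * u) if u > 0 else 0.0

def Q(phi, g=0.1, veff=1.0, n_iter=8, nstep=2000):
    """Period of cell 1 (equation 3.8): integrate theta from pi to 3*pi
    with presynaptic spikes at (phi + k) P, iterating P to consistency."""
    P = 2 * np.pi                       # initial guess for the locked period
    for _ in range(n_iter):
        def rhs(th, t):
            drive = sum(S(t - (phi + k) * P) for k in (0, -1, -2))
            return 1.0 + g * drive * (1.0 + np.cos(th)) * veff
        dt = P / nstep
        th, t = np.pi, 0.0
        while th < 3 * np.pi:           # rhs >= 1, so the loop terminates
            k1 = rhs(th, t); k2 = rhs(th + 0.5 * dt * k1, t + 0.5 * dt)
            k3 = rhs(th + 0.5 * dt * k2, t + 0.5 * dt); k4 = rhs(th + dt * k3, t + dt)
            th += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
            t += dt
        P = t                           # refined period estimate
    return P

def G(phi):
    return Q(phi) - Q(1.0 - phi)
```

With excitatory drive (V_eff > 0) the synapse can only advance the phase, so Q(φ) is below the uncoupled period 2π, and G(1/2) = 0 by symmetry.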
For the present model, we must resort to numerical solutions. Figure 6 shows the firing map G(φ) for g = V_eff = 1, δ = 0, α = β, as a function of α, the rate of the synapse. It appears that no matter what the rate of the synapse, the synchronous solution is unstable and the antiphase oscillation is stable. The opposite occurs for inhibition (V_eff < 0). Hansel et al. (1995) as well as van Vreeswijk et al. (1995) found that for sufficiently fast excitation, synchrony became stable. We can obtain a similar result in this model if we take into account the finite width of the action potential and thus set δ to some small positive value. Figure 7 shows the firing map for both excitatory and inhibitory coupling as a function of the synaptic rate. Consider excitatory coupling. As the synaptic rate increases, the synchronous state stabilizes and there is bistability between the synchronous and the antiphase solution. For very fast synapses, the antiphase solution becomes unstable and synchrony is the only stable state. A similar scenario occurs with inhibitory coupling; however, in this case, fast synaptic interactions result in the stability of the
Figure 6: The firing map G(φ) for excitatory coupling, g = V_eff = 1, δ = 0, α = β, for various values of α (the rate of the synapse; curves for α = 6, 12, and 20 are shown). Only the out-of-phase solution is stable.

antiphase state. One should not place too much emphasis on the details of the bifurcation structure as this is quite dependent on the choice of the synaptic gating function s(t). Thus, the picture for inhibitory coupling in Figure 7 is the same as that in van Vreeswijk et al. but for excitatory coupling our diagram is different. For the excitatory integrate-and-fire model analyzed by van Vreeswijk et al. and for the weakly coupled excitatory Connor model considered by Hansel et al. (1995) the transition from stable antiphase to stable synchrony occurs via an intermediate regime where neither is stable. Instead there is a stable nonzero phase-lag between the oscillators.

3.1 Some Comments on Averaging. The averaging method can be applied to a pair of oscillators as long as the coupling is very weak compared to the period of the oscillation. Since the oscillators we are
Figure 7: The firing map for excitatory and inhibitory coupling when the synapse starts slightly before the peak of the spike (δ = 0.05). (a) Excitatory coupling (V_eff = 1); (b) inhibitory (V_eff = −1). All other parameters as in Figure 6.

describing here have very long periods, the coupling must be extremely weak to rigorously apply averaging. Nevertheless, we can get some useful insights into the global picture for Type I membranes by looking at the averaged equations. Let g be small in equation 3.4 and let Z(θ) = (1 + cos θ) denote the response function to lowest order. Then one
can average the phase equations and we obtain the following equations for the phases, θ_k:

dθ₁/dτ = 1 + g H(θ₂ − θ₁)    (3.9)

dθ₂/dτ = 1 + g H(θ₁ − θ₂)    (3.10)

where

H(θ) = (V_eff/2π) ∫₀^{2π} Z(t) s(t + θ) dt    (3.11)

Letting φ = θ₂ − θ₁, we can subtract equation 3.9 from equation 3.10 and obtain

dφ/dτ = g[H(−φ) − H(φ)] = g Γ(φ)    (3.12)
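The averaged interaction of equation 3.11 is easy to compute by direct quadrature. The sketch below (assuming V_eff = 1 and an alpha-function synapse of hypothetical rate a = 3, one pulse per period) checks that with Z(t) = 1 + cos t the odd part of H is a pure sine, so Γ(φ) vanishes only at 0 and π:

```python
import numpy as np

# Quadrature check that Gamma(phi) = H(-phi) - H(phi) = -2 b1 sin(phi)
# when Z(t) = 1 + cos t.  V_eff = 1; s is a periodized alpha function
# with an assumed rate a = 3 (one pulse per 2*pi period).
a = 3.0
t = np.linspace(0.0, 2 * np.pi, 4000, endpoint=False)
Z = 1.0 + np.cos(t)
s_of = lambda u: a * a * (u % (2 * np.pi)) * np.exp(-a * (u % (2 * np.pi)))

def H(theta):
    # H(theta) = (1/2 pi) integral of Z(t) s(t + theta) dt (rectangle rule)
    return np.mean(Z * s_of(t + theta))

phis = np.linspace(0.1, np.pi - 0.1, 15)
gamma = np.array([H(-p) - H(p) for p in phis])
b1 = np.mean(s_of(t) * np.sin(t))   # first sine Fourier coefficient of s
```

Up to quadrature error, gamma agrees with −2 b1 sin φ, so the only zeros of Γ are synchrony and antiphase, no matter what s(t) is.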
The zeros of Γ(φ) are the phase-locked solutions and they are stable as long as Γ′(φ) < 0 or, equivalently, H′(φ) > 0. Note that Γ is proportional to the odd part of the function H. Now, for our lowest order model, Z(t) = 1 + cos t, so that H must be of the form H(θ) = a₀ + a₁ cos θ + b₁ sin θ and thus Γ(φ) = −2b₁ sin φ. The first thing to note is that there are only two possible phase-shifts (zeros of Γ) and they are φ = 0 and φ = π. Thus, no matter what s(t) is, the solutions to the averaged equations have no other possible phase-lags. The reason that Hansel and others have found these intermediate phase-lags in their averaged equations is that the response function Z(θ) has higher Fourier components. Indeed, our response function, 1 + cos θ, is only the lowest order term in the asymptotic expansion; higher order terms will generally contain modes such as cos 2θ. The coefficients of these higher orders determine the nature of the transition from synchrony to antiphase as the synaptic rate varies. A detailed analysis of this transition is given in van Vreeswijk et al. (1995).

4 Discussion and Conclusions
The calculations in the previous section suggest the general picture for a pair of weakly coupled Type I membrane oscillators as a function of the persistence of excitatory and inhibitory synapses. For slow enough synapses and excitatory interactions, the synchronous solution is unstable and the antiphase solution is stable. Under some circumstances, as the synapses speed up, the synchronous solution stabilizes and the antiphase solution loses stability. The reverse is true for inhibitory interactions. The mechanism for the change of stability (if it occurs) depends on higher order details of the model. We have shown that neural oscillators that arise from Type I excitability (that is, via a saddle-node bifurcation on a circle rather than via a Hopf bifurcation) have Type I, or nonnegative, phase resetting curves. Thus, we
have connected the local oscillator mechanisms to the behavior of such oscillators when connected to others. A natural question is whether oscillations that arise through Hopf bifurcations have Type II phase-resetting properties; that is, is the phase resetting curve positive for some phases and negative for others (cf. Fig. 3b and c)? For a supercritical Hopf bifurcation (that is, a stable small amplitude oscillation emanating from a fixed point) this question is easy to answer. Near the bifurcation point, the oscillation is sinusoidal and the adjoint is easily computed (see Ermentrout and Kopell 1984) to be of the form

Z(t) = Z₁ cos ωt + Z₂ sin ωt

where Z₁, Z₂ are constant vectors that depend on the properties of the membrane and ω is the natural frequency. Thus, we can say that for membranes that undergo a supercritical Hopf bifurcation, the phase response function is sinusoidal and is therefore a Type II phase resetting curve. (Many membranes can be put into this regime at high enough temperatures, but it is normally unusual.)

Appendix A: Numerical Simulation Parameters

The models used in this paper are the Hodgkin-Huxley model, the Connor model, and the dimensionless Morris-Lecar equations. All equations were integrated using a fourth-order Runge-Kutta method in the software XPP. Frequency plots were computed using a modified version of AUTO incorporated into XPP. The response functions and phase interaction functions were also computed using XPP. All simulation code is available from the author at [email protected]. The Hodgkin-Huxley equations used here are
C dV/dt = I − 120 m³h(V − 50) − 36 n⁴(V + 77) − 0.3(V + 54.4)
dm/dt = α_m(V)(1 − m) − β_m(V)m
dh/dt = α_h(V)(1 − h) − β_h(V)h
dn/dt = α_n(V)(1 − n) − β_n(V)n

where

α_m(V) = 0.1 × (V + 40)/{1 − exp[−(V + 40)/10]}
β_m(V) = 4 × exp[−(V + 65)/18]
α_h(V) = 0.07 × exp[−(V + 65)/20]
β_h(V) = 1/{1 + exp[−(V + 35)/10]}
α_n(V) = 0.01 × (V + 55)/{1 − exp[−(V + 55)/10]}
β_n(V) = 0.125 × exp[−(V + 65)/80]
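The system above can be cross-checked with a direct fourth-order Runge-Kutta integration (a sketch, not the XPP code used in the paper; the initial conditions and step size dt = 0.01 ms are arbitrary choices). With I = 12 the model fires repetitively:

```python
import numpy as np

def hh_rhs(y, I=12.0):
    """Right-hand side of the Hodgkin-Huxley equations listed above.
    (The removable singularities at V = -40 and V = -55 are ignored.)"""
    V, m, h, n = y
    am = 0.1 * (V + 40) / (1 - np.exp(-(V + 40) / 10))
    bm = 4 * np.exp(-(V + 65) / 18)
    ah = 0.07 * np.exp(-(V + 65) / 20)
    bh = 1 / (1 + np.exp(-(V + 35) / 10))
    an = 0.01 * (V + 55) / (1 - np.exp(-(V + 55) / 10))
    bn = 0.125 * np.exp(-(V + 65) / 80)
    dV = I - 120 * m**3 * h * (V - 50) - 36 * n**4 * (V + 77) - 0.3 * (V + 54.4)
    return np.array([dV, am * (1 - m) - bm * m,
                     ah * (1 - h) - bh * h, an * (1 - n) - bn * n])

def count_spikes(T=100.0, dt=0.01):
    y = np.array([-65.0, 0.05, 0.6, 0.32])   # near the resting state
    spikes = 0
    for _ in range(int(T / dt)):
        k1 = hh_rhs(y); k2 = hh_rhs(y + 0.5 * dt * k1)
        k3 = hh_rhs(y + 0.5 * dt * k2); k4 = hh_rhs(y + dt * k3)
        y_new = y + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        if y[0] < 0.0 <= y_new[0]:            # upward crossing of 0 mV
            spikes += 1
        y = y_new
    return spikes
```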
For the simulations in this paper, I = 12. The version of the Connor model we use here is the same as that used by Hansel et al. [except for a_∞(V), which is incorrectly defined in their paper]:

C dV/dt = I − 120 m³h(V − 55) − 20 n⁴(V + 72) − 0.3(V + 17) − 47.7 a³b(V + 75)

where

α_m(V) = 0.1 × (V + 29.7)/{1 − exp[−(V + 29.7)/10]}
β_m(V) = 4 × exp[−(V + 54.7)/18]
α_h(V) = 0.07 × exp[−(V + 48)/20]
β_h(V) = 1/{1 + exp[−(V + 18)/10]}
α_n(V) = 0.01 × (V + 46.7)/{1 − exp[−(V + 46.7)/10]}
β_n(V) = 0.125 × exp[−(V + 56.7)/80]

and the A-current gating variables a and b relax to their steady states with

a_∞(V) = {0.0761 × exp[(V + 94.22)/31.84]/{1 + exp[(V + 1.17)/28.93]}}^{1/3}
τ_a(V) = 0.3632 + 1.158/{1 + exp[(V + 55.96)/20.12]}
b_∞(V) = 1/{1 + exp[(V + 53.3)/14.54]}⁴
τ_b(V) = 1.24 + 2.678/{1 + exp[(V + 50)/16.027]}

For the simulations in this paper, I = 8.5.
The dimensionless Morris-Lecar equations are given in Section 1. The parameters used for "Type I" excitability dynamics are g_L = 0.5, g_K = 2, g_Ca = 1.33, V₁ = −0.01, V₂ = 0.15, V₃ = 0.1, V₄ = 0.145, V_Ca = 1, V_K = −0.7, V_L = −0.5 with I = 0.0695. The parameters for "Type II" dynamics are as above with the exception of g_Ca = 1.1, V₃ = 0.0, V₄ = 0.3, φ = 0.2, and I = 0.25.

Appendix B: Computation of the Response Function

Scalar Phase Model. We compute the response function for a scalar phase model of the form
dθ/dt = f(θ) + g(θ)S(t)

where S(t) is the stimulus inducing the phase-shift. Let θ₀(t) satisfy

dθ₀/dt = f(θ₀)

Since this models a phase oscillator, θ₀ monotonically increases (or decreases) in time since f is strictly positive (negative). Thus, we can introduce a new phase-variable, ψ, defined by θ(t) = θ₀(ψ). ψ satisfies

dψ/dt = 1 + [g(θ₀(ψ))/θ₀′(ψ)] S(t) = 1 + Z(ψ)S(t)

The expression multiplying S(t) is the response function. For our dynamics, f(θ) = q(1 − cos θ) + η(1 + cos θ) and g(θ) = (1 + cos θ), where η > 0. It is then an elementary application of trigonometric identities to see that Z(θ) = K(1 + cos θ) as required.
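This calculation can be checked numerically for hypothetical values q = 1, η = 0.25: integrate dθ₀/dt = f(θ₀), form Z = g(θ₀)/f(θ₀) along the solution, and compare with K(1 + cos) in the rescaled phase. Here the period is T = π/√(qη) and, for this parameterization, K works out to 1/(2η):

```python
import numpy as np

# Sketch: verify Z = K (1 + cos(2*pi*t/T)) along the unperturbed solution
# of the theta model, with assumed q = 1, eta = 0.25.
q, eta = 1.0, 0.25
f = lambda th: q * (1 - np.cos(th)) + eta * (1 + np.cos(th))
T = np.pi / np.sqrt(q * eta)            # period of the phase oscillator

dt = T / 20000
th, t = 0.0, 0.0
ts, zs = [], []
while t < 0.49 * T:                     # stay clear of the spike at t = T/2
    ts.append(t)
    zs.append((1 + np.cos(th)) / f(th)) # Z = g(theta0)/f(theta0)
    k1 = f(th); k2 = f(th + 0.5 * dt * k1)
    k3 = f(th + 0.5 * dt * k2); k4 = f(th + dt * k3)
    th += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    t += dt
ts, zs = np.array(ts), np.array(zs)
pred = (1 / (2 * eta)) * (1 + np.cos(2 * np.pi * ts / T))
```

The numerically computed zs and the closed form pred agree to within the integration error.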
Numerical Computation of the Response Function. It can be shown (Ermentrout and Kopell 1991) that the response function Z(t) is the adjoint eigenfunction for the linearization of the differential equations about the stable limit cycle. That is, let X(t) satisfy

dX/dt = F(X)

and suppose that X(t) is orbitally stable. The adjoint to the linearization satisfies

dZ/dt = −DF[X(t)]ᵀ Z(t)    (A.1)

where DF[X(t)] is the Jacobi matrix of F evaluated at X(t) and Aᵀ denotes the transpose of A. Since X(t) is orbitally stable, integration in forward time of the linearized equations will always relax to a periodic orbit. Thus, to find the periodic solution to equation A.1, we start with random initial conditions and integrate backward in time over several periods. This will relax to the adjoint, Z(t), which is then normalized so that

(1/T) ∫₀ᵀ Z(t) · X′(t) dt = 1

This method of computing the adjoint was suggested to the author by Graham Bowtell. The author's software package includes this algorithm so that all these calculations are readily automated. Error is introduced in the numerical calculations in two ways. The computation of the Jacobi matrix is done numerically, so for sharp spikes error can be introduced. Integration of the linear equations also produces the usual numerical errors.
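For the scalar phase model of the previous subsection the adjoint can be written down exactly (after normalization it is Z(t) = 1/f(θ(t))), which makes it a convenient test of the backward-integration recipe. A sketch, with assumed q = 1, η = 0.25 and a simple Euler step for the backward pass:

```python
import numpy as np

# Backward-in-time adjoint computation for dtheta/dt = f(theta).
# The exact, normalized adjoint is Z(t) = 1/f(theta(t)); parameters assumed.
q, eta = 1.0, 0.25
f  = lambda th: q * (1 - np.cos(th)) + eta * (1 + np.cos(th))
df = lambda th: (q - eta) * np.sin(th)       # f'(theta)
T  = np.pi / np.sqrt(q * eta)
N  = 20000; dt = T / N

# forward pass: store one period of the orbit (RK4)
th = np.empty(N); th[0] = 0.0
for k in range(N - 1):
    k1 = f(th[k]); k2 = f(th[k] + 0.5 * dt * k1)
    k3 = f(th[k] + 0.5 * dt * k2); k4 = f(th[k] + dt * k3)
    th[k + 1] = th[k] + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6

# backward pass: dZ/dt = -f'(theta(t)) Z from an arbitrary end value
Z = np.empty(N); Z[-1] = 1.0
for k in range(N - 1, 0, -1):
    Z[k - 1] = Z[k] + dt * df(th[k]) * Z[k]  # Euler step backward in time
Z /= np.mean(Z * f(th))                      # normalize (1/T) int Z X' dt = 1
```

After normalization, Z(t)·f(θ(t)) should be identically 1, which is a direct check that the backward integration has found the adjoint.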
Acknowledgments

Supported in part by NSF DMS-93-03706 and by NIMH-47150.
References

Connor, J. A., Walter, D., and McKown, R. 1977. Neural repetitive firing: Modifications of the Hodgkin-Huxley axon suggested by experimental results from crustacean axons. Biophys. J. 18, 81-102.
Destexhe, A., Mainen, Z., and Sejnowski, T. 1994. Synthesis of models for excitable membranes, synaptic transmission and neuromodulation using a common kinetic formulation. J. Comput. Neurosci. 1, 195-230.
Ermentrout, G. B., and Kopell, N. 1984. Frequency plateaus in a chain of weakly coupled oscillators I. SIAM J. Math. Anal. 15, 215-237.
Ermentrout, G. B., and Kopell, N. 1986. Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math. 46, 233-253.
Ermentrout, G. B., and Kopell, N. 1991. Multiple pulse interactions and averaging in systems of coupled neural oscillators. J. Math. Biol. 29, 195-217.
Hansel, D., Mato, G., and Meunier, C. 1993. Phase reduction in neural modeling. In Functional Analysis of the Brain Based on Multiple-Site Recordings, October 1992. Concepts Neurosci. 4, 192-210.
Hansel, D., Mato, G., and Meunier, C. 1995. Synchrony in excitatory neural networks. Neural Comp. 7, 307-335.
Hodgkin, A. L. 1948. The local electric changes associated with repetitive action in a non-medullated axon. J. Physiol. (London) 117, 500-544.
Kuramoto, Y. 1984. Chemical Oscillations, Waves, and Turbulence. Springer-Verlag, New York.
Morris, C., and Lecar, H. 1981. Voltage oscillations in the barnacle giant muscle fiber. Biophys. J. 35, 193-213.
Nayfeh, A. 1973. Perturbation Methods. John Wiley, New York.
Rinzel, J. R., and Ermentrout, G. B. 1989. Analysis of neural excitability and oscillations. In Methods in Neuronal Modeling, C. Koch and I. Segev, eds., pp. 135-169. MIT Press, Cambridge, MA.
van Vreeswijk, C., Abbott, L. F., and Ermentrout, G. B. 1995. When inhibition, not excitation, synchronizes neural firing. J. Comput. Neurosci. 1, 313-322.
Wang, X. J., Golomb, D., and Rinzel, J. 1995. Emergent spindle oscillations and intermittent burst firing in a thalamic model: Specific neuronal mechanisms. Proc. Natl. Acad. Sci. U.S.A. 92, 5577-5581.
Winfree, A. T. 1967. Biological rhythms and the behavior of populations of coupled oscillators. J. Theor. Biol. 16, 15-42.
Received July 14, 1995; accepted December 21, 1995.
and Inhibitory Connections. Neural Computation 18:5, 1111-1131. [Abstract] [PDF] [PDF Plus] 69. W. Govaerts , B. Sautois . 2006. Computation of the Phase Response Curve: A Direct Numerical ApproachComputation of the Phase Response Curve: A Direct Numerical Approach. Neural Computation 18:4, 817-847. [Abstract] [PDF] [PDF Plus] 70. Henry Abarbanel, Sachin Talathi. 2006. Neural Circuitry for Recognizing Interspike Interval Sequences. Physical Review Letters 96:14. . [CrossRef] 71. Philip Holmes, Robert J. Full, Dan Koditschek, John Guckenheimer. 2006. The Dynamics of Legged Locomotion: Models, Analyses, and Challenges. SIAM Review 48:2, 207. [CrossRef] 72. Jeff Moehlis, Eric Shea-Brown, Herschel Rabitz. 2006. Optimal Inputs for Phase Models of Spiking Neurons. Journal of Computational and Nonlinear Dynamics 1:4, 358. [CrossRef] 73. Dominique Martinez . 2005. Oscillatory Synchronization Requires Precise and Balanced Feedback Inhibition in a Model of the Insect Antennal LobeOscillatory Synchronization Requires Precise and Balanced Feedback Inhibition in a Model of the Insect Antennal Lobe. Neural Computation 17:12, 2548-2570. [Abstract] [PDF] [PDF Plus] 74. Máté Lengyel, Jeehyun Kwag, Ole Paulsen, Peter Dayan. 2005. Matching storage and recall: hippocampal spike timing–dependent plasticity and phase response curves. Nature Neuroscience 8:12, 1677-1683. [CrossRef] 75. Shinji Doi, Sadatoshi Kumagai. 2005. Generation of Very Slow Neuronal Rhythms and Chaos Near the Hopf Bifurcation in Single Neuron Models. Journal of Computational Neuroscience 19:3, 325-356. [CrossRef] 76. Dominique Martinez. 2005. Detailed and abstract phase-locked attractor network models of early olfactory systems. Biological Cybernetics 93:5, 355-365. [CrossRef] 77. Amanda Preyer, Robert Butera. 2005. Neuronal Oscillators in Aplysia californica that Demonstrate Weak Coupling In Vitro. Physical Review Letters 95:13. . [CrossRef] 78. Eun-Hyoung Park, Ernest Barreto, Bruce J. Gluckman, Steven J. 
Schiff, Paul So. 2005. A Model of the Effects of Applied Electric Fields on Neuronal Synchronization. Journal of Computational Neuroscience 19:1, 53-70. [CrossRef] 79. A. Tonnelier . 2005. Categorization of Neural Excitability Using Threshold ModelsCategorization of Neural Excitability Using Threshold Models. Neural Computation 17:7, 1447-1455. [Abstract] [PDF] [PDF Plus] 80. Takashi Kanamaru , Masatoshi Sekine . 2005. Synchronized Firings in the Networks of Class 1 Excitable Neurons with Excitatory and Inhibitory Connections and Their Dependences on the Forms of InteractionsSynchronized Firings in the Networks of Class 1 Excitable Neurons with Excitatory and Inhibitory Connections
and Their Dependences on the Forms of Interactions. Neural Computation 17:6, 1315-1338. [Abstract] [PDF] [PDF Plus] 81. Alexander Neiman, David Russell. 2005. Models of stochastic biperiodic oscillations and extended serial correlations in electroreceptors of paddlefish. Physical Review E 71:6. . [CrossRef] 82. Jonathan E. Rubin. 2005. Surprising Effects of Synaptic Excitation. Journal of Computational Neuroscience 18:3, 333-342. [CrossRef] 83. Roberto Galán, G. Ermentrout, Nathaniel Urban. 2005. Efficient Estimation of Phase-Resetting Curves in Real Neurons and its Significance for Neural-Network Modeling. Physical Review Letters 94:15. . [CrossRef] 84. Christoph Börgers , Nancy Kopell . 2005. Effects of Noisy Drive on Rhythms in Networks of Excitatory and Inhibitory NeuronsEffects of Noisy Drive on Rhythms in Networks of Excitatory and Inhibitory Neurons. Neural Computation 17:3, 557-608. [Abstract] [PDF] [PDF Plus] 85. Benjamin Pfeuty , Germán Mato , David Golomb , David Hansel . 2005. The Combined Effects of Inhibitory and Electrical Synapses in SynchronyThe Combined Effects of Inhibitory and Electrical Synapses in Synchrony. Neural Computation 17:3, 633-670. [Abstract] [PDF] [PDF Plus] 86. Robert Clewley, Horacio G. Rotstein, Nancy Kopell. 2005. A Computational Tool for the Reduction of Nonlinear ODE Systems Possessing Multiple Scales. Multiscale Modeling & Simulation 4:3, 732. [CrossRef] 87. Andrey Shilnikov, Gennady Cymbalyuk. 2005. Transition between Tonic Spiking and Bursting in a Neuron Model via the Blue-Sky Catastrophe. Physical Review Letters 94:4. . [CrossRef] 88. Zsófia Huhn, Gergő Orbán, Péter Érdi, Máté Lengyel. 2005. Theta oscillation-coupled dendritic spiking integrates inputs on a long time scale. Hippocampus 15:7, 950-962. [CrossRef] 89. R. Guantes, Gonzalo de Polavieja. 2005. Variability in noise-driven integrator neurons. Physical Review E 71:1. . [CrossRef] 90. Giancarlo La Camera , Alexander Rauch , Hans-R. 
Lüscher , Walter Senn , Stefano Fusi . 2004. Minimal Models of Adapted Neuronal Response to In Vivo–Like Input CurrentsMinimal Models of Adapted Neuronal Response to In Vivo–Like Input Currents. Neural Computation 16:10, 2101-2124. [Abstract] [PDF] [PDF Plus] 91. T. Kanamaru, M. Sekine. 2004. An Analysis of Globally Connected Active Rotators With Excitatory and Inhibitory Connections Having Different Time Constants Using the Nonlinear Fokker–Planck Equations. IEEE Transactions on Neural Networks 15:5, 1009-1017. [CrossRef] 92. E.M. Izhikevich. 2004. Which Model to Use for Cortical Spiking Neurons?. IEEE Transactions on Neural Networks 15:5, 1063-1070. [CrossRef]
93. Peter E. Latham, Sheila Nirenberg. 2004. Computing and Stability in Cortical NetworksComputing and Stability in Cortical Networks. Neural Computation 16:7, 1385-1412. [Abstract] [PDF] [PDF Plus] 94. Eric Brown , Jeff Moehlis , Philip Holmes . 2004. On the Phase Reduction and Response Dynamics of Neural Oscillator PopulationsOn the Phase Reduction and Response Dynamics of Neural Oscillator Populations. Neural Computation 16:4, 673-715. [Abstract] [PDF] [PDF Plus] 95. T Takekawa, T Aoyagi, T Fukai. 2004. Influences of synaptic location on the synchronization of rhythmic bursting neurons. Network: Computation in Neural Systems 15:1, 1-12. [CrossRef] 96. R. M. Ghigliazza, P. Holmes. 2004. A Minimal Model of a Central Pattern Generator and Motoneurons for Insect Locomotion. SIAM Journal on Applied Dynamical Systems 3:4, 671. [CrossRef] 97. Jonathan Drover, Jonathan Rubin, Jianzhong Su, Bard Ermentrout. 2004. Analysis of a Canard Mechanism by Which Excitatory Synaptic Coupling Can Synchronize Neurons at Low Firing Frequencies. SIAM Journal on Applied Mathematics 65:1, 69. [CrossRef] 98. Bard Ermentrout . 2003. Dynamical Consequences of Fast-Rising, Slow-Decaying Synapses in Neuronal NetworksDynamical Consequences of Fast-Rising, Slow-Decaying Synapses in Neuronal Networks. Neural Computation 15:11, 2483-2522. [Abstract] [PDF] [PDF Plus] 99. Jan Benda , Andreas V. M. Herz . 2003. A Universal Model for Spike-Frequency AdaptationA Universal Model for Spike-Frequency Adaptation. Neural Computation 15:11, 2523-2564. [Abstract] [PDF] [PDF Plus] 100. Nicolas Brunel , Peter E. Latham . 2003. Firing Rate of the Noisy Quadratic Integrate-and-Fire NeuronFiring Rate of the Noisy Quadratic Integrate-and-Fire Neuron. Neural Computation 15:10, 2281-2306. [Abstract] [PDF] [PDF Plus] 101. Gianluigi Mongillo, Daniel J. Amit, Nicolas Brunel. 2003. Retrospective and prospective persistent activity induced by Hebbian learning in a recurrent cortical network. 
European Journal of Neuroscience 18:7, 2011-2024. [CrossRef] 102. Jason Ritt. 2003. Evaluation of entrainment of a nonlinear neural oscillator to white noise. Physical Review E 68:4. . [CrossRef] 103. Masaki Nomura , Tomoki Fukai , Toshio Aoyagi . 2003. Synchrony of Fast-Spiking Interneurons Interconnected by GABAergic and Electrical SynapsesSynchrony of Fast-Spiking Interneurons Interconnected by GABAergic and Electrical Synapses. Neural Computation 15:9, 2179-2198. [Abstract] [PDF] [PDF Plus] 104. Benjamin Lindner , André Longtin , Adi Bulsara . 2003. Analytic Expressions for Rate and CV of a Type I Neuron Driven by White Gaussian NoiseAnalytic Expressions for Rate and CV of a Type I Neuron Driven by White Gaussian Noise. Neural Computation 15:8, 1761-1788. [Abstract] [PDF] [PDF Plus]
105. Toshio Aoyagi , Takashi Takekawa , Tomoki Fukai . 2003. Gamma Rhythmic Bursts: Coherence Control in Networks of Cortical Pyramidal NeuronsGamma Rhythmic Bursts: Coherence Control in Networks of Cortical Pyramidal Neurons. Neural Computation 15:5, 1035-1061. [Abstract] [PDF] [PDF Plus] 106. M. Denman-Johnson, S. Coombes. 2003. Continuum of weakly coupled oscillatory McKean neurons. Physical Review E 67:5. . [CrossRef] 107. Christoph Börgers , Nancy Kopell . 2003. Synchronization in Networks of Excitatory and Inhibitory Neurons with Sparse, Random ConnectivitySynchronization in Networks of Excitatory and Inhibitory Neurons with Sparse, Random Connectivity. Neural Computation 15:3, 509-538. [Abstract] [PDF] [PDF Plus] 108. D. Hansel , G. Mato . 2003. Asynchronous States and the Emergence of Synchrony in Large Networks of Interacting Excitatory and Inhibitory NeuronsAsynchronous States and the Emergence of Synchrony in Large Networks of Interacting Excitatory and Inhibitory Neurons. Neural Computation 15:1, 1-56. [Abstract] [PDF] [PDF Plus] 109. Bard Ermentrout, Jonathan D. Drover. 2003. Nonlinear Coupling near a Degenerate Hopf (Bautin) Bifurcation. SIAM Journal on Applied Mathematics 63:5, 1627. [CrossRef] 110. Eugene M. Izhikevich, Frank C. Hoppensteadt. 2003. Slowly Coupled Oscillators: Phase Dynamics and Synchronization. SIAM Journal on Applied Mathematics 63:6, 1935. [CrossRef] 111. Sorinel A. Oprisan , Carmen C. Canavier . 2002. The Influence of Limit Cycle Topology on the Phase Resetting CurveThe Influence of Limit Cycle Topology on the Phase Resetting Curve. Neural Computation 14:5, 1027-1057. [Abstract] [PDF] [PDF Plus] 112. Jianfeng Feng , Guibin Li . 2002. Impact of Geometrical Structures on the Output of Neuronal Models: A Theoretical and Numerical AnalysisImpact of Geometrical Structures on the Output of Neuronal Models: A Theoretical and Numerical Analysis. Neural Computation 14:3, 621-640. [Abstract] [PDF] [PDF Plus] 113. 
Bard Ermentrout, Jonathan Rubin, Remus Osan. 2002. Regular Traveling Waves in a One-Dimensional Network of Theta Neurons. SIAM Journal on Applied Mathematics 62:4, 1197. [CrossRef] 114. Jianfeng Feng, Gang Wei. 2001. Journal of Physics A: Mathematical and General 34:37, 7493-7509. [CrossRef] 115. Carlo R. Laing , Carson C. Chow . 2001. Stationary Bumps in Networks of Spiking NeuronsStationary Bumps in Networks of Spiking Neurons. Neural Computation 13:7, 1473-1494. [Abstract] [PDF] [PDF Plus] 116. Bard Ermentrout , Matthew Pascal , Boris Gutkin . 2001. The Effects of Spike Frequency Adaptation and Negative Feedback on the Synchronization of Neural OscillatorsThe Effects of Spike Frequency Adaptation and Negative Feedback on
the Synchronization of Neural Oscillators. Neural Computation 13:6, 1285-1310. [Abstract] [PDF] [PDF Plus] 117. L. Neltner , D. Hansel . 2001. On Synchrony of Weakly Coupled Neurons at Low Firing RateOn Synchrony of Weakly Coupled Neurons at Low Firing Rate. Neural Computation 13:4, 765-774. [Abstract] [PDF] [PDF Plus] 118. Gennady S. Cymbalyuk , Girish N. Patel , Ronald L. Calabrese , Stephen P. DeWeerth , Avis H. Cohen . 2000. Modeling Alternation to Synchrony with Inhibitory Coupling: A Neuromorphic VLSI ApproachModeling Alternation to Synchrony with Inhibitory Coupling: A Neuromorphic VLSI Approach. Neural Computation 12:10, 2259-2278. [Abstract] [PDF] [PDF Plus] 119. L. Neltner , D. Hansel , G. Mato , C. Meunier . 2000. Synchrony in Heterogeneous Networks of Spiking NeuronsSynchrony in Heterogeneous Networks of Spiking Neurons. Neural Computation 12:7, 1607-1641. [Abstract] [PDF] [PDF Plus] 120. Jan Karbowski , Nancy Kopell . 2000. Multispikes and Synchronization in a Large Neural Network with Temporal DelaysMultispikes and Synchronization in a Large Neural Network with Temporal Delays. Neural Computation 12:7, 1573-1606. [Abstract] [PDF] [PDF Plus] 121. Eugene M. Izhikevich. 2000. Phase Equations for Relaxation Oscillators. SIAM Journal on Applied Mathematics 60:5, 1789. [CrossRef] 122. E.M. Izhikevich. 1999. Class 1 neural excitability, conventional synapses, weakly connected networks, and mathematical foundations of pulse-coupled models. IEEE Transactions on Neural Networks 10:3, 499-507. [CrossRef] 123. E.M. Izhikevich. 1999. Weakly pulse-coupled oscillators, FM interactions, synchronization, and oscillatory associative memory. IEEE Transactions on Neural Networks 10:3, 508-526. [CrossRef] 124. Boris S. Gutkin , G. Bard Ermentrout . 1998. 
Dynamics of Membrane Excitability Determine Interspike Interval Variability: A Link Between Spike Generation Mechanisms and Cortical Spike Train StatisticsDynamics of Membrane Excitability Determine Interspike Interval Variability: A Link Between Spike Generation Mechanisms and Cortical Spike Train Statistics. Neural Computation 10:5, 1047-1065. [Abstract] [PDF] [PDF Plus] 125. Sharon M. Crook, G. Bard Ermentrout, James M. Bower. 1998. Spike Frequency Adaptation Affects the Synchronization Properties of Networks of Cortical OscillatorsSpike Frequency Adaptation Affects the Synchronization Properties of Networks of Cortical Oscillators. Neural Computation 10:4, 837-854. [Abstract] [PDF] [PDF Plus]
Communicated by Geoffrey Goodhill
On Neurodynamics with Limiter Function and Linsker's Developmental Model

Jianfeng Feng* Mathematisches Institut, Universität München, Theresienstr. 39, D-80333 München, Germany
Hong Pan Vwani P. Roychowdhury School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907-1285 USA
The limiter function is used in many learning and retrieval models as the constraint controlling the magnitude of the weight or state vectors. In this paper, we developed a new method to relate the set of saturated fixed points to the set of system parameters of the models that use the limiter function, and then, as a case study, applied this method to Linsker's Hebbian learning network. We derived a necessary and sufficient condition to test whether a given saturated weight or state vector is stable or not for any given set of system parameters, and used this condition to determine the whole regime in the parameter space over which the given state is stable. This approach allows us to investigate the relative stability of the major receptive fields reported in Linsker's simulations, and to demonstrate the crucial role played by the synaptic density functions.

1 Introduction
The limiter function (also referred to as the saturating linear function, or the piecewise linear sigmoidal function) of the following form:

f(x) = w_max if x > w_max,  f(x) = x if w_min ≤ x ≤ w_max,  f(x) = w_min if x < w_min   (1.1)

is widely used as the constraint limiting the size of the weight or the state vectors in several learning and retrieval models [e.g., Linsker's and Miller's self-organization networks and the Brain-State-in-a-Box (BSB) model]. A class of models that use the limiter function can be defined

*Current address: Statistics Group, The Babraham Institute, Babraham, Cambridge CB2 4AT, United Kingdom.
Neural Computation 8, 1003-1019 (1996) © 1996 Massachusetts Institute of Technology
by a system of first-order nonlinear difference equations of the following form:
w_{τ+1}(i) = f[w_τ(i) + h_i(w_τ)],  ∀ i = 1, …, N   (1.2)

where N is the number of weights or neurons, the limiter function f(·) defines a hypercube Ω = [w_min, w_max]^N, the weight or the state vector w_τ = {w_τ(i), i = 1, …, N} ∈ Ω, and h_i(·) is some continuous parameterized function specified by the model under consideration. Most commonly, the saturated fixed point attractors² (i.e., the saturated stable equilibria) of equation 1.2 represent the outcomes of a learning or retrieval process, e.g., connection patterns or associative memories (also see Miller and MacKay 1994). To predict the outcomes of these models, one needs to understand the dynamical mechanisms of equation 1.2. In general, however, it is intractable to completely characterize the behavior of a nonlinear system. The difficulties lie both in determining the set of terminal attractors and in characterizing their basins of attraction in the weight space (for learning models) or the state space (for retrieval models). As reviewed later in this section, partial information about the dynamical mechanism of this class of models can be obtained by various methods, and we next discuss one of the desirable and dynamically informative approaches, which aims at studying the parameter space. The configuration of the state space of the dynamics in equation 1.2 is determined by the system parameters embodied in h_i(·). As observed in various simulations (Erwin et al. 1995; Miller 1996), and as is to be expected for a nonlinear dynamical system, a model of the class stated in equation 1.2 can have a group of coexisting attractors for a given set of system parameters, and different groups of coexisting attractors for different sets of system parameters. Thus, a given pattern could be an attractor of the dynamics under certain choices of the system parameters, and could be unstable under other choices.
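To make this class of dynamics concrete, here is a minimal Python sketch of equations 1.1 and 1.2. The particular h used below (a fixed symmetric linear interaction A) is a hypothetical illustration only, not one of the models cited above:

```python
import numpy as np

W_MIN, W_MAX = -1.0, 1.0

def limiter(x):
    # Equation 1.1: identity on [W_MIN, W_MAX], saturated outside.
    return np.clip(x, W_MIN, W_MAX)

def iterate(w, h, steps=200):
    # Equation 1.2: w_{tau+1}(i) = f[w_tau(i) + h_i(w_tau)].
    for _ in range(steps):
        w = limiter(w + h(w))
    return w

# Hypothetical h: a fixed symmetric linear interaction (illustration only).
A = np.array([[0.10, 0.05],
              [0.05, 0.10]])
w_final = iterate(np.array([0.3, -0.2]), lambda w: A @ w)
print(w_final)  # the trajectory saturates at a corner of the hypercube
```

For this choice the iteration settles at the saturated corner (1, 1); which corner is reached depends on the initial vector, i.e., on its basin of attraction.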
The system parameters thus have state-related thresholds (also referred to as critical values of the parameters), which define the stable and unstable parameter regimes in the parameter space for the corresponding patterns or states. Hence, a characterization of the parameter space would enable one to predict the relationship between the different sets of system parameters and the corresponding fixed point attractors. In this paper, we developed a new method to relate the set of saturated fixed points to the set of system parameters, and then, as a case study, applied this method to Linsker's Hebbian learning network. In particular, we showed that it is possible to derive a necessary and sufficient
²For Linsker's and Miller's correlation-based developmental models (see Linsker 1986; Miller and Stryker 1989), for example, the Hebbian rule ensures that the elements of the weight vector will increase or decrease to the upper or the lower bounds. For the BSB model, the divergence of all trajectories is ensured by a diagonal-dominant and/or symmetric weight matrix satisfying certain conditions (see Golden 1986, 1993; Goles and Martinez 1990; Greenberg 1988; Hui and Zak 1992).
condition to test whether a given weight or state vector is a saturated fixed point attractor for any given set of system parameters, without loss of mathematical rigor. In terms of this condition, one can assert the potential occurrence of a fixed point attractor when the system parameters are chosen in its stable parameter regime, or the instability of this fixed point when the choice of parameters lies outside that parameter regime. Using this scheme, we derived a new rigorous criterion for the division of stable parameter regimes in which Linsker's network has the potential to develop specially designated connection patterns [also referred to as afferent receptive fields (aRFs)], and used the criterion to investigate the relative stability of several types of dominant aRFs that were reported in Linsker's simulations or appeared as the major eigenfunctions studied in MacKay and Miller (1990a,b). Before we present the details of our method, it is instructive to review two kinds of conventional approaches that have been employed in the stability analysis of this class of models:

• Liapunov's direct method, as shown in Golden (1986, 1993) and Linsker (1986, 1988b). Usually, however, an Invariant Set Theorem type of analysis is not practically informative for our purposes, in the sense that it barely provides any prediction about the relationship between the system parameters and the set of attractors.

• The linearization approach in terms of the properties of the eigenvectors and eigenvalues, as presented in MacKay and Miller (1990a,b) and Miller (1990, 1996). By assuming that the principal features of the dynamics are established before the weight boundaries are reached, the short-time evolution of weight vectors in Linsker's and Miller's weight dynamics can be characterized in terms of the properties of the eigenvectors and eigenvalues associated with the linear system w_{τ+1}(i) = w_τ(i) + h_i(w_τ), ∀ i = 1, …, N. With this approach, a number of strikingly informative results that explain Linsker's and Miller's simulations (for example, a general spectrum of eigenvalues and a few types of dominant eigenfunctions) have been obtained. This approach, however, does not clearly show the effects of all the parameters known to play crucial roles in the models; e.g., for Linsker's network, the role played by the synaptic arbor density between two consecutive layers has not been addressed. The explicit stability analysis in the parameter space, as presented in this paper, can provide such information and could be complementary to the linearization method (see further discussions in Section 3.2).
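The linearization step can be illustrated in a few lines: before any weight saturates, a linear h(w) = Aw gives w_{τ+1} = (I + A)w_τ, so the early evolution is dominated by the leading eigenvector of A. The matrix A below is a hypothetical stand-in for the covariance-based operators of the actual models:

```python
import numpy as np

# Hypothetical interaction matrix (stand-in for the covariance-based
# operator in Linsker's or Miller's models).
A = np.array([[0.10, 0.05],
              [0.05, 0.10]])

# Pre-saturation dynamics: w_{tau+1} = (I + A) w_tau, so growth along each
# eigenvector of A is governed by 1 + (its eigenvalue).
eigvals, eigvecs = np.linalg.eigh(A)
leading = eigvecs[:, np.argmax(eigvals)]
print(eigvals)   # [0.05 0.15]: both modes grow, the second one fastest
print(leading)   # the dominant eigenvector predicts the emerging pattern
```

The limitation noted above is also visible here: the spectrum of A says nothing about where in parameter space a given saturated pattern is stable, which is what the parameter-space analysis of this paper supplies.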
Due to space limitations, the rigorous proofs and other technical details of our results cannot be included here; they can be found in Feng et al. (1995b).
2 General Theorem about Saturated Fixed Point Attractors
The key observation in this approach is fairly direct, and is based on the special form of the nonlinear function f(·) in equation 1.1, which is a strictly increasing function in its linear region. It is well known that a fixed point or an equilibrium state of equation 1.2 satisfies w_τ(i) = f[w_τ(i) + h_i(w_τ)], ∀ i = 1, …, N. One can now verify that if a weight or state vector w is a solution of this fixed point equation and h_i(w) ≠ 0 ∀ i = 1, …, N,³ then (1) w is a saturated fixed point, and (2) w(i) + h_i(w) > w_max or w(i) + h_i(w) < w_min must be satisfied for all i = 1, …, N. That is, any generic fixed point w must be a saturated fixed point, and must satisfy h_i(w) > 0 ∀ i with w(i) = w_max, and h_i(w) < 0 ∀ i with w(i) = w_min. This gives a necessary and sufficient condition for checking whether a given weight or state vector is a generic fixed point of equation 1.2. By using the above idea together with stability considerations, we derived Theorem 1 (proven in Feng et al. 1995b) as follows.
Theorem 1. The whole set of generic fixed point attractors (denoted as Ω_FPA) of the dynamics in equation 1.2 with the limiter function in equation 1.1 is given by

Ω_FPA = {w | h_i(w) > 0 ∀ w(i) = w_max,  h_i(w) < 0 ∀ w(i) = w_min,  1 ≤ i ≤ N}   (2.1)

where the weight or state vector w belongs to the set of all extreme points of the hypercube Ω, denoted as V(Ω). □

Based on the above general theorem, one can further explore the internal structure of h_i(w), and derive more specific criteria, as demonstrated in the following section.

3 A Qualitative Analysis of Linsker-Type Hebbian Learning
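Before specializing to Linsker's network, note that for small N Theorem 1 is directly mechanizable: one can enumerate the 2^N extreme points of Ω and apply condition 2.1. A minimal sketch, where the linear h is the same kind of hypothetical illustration as before, not Linsker's h:

```python
import itertools
import numpy as np

W_MIN, W_MAX = -1.0, 1.0

def generic_attractors(h, n):
    # Theorem 1 / condition 2.1: a corner w of the hypercube Omega is a
    # generic fixed point attractor iff h_i(w) > 0 wherever w(i) = W_MAX
    # and h_i(w) < 0 wherever w(i) = W_MIN.
    found = []
    for corner in itertools.product([W_MIN, W_MAX], repeat=n):
        w = np.array(corner)
        hw = h(w)
        if np.all(np.where(w == W_MAX, hw > 0, hw < 0)):
            found.append(w)
    return found

# Hypothetical linear h(w) = A @ w (illustration only).
A = np.array([[0.10, 0.05],
              [0.05, 0.10]])
corners = generic_attractors(lambda w: A @ w, 2)
print(len(corners))  # 4: every corner is an attractor for this h
```

The check costs 2^N evaluations of h, which is why the specific criteria derived below, stated directly in terms of the system parameters, are needed for anything beyond toy sizes.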
3.1 The Criterion for the Division of Parameter Regimes. In Linsker's network, each cell in the present layer M receives synaptic inputs from a number of cells in the preceding layer L, and the dynamical equation for the development of the synaptic strength w_τ(i) between an M-cell and the ith L-cell at time τ is

w_{τ+1}(i) = f{ w_τ(i) + k_1 + Σ_{j=1}^{N_L} [Q^L_{ij} + k_2] r(j) w_τ(j) }   (3.1)
³Note that if w is a fixed point and h_i(w) = 0 for some i ∈ {1, …, N}, then this fixed point might be unsaturated. But the cases when h_i(w) = 0 for some i ∈ {1, …, N} are nongeneric for the models under consideration (e.g., see Anderson et al. 1977; Erwin et al. 1995; Golden 1986, 1993; Linsker 1986, 1988a,b; Miller 1996; Miller and Stryker 1989). From now on, we shall call a fixed point w with h_i(w) ≠ 0 ∀ i = 1, …, N a generic fixed point, and a fixed point w with h_i(w) = 0 for some i ∈ {1, …, N} a nongeneric fixed point of equation 1.2.
where k_1, k_2 are system parameters that are particular combinations of the constants of the Hebbian rule, N_L is the total number of the cells in layer L, r(·) is a nonnegative normalized synaptic density function (SDF)⁴ that satisfies Σ_{i∈L} r(i) = 1, and f(·) is the limiter function in equation 1.1 with w_max = 1 and w_min = -1.⁵ The full-rank covariance matrix Q^L = {Q^L_{ij}, i, j = 1, …, N_L} of the layer L describes the correlation of activities of the L-cells, and is determined by the SDFs r(·)s of all layers preceding layer L. A given SDF r^{LM}(k, j), ∀ k ∈ M, j ∈ L, will be said to have a range r_M if r^{LM}(k, j) is sufficiently small for ||k - j|| ≥ r_M. For instance, if we assume the SDF to be gaussian, i.e., r^{LM}(k, j) ∝ exp(-||k - j||²/2r_M²), ∀ k ∈ M, j ∈ L, then the standard deviation r_M of this function is its range. Note that the dynamics of Linsker's network, as stated in equation 3.1, is in the form of equation 1.2, where h_i(w_τ) = k_1 + Σ_{j=1}^{N_L} [Q^L_{ij} + k_2] r(j) w_τ(j), and w_τ = {w_τ(j), j = 1, …, N_L}. The simulation results reported in Linsker (1986, 1988a,b) showed that for appropriate parameter regimes, several kinds of structured connection patterns (e.g., center-surround and oriented aRFs) occur progressively as the Hebbian evolution of the weights is carried out layer by layer. Moreover, Linsker (1986) proved and observed that the only possible outcomes of the dynamics of Linsker's network are all, or all but one, saturated patterns. Hence, the analysis in Section 2 on generic fixed points can be directly applied to study the outcomes of Linsker's network.⁶ Theorem 1, then, implies that the whole set of saturated fixed point attractors of the dynamics in equation 3.1 is
Ω_FPA = {w | h_i(w) w(i) > 0, 1 ≤ i ≤ N}.   (3.2)

Putting in the explicit form of h_i(w) in equation 3.1, we can directly derive an explicit necessary and sufficient condition for the emergence of various aRFs in the following theorem.

Definition 1. For any w ∈ V(Ω), we define J^+(w) = {i | w(i) = 1} as the index set of cells at layer L with excitatory weight for a connection

⁴The SDF is explicitly incorporated into the dynamics (equation 3.1), which is equivalent to Linsker's formulation. A rigorous explanation for this equivalence is given in MacKay and Miller (1990a). ⁵From the procedures of our proofs in Feng et al. (1995b), one can easily verify that the results that come later in this paper are valid for the case where w_max = n_EM and w_min = n_EM - 1, which are the constants of the limiter function used in Linsker's simulations (Linsker 1986). ⁶The exclusion of all possible nongeneric fixed points in our analysis can be further justified as follows: (1) Nongeneric saturated fixed points can be ignored in the parameter space: for a given nongeneric saturated fixed point, the set of parameters (k_1, k_2) satisfying h_i(w) = 0 (for some i) is at most a straight line in the (k_1, k_2) space. Since there are at most finitely many nongeneric saturated fixed points, the stable regime of all possible such fixed points has measure zero in the (k_1, k_2) space. (2) The fixed points with all but one of the weights saturated will be sufficiently addressed by the analysis of their generic saturated counterparts: the stable parameter regime for a given all-but-one saturated pattern is approximately a subset of the stable parameter regime of its corresponding generic saturated pattern, as shown in Feng et al. (1995b).
pattern w of an M-cell, and J^-(w) = {i | w(i) = -1} as the index set of L-cells with inhibitory weight for w. □
Definition 2. We define the slope function:

c(w) := Σ_{i∈L} r(i) w(i) = Σ_{j∈J^+(w)} r(j) - Σ_{j∈J^-(w)} r(j),

which is the difference of the sums of the SDF r(·) over J^+(w) and J^-(w), and is also the average synaptic strength of the connection pattern w; and two k_1-intercept functions:

d_1(w) := max_{i∈J^+(w)} [- Σ_{j∈L} Q^L_{ij} r(j) w(j)] = max_{i∈J^+(w)} [Σ_{j∈J^-(w)} Q^L_{ij} r(j) - Σ_{j∈J^+(w)} Q^L_{ij} r(j)] if J^+(w) ≠ ∅, and d_1(w) := -∞ if J^+(w) = ∅;

d_2(w) := min_{i∈J^-(w)} [- Σ_{j∈L} Q^L_{ij} r(j) w(j)] = min_{i∈J^-(w)} [Σ_{j∈J^-(w)} Q^L_{ij} r(j) - Σ_{j∈J^+(w)} Q^L_{ij} r(j)] if J^-(w) ≠ ∅, and d_2(w) := +∞ if J^-(w) = ∅. □
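Definition 2 transcribes directly into code. In the sketch below, Q and r are hypothetical placeholders (an uncorrelated layer and a uniform SDF) chosen only to exercise the formulas, and the -∞/+∞ values for empty index sets are our reading of the boundary cases:

```python
import numpy as np

def slope_and_intercepts(w, Q, r):
    # w is a saturated pattern with entries +1/-1; J+ and J- are the
    # excitatory and inhibitory index sets of Definition 1.
    J_plus, J_minus = w > 0, w < 0
    # Slope function c(w): average synaptic strength of the pattern.
    c = r[J_plus].sum() - r[J_minus].sum()
    # g[i] = -sum_j Q[i, j] r[j] w[j]
    g = -(Q * r * w).sum(axis=1)
    # k1-intercept functions; empty index sets give the vacuous bounds
    # -inf / +inf (our reading of Definition 2's boundary cases).
    d1 = g[J_plus].max() if J_plus.any() else -np.inf
    d2 = g[J_minus].min() if J_minus.any() else np.inf
    return c, d1, d2

Q = np.eye(2)             # hypothetical: uncorrelated layer L
r = np.array([0.5, 0.5])  # hypothetical: uniform normalized SDF
c, d1, d2 = slope_and_intercepts(np.array([1.0, -1.0]), Q, r)
print(c, d1, d2)  # 0.0 -0.5 0.5
```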
Theorem 2. For any two consecutive layers in Linsker's network, a connection pattern w is a saturated fixed point attractor of equation 3.1 if and only if

d_2(w) > k_1 + c(w) k_2 > d_1(w).   (3.3)

Hence, if d_2(w) < d_1(w), then the connection pattern w is not an attractor of equation 3.1. □

Theorem 2 states that for any given set of SDFs, the parameter regime of (k_1, k_2) that ensures that a structured aRF w is an attractor of equation 3.1 is a band between two parallel lines k_1 + c(w) k_2 > d_1(w) and k_1 + c(w) k_2 < d_2(w) (see Regime E and Regime F in Fig. 1).⁷ Note that if d_2(w) < d_1(w), then there is no regime of (k_1, k_2) for the occurrence of that aRF w as an attractor of equation 3.1. Therefore, between any two consecutive layers in Linsker's network, the existence of a structured aRF w as an attractor of equation 3.1 is determined by the k_1-intercept functions d_1(·) and d_2(·), and therefore by the SDFs r(·)s of all preceding layers and the present layer.⁸ Using this condition, one can obtain, without referring to equation
⁷We shall call any afferent receptive field, except for the all-excitatory and the all-inhibitory connection patterns, a structured aRF. ⁸The definition of the slope function c(w) implies that it depends only on the SDF r(·) between the two successive layers under consideration and does not relate to the SDFs r(·)s of the other preceding layers. The two k_1-intercept functions d_1(w) and d_2(w) embody the dependence of equation 3.1 on the covariance matrix Q^L of the preceding layer L, and the SDF r(·) between the two successive layers under consideration (i.e., between layers M and L). Therefore, these two k_1-intercept functions are determined by the SDFs of all preceding layers and the present layer.
Neurodynamics with Limiter Function
3.1, the whole parameter regime of (k1, k2, r_{L_n}, r_{L_{n−1}}, ..., r_{L_1}) for an n-layer Linsker's network in which a given connection pattern between layer L_{n−1} and layer L_n will be an attractor of equation 3.1.

Unlike for any other aRF, there always exists a stable parameter regime for the all-excitatory and the all-inhibitory connection patterns. We denote the k1-intercept function d1(w) for the all-excitatory aRF as d1(+) and d2(w) for the all-inhibitory aRF as d2(−), respectively. From the above theorem and the definition of the k1-intercept functions, the all-excitatory aRF is an attractor of equation 3.1 when k1 + k2 > d1(+), and so is the all-inhibitory aRF when k1 − k2 < d2(−). Thus the parameter plane of (k1, k2) is divided into four regimes by these two criteria, and the regime (see Regime D in Table 1 and Fig. 1) determined by k1 + k2 < d1(+) and k1 − k2 > d2(−) is where neither the all-excitatory nor the all-inhibitory connection patterns are stable. Although the values of d1(·) and d2(·) [−1 ≤ d1(·), d2(·) ≤ 1] may change from layer to layer, the division of the parameter regimes shown in Figure 1 remains invariant. We summarize the principal parameter regimes from our general theorem applicable to all layers in Table 1 and Figure 1.

In summary, Figure 1 provides a general and yet precise picture on the (k1, k2) plane of the stable regimes of aRFs between any two consecutive layers of Linsker's network. Clearly, to obtain exact information about, and to manipulate, the stable regimes, one needs to incorporate the ranges of the SDFs as well; this is demonstrated in Section 3.2. Figure 1 also underscores the need for approaches such as linearization (see MacKay and Miller 1990a,b), which enables one to identify the dominant aRFs and thus obviates the need to study the stability of all possible aRFs.
3.2 Parameter Regimes for aRFs in the First Three Layers. Based on our general theorem applicable to all layers, we confine ourselves to synaptic connections in the first three layers of Linsker's network. Denote the SDFs from layer A to B and from B to C as r^{AB}(·,·) with range r_B and r^{BC}(·) with range r_C, respectively. The emergence of various aRFs in the first three layers has been previously studied in the literature (Linsker 1986, 1988a,b; MacKay and Miller 1990a,b; Tang 1989), and in this paper we mention only the following new results made possible by our approach.

3.2.1 Development of Connections between Layers A and B. As in Linsker's simulations, we assume that the random input at the first layer A has an independent normal distribution with mean 0 and variance 1. That is, Q^A_{ij} = 1 if i = j, and Q^A_{ij} = 0 if i ≠ j. Hence, applying Theorem 2, one can verify the following:
1. The stable parameter regime for the all-excitatory pattern satisfies k1 + k2 > d1(+) = −min_{i∈A} r^{AB}(i) ≥ −1/N_A; and the stable parameter regime for the all-inhibitory pattern satisfies k1 − k2 < d2(−) = min_{i∈A} r^{AB}(i) ≤ 1/N_A.
Table 1: The General Principal Parameter Regimes.

Regime A (Fig. 1c): The all-excitatory connection pattern is the only attractor, except for Regime G.
Regime B (Fig. 1c): The all-inhibitory connection pattern is the only attractor, except for Regime G.
Regime C (Fig. 1c): The all-excitatory and the all-inhibitory connection patterns coexist with each other and with other stable structured aRFs.
Regime D (Fig. 1c): The all-excitatory and the all-inhibitory connection patterns are unstable. The stable structured aRFs may have separate parameter regimes when k2 is large and negative. Linsker's simulation results on the emergence of structured aRFs are obtained in this regime.
Regime E (Fig. 1a): Stable structured aRFs with positive average synaptic strength (e.g., an ON-center cell with large excitatory center radius r_core, or an OFF-center cell with small inhibitory r_core).
Regime F (Fig. 1b): Stable structured aRFs with negative average synaptic strength (e.g., an ON-center cell with small excitatory r_core, or an OFF-center cell with large inhibitory r_core).
Regime G = E's ∩ F's ∩ (A ∪ B ∪ C) (Fig. 1d): There exists a small coexistence regime of many stable aRFs around the origin of the (k1, k2) plane.
Regime H = E or F with c(w) = 0 (Fig. 1d): Several stable structured aRFs with c(w) = 0 might coexist.
Figure 1: A general division of the stable parameter regimes in the (k1, k2) plane. Each panel is described in Table 1. The segmentation is determined by the necessary and sufficient condition (stated in Theorem 2) that every stable saturated connection pattern between any two consecutive layers of Linsker's network must satisfy. The boundaries of the stable regime for any given saturated pattern between any two consecutive layers can be calculated exactly (using the necessary and sufficient condition) once the set of SDFs is chosen. Notice that for the weight development between any two consecutive layers L and M, the number of bands of E or F type in the (k1, k2) plane can range from 0 to 2^{N_L−1} and is determined by the set of SDFs under consideration (see Section 3.2).
2. If the SDF r^{AB}(·) is positive (e.g., gaussian with range r_B as in Linsker 1986), then every structured aRF w has a corresponding
J. Feng, H. Pan, and V. P. Roychowdhury
stable parameter regime that satisfies d2(w) = min_{i∈J−(w)} r^{AB}(i) > k1 + c(w)k2 > −min_{i∈J+(w)} r^{AB}(i) = d1(w). That is, the existence of a stable parameter regime for any aRF is independent of the third parameter r_B, and all 2^{N_A−1} possible structured aRFs coexist. Therefore, it is impossible for any particular structured aRF to emerge between layers A and B without regard to the initial condition. But for a developmental model like Linsker's network, it is expected that different aRFs should emerge under different sets of parameters and should be relatively insensitive to the initial conditions. In the deeper layers of Linsker's network, as demonstrated next, the incorporation of more parameters (i.e., more ranges of SDFs) allows one to make selected sets of aRFs unstable for certain choices of parameters, and thus to avoid the coexistence of the major aRFs.

3.2.2 Development of Connections between Layers B and C. For the weight development from layer B to C, as shown in Theorem 2, the existence of a weight pattern as an attractor of the dynamics equation 3.1 is solely determined by the ranges r_C and r_B of the SDFs r^{BC} and r^{AB}.9

1. As we noted in Section 3.1, the all-excitatory or the all-inhibitory weight pattern will emerge between layers B and C when k1 + k2 > d1(+) = −min_{1≤i≤N_B} Σ_{j∈B} r^{BC}(j) Σ_{l∈A} r^{AB}(i,l) r^{AB}(j,l) or k1 − k2 < d2(−) = min_{1≤i≤N_B} Σ_{j∈B} r^{BC}(j) Σ_{l∈A} r^{AB}(i,l) r^{AB}(j,l), respectively.
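The A-to-B specialization above follows because, with Q^A = I, the bracket in the k1-intercept functions reduces to ±r(i), so d2(w) > 0 > d1(w) for every structured w and a stable band always exists. A quick numerical check of this reduction (our own illustration; the sizes and the random positive SDF are assumptions):

```python
import numpy as np

# with Q^A = I: d1(w) = -min_{i in J+(w)} r(i) and d2(w) = min_{i in J-(w)} r(i)
rng = np.random.default_rng(3)
N = 10
r = np.abs(rng.normal(size=N)); r /= r.sum()   # a positive, normalized SDF
w = np.where(rng.random(N) < 0.5, 1.0, -1.0)
w[0], w[-1] = 1.0, -1.0                        # ensure w is structured (both signs)
plus, minus = w > 0, w < 0

Q = np.eye(N)
b = Q[:, minus] @ r[minus] - Q[:, plus] @ r[plus]
d1, d2 = b[plus].max(), b[minus].min()
print(np.isclose(d1, -r[plus].min()), np.isclose(d2, r[minus].min()), d2 > d1)
```

Since d2 > 0 > d1 for any positive SDF, every structured pattern keeps a nonempty band here, which is exactly the coexistence problem described in the text.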
9Without loss of generality, we assume that the connection strengths from layer A to B are all-excitatory, as in Linsker's simulations. Hence the covariance matrix Q^B = {Q^B_{ij}, i, j ∈ B} of layer B is a gaussian function with range √2 r_B if the SDF r^{AB}(·,·) between layers A and B is gaussian with range r_B. Linsker (1986) used various choices of the ratio r_C/r_B ranging from 3^{−1/2} to 10^{1/2}, with mostly r_C/r_B = √3 in his simulations. MacKay and Miller (1990a,b) left this ratio as a free parameter, and used the ratio of the range of Q^B to the range of the SDF r^{BC}, C/A = 2 r_B^2 / r_C^2 = 2/3, in their examples, which is equivalent to the ratio r_C/r_B = √3. The role played by the SDFs has not been addressed in the literature. We will use the ratio r_C/r_B to show the sectional drawings of the parameter subspace of (r_C, r_B) (see Figs. 2 and 3).
10We use a grid system in our examples. We assume that each C-cell receives synaptic inputs from 253 sites in layer B, where 253 is the total number of sites inside the circle with grid radius 9. We denote the radius of the central core of an ON-center or OFF-center aRF as r_core, one half of the maximum width of the excitatory area in a bi-lobed aRF as r_BL, and one half of the width of the excitatory central strip in an oriented aRF as r_width, and then label the corresponding aRFs simply as ON(r_core, 9), OFF(r_core, 9), BL(r_BL, 9), and OR(r_width, 9).
11By MacKay and Miller's notation for the eigenvectors, the all-positive (or all-negative) pattern, the center-surround pattern, and the bi-lobed pattern are labeled "1s," "2s," and "2p," respectively. The oriented pattern does not belong to the circularly
When r_B is sufficiently small, the situation for B-to-C weight development degenerates to the A-to-B case, where every aRF has a stable regime. But if the connectivity between layers A and B is all-to-all with a constant SDF, then there does not exist any regime of E or F type in the (k1, k2) plane for B-to-C at all. In general, for each kind of connection pattern, the ranges r_C and r_B have pattern-related critical values that define the stable parameter regimes for the corresponding patterns. Based on the results in Figures 2 and 3, we make the following observations: (a) For circularly symmetric ON-center (or OFF-center) cells,12 those patterns with a large ON-center (or OFF-center) core [e.g., ON(6,9), ON(7,9), ON(8,9), and their OFF-center counterparts in our examples] always have a stable parameter regime. That is, the emergence of these patterns is insensitive to the choice of r_C and r_B. But for those ON-center (or OFF-center) patterns with a smaller core, the stable parameter regimes in the (r_C, r_B) plane decrease in size with r_core. Thus the large-core ON-center patterns are dominant in the regimes E∩D of the (k1, k2) plane [i.e., with positive c(w)], in the sense that the other major aRFs (including "2s" with small core, "2p," tri-lobed, etc.) can be made unstable by choosing appropriate values of r_C and r_B. Similarly, the large-core OFF-center patterns are dominant in the regimes F∩D of the (k1, k2) plane [i.e., with negative c(w)] in the same sense. Thus, between layers B and C, the regime H∩D [with near-zero c(w)] is the only window in the (k1, k2) plane where there exists the opportunity for emerging weight patterns other than the ON-(OFF-)center type.
(b) For the bi-lobed and the noncircularly symmetric oriented patterns, only those patterns with small average weight strength |c(w)| have a wide stable parameter regime in the (r_C, r_B) plane, which can match the regimes for the relatively dominant ON-center (or OFF-center) patterns in size, and occupy regimes of (r_C, r_B) different from those of the ON-center (or OFF-center) patterns in
symmetric systems introduced in MacKay and Miller (1990a,b), and is not observed in Linsker's B-to-C simulation, but is shown to emerge in deeper layers.
12Notice that for every ON-center cell w, with excitatory connections inside the circle with radius r_core and inhibitory connections outside that circle, there exists a corresponding OFF-center cell w', with inhibitory connections inside the circle with the same radius r_core and excitatory connections outside that circle. Since d1(w') = −d2(w) and d2(w') = −d1(w), then d2(w') − d1(w') = d2(w) − d1(w). That is, the stable and unstable parameter regimes for w and w' are the same in the (r_C, r_B) subspace. If w and w' are stable (i.e., d2(w') − d1(w') = d2(w) − d1(w) > 0), then they will also appear in the (k1, k2) plane, as an E-type band and an F-type band, respectively. Note that these two bands will have the same width but opposite slope values (since c(w') = −c(w)). Therefore, we need to consider only ON-center cells, because of the symmetry between the slope and intercept functions of OFF-center and ON-center cells.
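The ON/OFF symmetry of footnote 12 follows directly from the definitions of c, d1, and d2, and it is easy to verify numerically. In this sketch (our own illustration with a random covariance and SDF, not the paper's gaussian setup), w' = −w plays the role of the OFF-center counterpart:

```python
import numpy as np

def c_d1_d2(w, Q, r):
    # slope c(w) and k1-intercepts d1(w), d2(w) of a saturated pattern w
    plus, minus = w > 0, w < 0
    b = Q[:, minus] @ r[minus] - Q[:, plus] @ r[plus]
    return (r[plus].sum() - r[minus].sum(),
            b[plus].max() if plus.any() else -np.inf,
            b[minus].min() if minus.any() else np.inf)

rng = np.random.default_rng(0)
N = 12
A = rng.normal(size=(N, N))
Q = A @ A.T                                   # a generic symmetric covariance
r = np.abs(rng.normal(size=N))                # positive SDF values (assumption)
w = np.where(rng.random(N) < 0.5, 1.0, -1.0)
w[0], w[-1] = 1.0, -1.0                       # make sure w is structured

c, d1, d2 = c_d1_d2(w, Q, r)
cp, d1p, d2p = c_d1_d2(-w, Q, r)              # the complementary pattern w'
print(np.isclose(cp, -c), np.isclose(d1p, -d2), np.isclose(d2p, -d1))
```

Because negating w swaps J+(w) and J−(w) and flips the sign of the bracket term, the three relations c(w') = −c(w), d1(w') = −d2(w), and d2(w') = −d1(w) hold exactly.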
[Figure 2 panels: the k1-intercept functions of the ON-center cells ON(1,9), ON(3,9), ON(5,9), and ON(7,9), plotted against the range r_C.]
Figure 2: Calculating the stable parameter regimes for B-to-C ON-center aRFs from the k1-intercept functions (here ○ represents d1(w, r_C, r_B) with r_C/r_B = √3, and △ represents d2(w, r_C, r_B) with r_C/r_B = √3). For any aRF w, the existence of a stable parameter regime is determined by the two k1-intercept functions d1(w, r_C, r_B) and d2(w, r_C, r_B). For the calculations of this figure, we fixed the ratio r_C/r_B to √3 (as done in Linsker's simulations), and d1(w, r_C, r_B) and d2(w, r_C, r_B) are calculated as two functions of r_C. An aRF is a stable attractor if and only if d2(w, r_C, r_B) > d1(w, r_C, r_B). We shall call the value of r_C at which d2(w, r_C, r_B) − d1(w, r_C, r_B) turns from positive to negative the critical value of r_C. Space constrains us to show here only 4 cases out of the 8 kinds of B-to-C ON-center cells studied in Figure 3. When r_C is larger than the critical value for an aRF [r_C ≥ 1.541 for ON(1,9), r_C ≥ 2.358 for ON(2,9), r_C ≥ 4.321 for ON(3,9), r_C ≥ 6.505 for ON(4,9), and r_C ≥ 14.925 for ON(5,9)], the corresponding aRF will no longer be an attractor of equation 3.1. Note that there is no critical value of r_C for ON(7,9) [nor for ON(6,9) and ON(8,9)]; i.e., a stable regime of (k1, k2) always exists for these aRFs when r_C/r_B = √3 and r_C ≤ 100.
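The sign-change scan behind these critical values is straightforward to reproduce. The sketch below is our own illustration: it uses a one-dimensional grid rather than the 253-site two-dimensional grid of footnote 10, and it assumes particular gaussian forms for Q^B and the SDF, so its numbers will not match the figure:

```python
import numpy as np

# 1-D caricature of the scan behind Figure 2
x = np.arange(-9, 10, dtype=float)            # sites of layer B on a line
w = np.where(np.abs(x) <= 3, 1.0, -1.0)       # an ON-center pattern, core radius 3
plus, minus = w > 0, w < 0

def gap(rC, ratio=np.sqrt(3)):
    """d2(w, rC, rB) - d1(w, rC, rB) at the fixed ratio rC/rB used by Linsker."""
    rB = rC / ratio
    Q = np.exp(-(x[:, None] - x[None, :])**2 / (2.0 * rB**2))  # gaussian Q^B
    r = np.exp(-x**2 / rC**2); r /= r.sum()                    # gaussian SDF
    b = Q[:, minus] @ r[minus] - Q[:, plus] @ r[plus]
    return b[minus].min() - b[plus].max()      # d2 - d1

rCs = np.linspace(0.5, 20.0, 200)
gaps = np.array([gap(rC) for rC in rCs])
crossing = np.where((gaps[:-1] > 0) & (gaps[1:] <= 0))[0]
print("critical rC ~", rCs[crossing[0] + 1] if crossing.size else "none found")
```

The pattern is an attractor for some (k1, k2) exactly while the printed gap stays positive; the first positive-to-negative crossing of the scan is the critical value of r_C.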
Figure 3: The parameter subspace of (r_C, r_B) for various connection patterns. The textured area in each plot is the stable parameter regime of (r_C, r_B) in which d2(w, r_C, r_B) > d1(w, r_C, r_B), i.e., there is a corresponding stable regime in the (k1, k2) subspace. If the choice of r_C and r_B lies outside the shaded region, then the corresponding pattern is unstable, and it cannot be observed in simulations irrespective of the choices of k1, k2, and the initial conditions. In Linsker's simulations, the ratio r_C/r_B was mostly fixed to √3 and was also varied to span the range from 3^{−1/2} to 10^{1/2}; the corresponding range is shown as the area between the two dashed lines of the (r_C, r_B) plane for ON(1,9) or OFF(1,9), and the often-used ratio r_C/r_B = √3 is shown as a dotted line within that area. The plots in Figure 2 are the cutaway views along the dotted lines in the (r_C, r_B) plane shown here. The critical values of (r_C, r_B) for the case r_C/r_B = √3 are marked for the ON-center and OFF-center aRFs. For the other weight patterns corresponding to the eigenfunctions labeled "3s," "3p," "3d," etc. in MacKay and Miller (1990a,b), their stable regimes in the (r_C, r_B) plane are found to be in the same regime as shown in the case of OR(7,9), and even much smaller (i.e., only stable when r_B is sufficiently small).
Figure 3: Continued
most areas of the (r_C, r_B) plane, except for the case when r_B is small. At the same time, the oriented (also referred to as tri-lobed, grating-like) patterns are always overshadowed by either the center-surround patterns or the bi-lobed patterns and have no exclusive stable area in the (r_C, r_B) plane, although they may coexist with those circularly symmetric patterns in certain parameter regimes. But it is important to notice that there do exist parameter areas of (r_C, r_B) in which the bi-lobed patterns [e.g., BL(4,9) and BL(5,9) in our examples] become dominant, while the center-surround and the oriented patterns are unstable. Figure 3 also shows that this character is relatively insensitive to the ratio r_C/r_B (e.g., from 3^{−1/2} to 10^{1/2}) but depends on the specific values of r_C and r_B chosen in simulation. Moreover, since the average synaptic strength c(w) of these bi-lobed patterns is near zero, their stable parameter regimes can be exclusive from the others' in the (k1, k2, r_C, r_B) space. (c) We have so far discussed the relative stability of only a few types of aRFs; however, for the weight development between
layers B and C, there are 2^{N_B−1} kinds of possible saturated structured aRFs. Obviously, it is unrealistic to test all of them to ensure that a designated pattern has an exclusive stable parameter regime. Fortunately, it is unnecessary to do so because of previous results on the dynamics (Linsker 1986, 1988), which show that there are only a few types of saturated attractors that are dominant for certain choices of parameters. With the same procedure shown in Figures 2 and 3, we have tested all possible variations of the other major eigenfunctions "3s," "3p," and "3d" mentioned in MacKay and Miller (1990a,b), and found that none of them has a stable area in the (r_C, r_B) plane beyond the area where r_B is small. It is not farfetched to predict that other patterns are unlikely to have stable areas in the (r_C, r_B) plane shared with the major patterns, except for the area with small r_B. Even if some patterns other than those we have studied do have overlapping stable regimes with the major patterns, the results from the linearization approach show that it is unlikely to observe the former in simulations, since the latter will be dominant. (d) We conclude our discussion with a brief description of the (k1, k2) parameter subspace. As illustrated in Figure 1, in the (k1, k2) subspace for the B-to-C dynamics, the presence or absence of a band corresponding to a structured aRF (Regime type E or F) is determined by the choices of (r_C, r_B). Clearly, the ranges of SDFs have to be chosen appropriately, so that a Regime E or F for a dominant structured aRF with a certain slope value no longer coexists with any other major aRF. Next, the appropriate choices of the parameters k1 and k2 to ensure the potential emergence of designated aRFs can be based on the following observations:
- (k1, k2) must be in Regime D: only within Regime D are the various Regime E's and Regime F's corresponding to the various stable structured aRFs removed from Regimes A, B, C, and G, where the all-excitatory and all-inhibitory aRFs are dominant or many kinds of attractors coexist.

- k2 must be large and negative: when k2 is chosen to be large and negative, the Regime E's and Regime F's with different slope values will be separate from each other within Regime D.
- −k1/k2 ≈ c(w): to ensure the potential occurrence of a desired structured aRF w, (k1, k2) must be in the band corresponding to w [i.e., d2(w) > k1 + c(w)k2 > d1(w)], and if d1(w) ≈ d2(w), then −k1/k2 is approximately equal to the average synaptic strength c(w).
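The three observations above fold into a small recipe for placing (k1, k2): pick a large negative k2 and center k1 in the band for the designated aRF. A hypothetical numerical illustration (the values of c, d1, and d2 below are made up, not taken from the paper):

```python
def pick_k1(c, d1, d2, k2):
    """Center (k1, k2) in the band d2 > k1 + c*k2 > d1 (illustrative helper)."""
    return 0.5 * (d1 + d2) - c * k2

# hypothetical slope and intercepts for a designated structured aRF w
c, d1, d2 = 0.2, -0.05, 0.05
k2 = -50.0                       # large and negative, as recommended above
k1 = pick_k1(c, d1, d2, k2)
print(d2 > k1 + c * k2 > d1, -k1 / k2)   # in the band; -k1/k2 ~ c(w)
```

When the band is centered at zero (d1 = −d2), the chosen k1 satisfies k1 = −c·k2 exactly, so −k1/k2 recovers c(w), as the third observation states.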
The preceding descriptions of the choices of k1 and k2 have also been derived from the linearization approach in MacKay and Miller (1990a,b). The analysis in this paper, however, allows one to explore the whole parameter space, including the critical roles played by the ranges of the SDFs. In practice, the parameter-space approach introduced in this paper and the conventional methods could be used in a complementary fashion. For example, the linearization method can be employed to identify a set of dominant patterns (see MacKay and Miller 1990a,b; Miller and Stryker 1989; Miller 1996). If two or more patterns do coexist for a given set of parameters, then the linearization method could be used to derive some information on the dynamical process of their occurrence. The parameter-space method, on the other hand, can test the stability of any designated pattern for any given choice of parameters, and provide the stable and unstable parameter regimes for any set of patterns under consideration. This information could be used, for example, to resolve the coexistence among a set of dominant patterns (as demonstrated for Linsker's model) and to choose parameters to ensure the potential occurrence of designated patterns without performing extensive simulations.
Acknowledgments

The work of V. P. Roychowdhury and H. Pan was supported in part by the General Motors Faculty Fellowship and by NSF Grant ECS-9308814. J. Feng was partially supported by the A. v. Humboldt Foundation. We thank the anonymous reviewers and Zili Liu for helpful comments on the manuscript.
References

Anderson, J. A., Silverstein, J. W., Ritz, S. A., and Jones, R. S. 1977. Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychol. Rev. 84, 413-451.
Erwin, E., Obermayer, K., and Schulten, K. 1995. Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neural Comp. 7, 425-468.
Feng, J., and Pan, H. 1993. Analysis of Linsker-type Hebbian learning: Rigorous results. Proceedings of the 1993 IEEE International Conference on Neural Networks, San Francisco, pp. III: 1516-1521. IEEE, Piscataway, NJ.
Feng, J., Pan, H., and Roychowdhury, V. P. 1995a. A rigorous analysis of Linsker-type Hebbian learning. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and T. K. Leen, eds., pp. 319-326. MIT Press, Cambridge, MA.
Feng, J., Pan, H., and Roychowdhury, V. P. 1995b. Linsker-Type Hebbian Learning: A Qualitative Analysis on the Parameter Space. Tech. Rep. TR-EE 95-12. School of Electrical and Computer Engineering, Purdue University.
Golden, R. M. 1986. The "Brain-State-in-a-Box" neural model is a gradient descent algorithm. J. Math. Psychol. 30, 73-80.
Golden, R. M. 1993. Stability and optimization analysis of the generalized Brain-State-in-a-Box neural network model. J. Math. Psychol. 37, 282-298.
Goles, E., and Martinez, S. 1990. Neural and Automata Networks: Dynamical Behavior and Applications, pp. 154-166. Kluwer Academic, Dordrecht, The Netherlands.
Greenberg, H. J. 1988. Equilibria of the Brain-State-in-a-Box (BSB) neural model. Neural Networks 1, 323-324.
Hui, S., and Zak, S. H. 1992. Dynamical analysis of the Brain-State-in-a-Box (BSB) neural models. IEEE Trans. Neural Networks 3(1), 86-94.
Linsker, R. 1986. From basic network principles to neural architecture (series). Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512, 8390-8394, 8779-8783.
Linsker, R. 1988a. Development of feature-analyzing cells and their columnar organization in a layered self-adaptive network. In Computer Simulation in Brain Science, R. Cotterill, ed., pp. 416-431. Cambridge University Press, Cambridge.
Linsker, R. 1988b. Self-organization in a perceptual network. Computer 21(3), 105-117.
MacKay, D. J. C., and Miller, K. D. 1990a. Analysis of Linsker's application of Hebbian rules to linear networks. Network 1, 257-297.
MacKay, D. J. C., and Miller, K. D. 1990b. Analysis of Linsker's simulations of Hebbian rules. Neural Comp. 2, 173-187.
Miller, K. D. 1990. Derivation of linear Hebbian equations from a nonlinear Hebbian model of synaptic plasticity. Neural Comp. 2, 321-333.
Miller, K. D. 1996. Receptive fields and maps in the visual cortex: Models of ocular dominance and orientation columns. In Models of Neural Networks III, E. Domany, J. L. van Hemmen, and K. Schulten, eds., pp. 55-78. Springer-Verlag, New York.
Miller, K. D., and MacKay, D. J. C. 1994. The role of constraints in Hebbian learning. Neural Comp. 6, 100-126.
Miller, K. D., and Stryker, M. P. 1989. The development of ocular dominance columns: Mechanisms and models. In Connectionist Modeling and Brain Function: The Developing Interface, S. J. Hanson and C. R. Olson, eds., pp. 255-350. MIT Press, Cambridge, MA.
Tang, D. S. 1989. Information-theoretic solutions to early visual information processing: Analytic results. Phys. Rev. A 40, 6626-6635.
Received May 1, 1995; accepted January 5, 1996.
Communicated by Geoffrey Goodhill
Effect of Binocular Cortical Misalignment on Ocular Dominance and Orientation Selectivity

Harel Shouval, Nathan Intrator,* C. Charles Law, and Leon N Cooper
Departments of Physics and Neuroscience and The Institute for Brain and Neural Systems, Box 1843, Brown University, Providence, RI 02912 USA
We model a two-eye visual environment composed of natural images and study its effect on single cell synaptic modification. In particular, we study the effect of binocular cortical misalignment on receptive field formation after eye opening. We show that binocular misalignment affects principal component analysis (PCA) and Bienenstock, Cooper, and Munro (BCM) learning in different ways. For the BCM learning rule this misalignment is sufficient to produce varying degrees of ocular dominance, whereas for PCA learning binocular neurons emerge in every case.

1 Introduction
It is now generally accepted that receptive fields in the visual cortex of cats are dramatically influenced by the visual environment (for a comprehensive review see Frégnac and Imbert 1984). In normally reared animals, the population of sharply tuned neurons increases monotonically, whereas for dark-reared animals it initially increases, but then almost disappears (see, for example, Imbert and Buisseret 1975). Ocular dominance is dramatically influenced by such manipulations as monocular deprivation (Wiesel and Hubel 1963) or reverse suture (Blakemore and van Sluyters 1974; Mioche and Singer 1989). It has even been shown that preferred orientations can be directly altered by pairing the preferred orientation with a negative current, and the nonpreferred orientation with a positive current (Frégnac et al. 1992).

The issue of cortical input misalignment and its relation to receptive field development has been studied by Pettigrew (1974). He found that in area 17 of young kittens, there is a large misalignment between the receptive fields of the two eyes, of as much as 30°. The misalignment in normally reared kittens decreases to an average of 5°1 within 5 weeks from birth, and remains around that level through adulthood (Nikara et al. 1968; Joshua and Bishop 1970). Binocularly deprived kittens retain a high degree of misalignment. Furthermore, the reduction in misalignment occurs concurrently with the development of orientation selectivity, ocular dominance, and disparity tuning (Movshon and van Sluyters 1981). Thus, it follows that modeling of receptive field development should take this misalignment into account. Blakemore and van Sluyters (1975) suggested that plasticity may be needed in animals with binocular vision to overcome a developmental misalignment between cortical inputs. van Sluyters (1977) and van Sluyters and Levitt (1980) tested the effect of prismatic deviation between the eyes. They found that with a small deviation, kittens showed normal binocularity, while larger deviations reduced binocularity. The relationship between ocular dominance, orientation selectivity, and disparity was studied in cats by LeVay and Voigt (1988). They found that no relationship could be established between a cell's best orientation and its ocular dominance or any aspect of its disparity tuning. There was also no relation between ocular dominance and sensitivity to disparity, while ocular dominance and best disparity were related: binocular cells were mostly tuned to zero disparity, while more monocular cells were tuned to a broader distribution of best disparities.

Different models attempting to explain how cortical receptive fields evolve have been proposed over the years (von der Malsburg 1973; Nass and Cooper 1975; Perez et al. 1975; Sejnowski 1977; Bienenstock et al. 1982; Linsker 1986; Miller 1994). Such models are composed of several components: the exact nature of the learning rule, the representation of the visual environment, and the architecture of the network. Most of these models assume a simplified representation of the visual environment (e.g., von der Malsburg 1973), or replace the visual environment by a second-order correlation function (Miller 1994). A variant of the Hebbian learning rule with subtractive decay and a visual environment represented by a second-order correlation matrix has been shown to achieve monocular receptive fields (Miller et al. 1989). With a different set of parameters, this learning rule develops orientation-selective cells as well (Miller 1994). Dayan and Goodhill (1992) have shown that with a uniform positive correlation between corresponding pixels in the left and right eyes, only binocular cells emerge. Berns et al. (1993) performed simulations using correlations between simplified one-dimensional inputs to both eyes. They were able to obtain deviations from totally binocular cells using two phases of learning and sticky saturation bounds. In the first learning phase (prenatal) there was no correlation between the eyes; in the second phase (postnatal) the two eyes were positively correlated. During the first phase monocular receptive fields

*Also at the Faculty of Exact Sciences, Tel-Aviv University, Tel-Aviv, Israel.
1Some examples near the fovea [e.g., Fig. 3 of Pettigrew (1974)] show a much smaller misalignment, which is smaller than the receptive field size.

Neural Computation 8, 1021-1040 (1996) © 1996 Massachusetts Institute of Technology
developed, and did not become totally binocular in the second phase, due to the saturation bounds.

A recent paper by Erwin et al. (1995) compared the predictions of several cortical plasticity models to experimental results. This comparison applied a different visual environment for each model, mostly of a simplified low-dimensional type, or of a type described only by a second-order correlation function, but did not use an environment composed of natural images. They found that some of the symbolic input models simultaneously developed orientation selectivity and varying degrees of ocular dominance. No misalignment between the two eyes, or between cortical inputs, was considered directly.

Realistic representations of the visual environment have only very recently been considered (Hancock et al. 1992; Law and Cooper 1994; Liu and Shouval 1994; Shouval and Liu 1996), and only in recent years have the statistics of natural images been studied and used for predicting receptive field properties (Field 1987, 1989; Baddeley and Hancock 1991; Atick and Redlich 1992; Ruderman and Bialek 1994; Liu and Shouval 1994; Shouval and Liu 1996). Baddeley and Hancock (1991) performed simulations of the principal component (PC) learning rule in a visual environment composed of natural images. They found that the first PC is radially symmetric and that the second one is orientation selective and horizontal. This is due to a slight bias toward the horizontal direction in the correlation function of natural images. Liu and Shouval (1994) analyzed this situation and showed that it depends on the scale-invariant nature of the power spectrum of natural images (Field 1987). They also analyzed the situation with retinal preprocessing (Shouval and Liu 1996). Due to the preprocessing, the first principal component may become oriented; however, it was always found to be horizontal.
Recently the same type of realistic environment was used for simulations of the BCM rule (Law and Cooper 1994). They found receptive fields selective to all orientations. Once actual visual scenes are used, it is possible to realistically represent two-eye input, and to account for the fact that the two eyes are not looking at exactly the same visual scene. For example, Li and Atick (1994) used natural images to extract detailed two-eye power spectra from stereo images. They have used these results to predict properties of cortical receptive fields. In this paper, we study the effect of a fixed synaptic density (arbor function) misalignment between the cortical inputs coming from both eyes on the development of ocular dominance and orientation selectivity. This type of misalignment may be caused by an imprecise developmental alignment of the arbor functions from both eyes. We compare the outcomes that result from two learning rules: PCA (Oja 1982) and BCM (Bienenstock et al. 1982; Intrator and Cooper 1992). We have chosen to examine these two because they are well defined and have stable fixed points. Many other proposed learning rules (Sejnowski
H. Shouval et al.
1024
1977; Linsker 1986; Miller et al. 1989, for example) are closely related to the PCA rule. Their outcome depends only on first- and second-order statistics. The BCM rule, in contrast, also depends on third-order statistics. We show that binocular misalignment affects these two learning rules in a different manner. For the BCM learning rule, misalignment is sufficient to produce varying degrees of ocular dominance, whereas for the PCA learning rule, binocular neurons emerge independent of the misalignment.
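The distinction between second- and third-order statistics can be illustrated with a toy example (our own construction, not from the paper): two input distributions can share identical means and variances, and hence be indistinguishable to any rule driven only by second-order statistics, while differing in their third central moments.

```python
# Toy illustration (not from the paper): two data sets with identical
# first- and second-order statistics but different third-order statistics.
def moments(xs):
    """Return (mean, second central moment, third central moment)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n   # variance
    m3 = sum((x - mean) ** 3 for x in xs) / n   # skew-carrying moment
    return mean, m2, m3

a = [-1.0, -1.0, 2.0]   # skewed one way
b = [1.0, 1.0, -2.0]    # mirror image: skewed the other way

print(moments(a))  # (0.0, 2.0, 2.0)
print(moments(b))  # (0.0, 2.0, -2.0)
```

A PCA-type rule, which sees only the correlation (second-order) structure, treats these two environments identically; a BCM-type rule, sensitive to third-order statistics, can separate them.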
2 A Binocular Visual Environment Composed of Natural Images
We have used a set of 24 natural scenes. These pictures were taken at Lincoln Woods State Park and scanned into 256 × 256 pixel images. No corrections were made for the optical distortions of the instruments. We have avoided man-made objects, because they have many sharp edges and straight lines, which make it easier to achieve oriented receptive fields. In this paper, we limit ourselves to studying receptive field formation near the fovea, and thus do not model the change in resolution that corresponds to the complex-log retinotopic mapping. The anatomy of the visual pathway is such that light that falls on the retina is encoded by the receptors. The signal is then processed by the retinal circuitry and is projected by the ganglion cells onto the optic nerve. Signals from both eyes cross at the optic chiasm and continue to the lateral geniculate nucleus (LGN). In the LGN, inputs from the two eyes are segregated in different layers. From the LGN, signals project up to the visual cortex. The receptive fields of both ganglion cells and LGN projections have a center-surround shape (Fig. 1). Some are "on center," which means that they are excited by spots of light falling on their centers and inhibited by light on the surround; others are "off center" and are excited by light falling on their surround and inhibited by light in their center. In the cat, the most abundant cell type near the fovea is the X cell. These cells are linear, i.e., their response to an image that is composed of several components is roughly the sum of the responses to the independent components (Orban 1984).
The effect of the retinal preprocessing is modeled by convolving the images with a difference of gaussians (DOG) filter, with a center radius of one pixel (σ1 = 1.0) and a surround radius of three (σ2 = 3.0).² The effect of this preprocessing is shown in Figure 1.

²This ratio between the center and surround is biologically plausible (e.g., Enroth-Cugell and Robson 1966).
Figure 1: Three of the natural images used (top), the difference of gaussians filter (middle), and the processed images (bottom). As illustrated in Figure 2, the input vectors from both eyes are chosen as small, partially overlapping, circular regions of the preprocessed natural images; these converge on the same cortical cell. The inputs from the left and right eyes are denoted by d^l and d^r, and the output of the cortical neuron then becomes
c = σ(d^l · m^l + d^r · m^r)

(2.1)

where σ is the nonlinear activation function of each neuron, and m^l and m^r are the left and right synaptic weight vectors, respectively. We have used a nonsymmetric activation function to account for the fact that neuronal
Figure 2: Schematic diagram of the two-eye model, including the visual input preprocessing. The overlap parameter O is defined as O = s/2a. When O = 1 the receptive fields are completely overlapping; when O = 0 they are nonoverlapping but touching; O < 0 means that they are nonoverlapping and not touching.
activity as measured from spontaneous activity has a longer way to go up than down to zero activity. In order to examine the effect of varying the overlap between the receptive fields, we define an overlap parameter O = s/2a, where a is the receptive field radius in pixels and s is the linear overlap in pixels, as shown in Figure 2. O = 1 when the left and right receptive fields are completely overlapping, and O ≤ 0 when the receptive fields are completely separated. We are interested in the dependency between ocular dominance and the degree of misalignment between the left and right eyes. First, we measure the maximal responses of the left and right eyes, termed L and R. This is done by finding the optimal spatial grating frequency, and then finding the response at the optimal orientation with this grating for each eye. We consider the following ocular dominance measure B, which is based on the left and right eye responses:
B = (L − R)/(L + R)

This measure was motivated by the one used by Albus (1975) in defining the bin boundaries for a seven-bin ocular dominance histogram.³ Our bin boundaries are given by −0.85, −0.5, −0.15, 0.15, 0.5, 0.85.

3 Cortical Plasticity Learning Rules
We have employed these realistic visual inputs to test two of the leading visual cortical plasticity rules: principal components analysis (PCA) and the Bienenstock, Cooper, and Munro (BCM) model. The two algorithms differ in their information extraction properties, as discussed in Intrator and Cooper (1992); PCA extracts second-order statistics from the visual environment, while BCM extracts information contained in third-order statistics as well. There are other Hebbian learning rules that are related to PCA. These models may produce somewhat different results, but are not studied here. 3.1 Principal Components Analysis. PCA is one of the most widely used feature extraction methods for pattern recognition tasks. PCA features are those orthogonal directions that maximize the variance of the projected distribution of the data. They also minimize the mean squared error between the data and a linearly reconstructed version of it based on these projections. Principal components are optimal when the goal is to accurately reconstruct the inputs. They are not necessarily optimal when the goal is classification and the data are not normally distributed (see, for example, p. 212, Duda and Hart 1973). A simple interpretation of the Hebbian learning rule is that, with appropriate stabilizing constraints, it leads to the extraction or approximation of principal components. This has often been modeled (see, for example, von der Malsburg 1973; Sejnowski 1977; Oja 1982; Linsker 1986; Miller et al. 1989). The learning rule that we use was proposed by Oja (1982) and has the form
ṁ_i = η[d_i c − c² m_i]

(3.1)
where d_i is the presynaptic activity at synapse i, c is the postsynaptic activity, and m_i is the strength of the synaptic efficacy of junction i; η is a small learning rate. This learning rule has been shown to converge to the principal component of the data. 3.2 BCM Learning Rule. The BCM theory (Bienenstock et al. 1982) was introduced to account for the striking dependence of the sharpness of orientation selectivity on the visual environment. We shall be using a
³Since there is always some activity from both eyes, we have extended bins 1 and 7 slightly with respect to those used by Albus.
variation, due to Intrator and Cooper (1992), for a nonlinear neuron with a nonsymmetric sigmoidal transfer function. Using the above notation, synaptic modification is given by

ṁ_i = η φ(c, Θ_M) d_i

(3.2)

where the neuronal activity is given by c = σ(m · d), φ(c, Θ_M) = c(c − Θ_M), and Θ_M is a nonlinear function of some time-averaged measure of cell activity, which in its simplest form is

Θ_M = E[c²]

(3.3)

where E denotes the expectation over the visual environment. The transfer function σ is nonsymmetric around 0 to account for the fact that cortical neurons show a low spontaneous activity, and can thus fire at a much higher rate relative to the spontaneous rate, but can go only slightly below it.⁴ We have tested several sigmoidal functions to verify the important features needed for robust results. It has been shown (Intrator and Cooper 1992) that this version of the BCM learning rule leads to a neuron that seeks multimodality in the projected distribution (rather than simply maximizing the variance) and is thus suitable for finding clusters in high-dimensional space. Simulations using this learning rule were found to be in agreement with many experimental results on visual cortical plasticity (Clothiaux et al. 1991; Law and Cooper 1994). A network implementation of this neuron, which can find several projections in parallel while retaining its computational efficiency, was found applicable for extracting features from very high-dimensional vector spaces (Intrator et al. 1991; Intrator 1992). We have used the modification rule used by Law and Cooper (1994) with a variable learning rate:
which produced faster convergence with qualitatively similar results.

4 Results
For the results reported here we used fixed circular receptive fields with diameters of 20 pixels. We tested the robustness of the results for receptive fields of sizes 10 to 30 pixels and found no qualitative difference. 4.1 Completely Overlapping Receptive Fields. With completely overlapping receptive fields, BCM neurons develop various orientation preferences, all highly selective. A typical example of such receptive fields and orientation selectivity is presented in Figure 3. A less typical

⁴The actual sigmoid used in the simulations is (e^x − e^−x)/(0.1e^x + 3e^−x).
Figure 3: (Top) Receptive fields for a BCM neuron with completely overlapping inputs. Tuning curves are selective and similar in both eyes. (Bottom) Receptive fields for completely overlapping inputs using the PCA rule.
result would be a slight ocular preference with high orientation selectivity. It should be noted that high orientation selectivity is obtained for a single neuron with no need to introduce lateral inhibition between neurons; cells produce receptive fields of all orientations,⁵ similar to simple cell receptive fields observed in visual cortex (Jones and Palmer 1987; see Kandel and Schwartz 1991, for a review). These results are in sharp contrast to those of a PCA neuron. PCA neurons developed receptive fields with horizontal orientation selectivity only (Fig. 3). Orientation selectivity in PCA neurons depends on the retinal preprocessing. When PCA neurons are trained on raw natural images, the dominant solution is radially symmetric (Liu and Shouval 1994). However, when retinal preprocessing is included, oriented solutions can be attained (Shouval and Liu 1996). The strong preference for the horizontal direction is due to a slight bias toward this direction in the correlation function of natural images (Baddeley and Hancock 1991; Shouval and Liu 1996). If the images are rotated by an angle θ, then so are the preferred orientations of the PCA neurons. The results hold for both a linear and a nonlinear (sigmoidal) neuron. The results are also invariant to a sign change (weight vectors m and −m are both eigenvectors).

⁵Although all orientations are represented, the probability of attaining different orientations may differ (Law and Cooper 1994).
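The Oja rule of equation 3.1, whose outcomes are discussed above, can be sketched and sanity-checked on synthetic data. The two-dimensional input distribution below is our own toy construction, chosen so that the principal component is known in advance; the learning rate and iteration count are likewise illustrative choices.

```python
import math
import random

def oja_step(m, d, eta=0.01):
    """One update of equation 3.1: dm_i = eta * (d_i * c - c^2 * m_i)."""
    c = sum(mi * di for mi, di in zip(m, d))   # postsynaptic activity
    return [mi + eta * (di * c - c * c * mi) for mi, di in zip(m, d)]

random.seed(0)
axis = (0.6, 0.8)          # inputs have largest variance along this direction
m = [0.5, 0.1]
for _ in range(20000):
    a = random.gauss(0.0, 1.0)                 # strong component along `axis`
    n1, n2 = random.gauss(0, 0.1), random.gauss(0, 0.1)  # isotropic noise
    m = oja_step(m, (a * axis[0] + n1, a * axis[1] + n2))

norm = math.sqrt(sum(x * x for x in m))
align = abs(m[0] * axis[0] + m[1] * axis[1]) / norm
print(norm, align)   # weight vector ends up near unit length, aligned with axis
```

The run illustrates the two properties cited in the text: the weights converge to (approximately) unit norm, and their direction converges to the principal component of the input distribution.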
Figure 4: Receptive fields with a small overlap (O = 0.2) using the BCM rule. Results vary from fully binocular cells with moderate orientation tuning (top), to less binocular cells with well-defined receptive fields as well as high orientation selectivity in one eye (middle), and finally, monocular highly selective cells (bottom). 4.2 Partially and Nonoverlapping Receptive Fields. BCM neurons acquire selectivity to various orientations in both the partially and the nonoverlapping cases. When cortical inputs are misaligned, various ocular dominance preferences may occur (Fig. 4), even for the same overlap. This result stands in sharp contrast to the one obtained by PCA neurons; only binocular neurons emerge under the PCA rule. The BCM receptive field formation results are summarized in Figures 5 and 6. Misalignment of the cortical inputs does not affect orientation selectivity of the dominant eye, but does produce varying degrees of ocular dominance, depending upon the degree of overlap between the receptive fields. The main result is that ocular dominance (even for single-cell simulations) depends strongly on the degree of overlap between the visual inputs to the two eyes. Figure 6 presents a box-plot summary of 700 runs showing the dependence of ocular dominance on visual input overlap. It is evident that binocularity as well as the spread of ocular dominance increase as the degree of overlap increases.
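The BCM dynamics of equations 3.2 and 3.3 that produce the selectivity discussed above can be sketched on a toy two-pattern environment (our own construction, not the paper's natural-image setup). For simplicity the neuron is linear (σ is the identity) and Θ_M is tracked by a running average of c²; with two equiprobable orthogonal patterns, the stable fixed points are selective, responding strongly to one pattern and hardly at all to the other.

```python
import random

def bcm_step(m, d, theta, eta=0.02, tau=0.05):
    """One BCM update: dm_i = eta * phi(c, theta) * d_i,
    with phi(c, theta) = c * (c - theta) and theta a running average of c^2."""
    c = sum(mi * di for mi, di in zip(m, d))   # linear neuron: sigma = identity
    phi = c * (c - theta)
    m = [mi + eta * phi * di for mi, di in zip(m, d)]
    theta = (1 - tau) * theta + tau * c * c    # low-pass version of Theta_M = E[c^2]
    return m, theta

random.seed(1)
patterns = [(1.0, 0.0), (0.0, 1.0)]            # two orthogonal input patterns
m, theta = [0.4, 0.3], 0.1
for _ in range(20000):
    m, theta = bcm_step(m, random.choice(patterns), theta)

responses = sorted(sum(mi * di for mi, di in zip(m, d)) for d in patterns)
print(responses)   # one response near 0, the other near 2: a selective fixed point
```

The selective fixed point here has responses (2, 0): with equal pattern probabilities Θ_M = (2² + 0²)/2 = 2, so φ vanishes at both c = 2 and c = 0, consistent with the selectivity (rather than variance maximization) that the text attributes to BCM.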
Figure 5: BCM neurons with different overlap values: O = 1, 0.6, 0.2, −0.2 from top to bottom. The ocular dominance histograms summarize the ocular dominance of 100 cells at each overlap value. The dependence of ocular dominance on visual overlap is evident.
The PCA results for partially overlapping receptive fields are presented in Figure 7. As mentioned above, it can be seen that the degree of overlap between receptive fields does not alter the optimal orientation, so that whenever a cell is selective its orientation is in the horizontal direction. The degree of overlap does affect the shape of the receptive fields and the degree of orientation selectivity that emerges under PCA; orientation selectivity decreases as the amount of overlap decreases. However, when there is no overlap at all, one again obtains higher selectivity. For PCA, there is a symmetry between the receptive fields of both eyes that
[Figure 6 plot; horizontal axis: shift in pixels, 0 to 24.]
Figure 6: Dependence of ocular dominance on visual input overlap for BCM learning: Binocularity increases when overlap increases. This box plot shows how the distribution of |B| changes with different overlap values between the inputs from the two eyes. The values were obtained from 100 runs at each overlap value. The shaded areas are the first and third quartiles, the brackets represent the three-quartile range, and any additional lines represent single outliers.
imposes binocularity. This arises from the invariance of the two-eye correlation function to a parity transformation (see Appendix). We also studied the possibility that under the PCA rule different orientation-selective cells would emerge if the misalignment between the cortical inputs was in the vertical direction. This tests the effect of a shift orthogonal to the preferred orientation. The results show that there is no change in the orientation preference; even in this case only horizontal receptive fields emerge. The PCA results described above were quite robust to the introduction of nonlinearity in a cell's activity. There was no qualitative difference in the results when a nonsymmetric sigmoidal transfer function was used.
Figure 7: Receptive fields for partially overlapping inputs using the PCA rule. Receptive field for an overlap value of O = 0.6 (top left). Receptive field for a small overlap, O = 0.2 (middle left). Receptive field for no overlap, O = −0.2 (bottom left). Receptive field for a shift in the vertical direction between the visual inputs, with O = 0.5 (top right). Receptive field for a shift at 36°, O = 0.5 (middle right). Receptive field for images that were rotated by 45°, O = 0.5 (bottom right). In all cases the cell is binocular and horizontal. The symmetry property evident in these receptive fields is analyzed in the Appendix.
5 Conclusions
In this paper we study whether unsupervised learning can produce both orientation selectivity and varying degrees of ocular dominance with the same set of assumptions about the visual environment. We use a visual environment composed of preprocessed natural images. The two eyes view portions of the same image, and we test how varying the degree of overlap affects ocular dominance. The PCA and BCM learning rules were chosen since they are representative of rules based on second- and third-order statistics; they have been used before with natural-image environments, and have stable fixed
points. Using the same visual environment, we have shown that misalignment in the synaptic density function generates various degrees of ocular dominance under the BCM rule, but fails to do so under the PCA rule, where only binocular cells emerge. This makes the BCM rule more consistent with biological findings (Hubel and Wiesel 1962). We have also shown that while this misalignment does not prevent cells from becoming selective to all orientations under the BCM rule, it is unable to overcome the slight horizontal bias in natural images, so that only horizontally selective cells emerge under the PCA rule. This is especially surprising when the misalignment is not in the horizontal direction. The results of the BCM model are in agreement with biological findings; oriented receptive fields of various degrees of binocularity emerge (Hubel and Wiesel 1962; van Sluyters and Levitt 1980; Orban 1984) and, in general, the degree of misalignment in mature receptive fields is smaller than that between the arbor functions (the original cortical input misalignment); it has been observed that the degree of receptive field misalignment is reduced by normal rearing (Pettigrew 1974). The more binocular BCM neurons are, the smaller the misalignment between their RF centers. In contrast, the displacement of cortical inputs, with single-cell PCA learning, always leads to binocular and horizontal cells. The result that the first PC is horizontal is in agreement with previous theoretical results (Hancock et al. 1992; Shouval and Liu 1996). The misalignment between RF centers is, in this case, close to zero for overlapping RFs and large for nonoverlapping RFs. Experiments with vertical prismatic deviation between the eyes, as well as experiments with horizontally surgically and optically strabismic kittens (van Sluyters and Levitt 1980), correspond directly to our simulations with a shift between receptive fields in the vertical and horizontal directions (Figs. 4-7).
Experiments with a vertical displacement are most similar to our simulations since, naturally, the two eyes are not displaced on the vertical axis, and one expects that the animal would have less ability to correct for displacement in the vertical direction. The findings of these experiments, that kittens show normal binocularity with a small deviation while larger deviations reduce binocularity, support the idea that displacement may indeed be the origin of the varying degrees of binocularity. These findings are in agreement with the BCM result, and not with the PCA result. A cell that is tuned to a nonzero disparity responds optimally to stimuli that fall either in front of or behind the focal plane. This corresponds to a shift of the optical axis of one eye with respect to the other eye. Thus, one might expect that by displacing the cortical receptive fields of the two eyes, disparity-tuned neurons would emerge, with disparity corresponding to the degree of displacement. Surprisingly, this is the case for neither BCM nor PCA neurons; our simulations show that when cells develop binocular receptive fields, they respond maximally to stimuli in the overlapping region of the arbor functions. In that region, their maximal response turns out to be on the focal plane (zero disparity) (Figs. 5 and 7). The more monocular BCM neurons also tend to develop a small nonzero disparity. The results for BCM neurons are in agreement with biological findings (LeVay and Voigt 1988) and with a recently introduced disparity tuning model (Berns et al. 1993). Despite this agreement with some disparity tuning results, it is important to remember that our simulations do not attempt to model disparity tuning. A model for the development of disparity tuning has to account for the fact that in a realistic 3D visual environment, objects that lie in front of the focal point are shifted in the opposite direction to objects that lie behind the focal point. Thus, the statistics of 3D stereo images are far more complex than what one would use in a 2D model. Such a representation of stereo images has recently been used by Li and Atick (1994). We believe that the fact that cells typically emerge with zero disparity tuning in our simulations is due to the lack of stereo information in our images. We expect that for PCA neurons, stereo information will not alter the fundamental result that PCA neurons are binocular (see the analysis in the Appendix). However, it remains to be seen how the distribution of best disparity would be altered. In an analytic study of the symmetry properties of the eigenstates of a misaligned two-eye problem, we show that the binocularity of PCA neurons results from the invariance of the two-eye correlation function to a parity transformation.⁶
Appendix: Symmetry Properties of the Eigenstates of the Two-Eye Problem

The evolution of neurons in a binocular environment under the PCA learning rule reaches a fixed point when
Qm = λm

(A.1)
where mᵀ = (m^l, m^r) comprises the left and right eye synaptic strengths, and the two-eye correlation function Q has the form

Q = | Q_ll  Q_lr |
    | Q_rl  Q_rr |

(A.2)

where Q_ll and Q_rr are the correlation functions within the left and right eyes, and Q_lr and Q_rl are the correlation functions between the left-right and right-left eyes. We denote by uppercase R's the coordinates in each receptive field with respect to a common origin, and by lowercase r's the coordinates from the centers of each of the receptive fields. Thus, R_ol

⁶It is important to note that this symmetry applies only to single principal components, but not to combinations of several PCs. In the model proposed recently by Erwin and Miller (1995) the final states are combinations of several PCs.
Figure 8: Coordinates for the two eyes. For a shift s between the two eyes, R_ol + s = R_or. Therefore R_r − R_l = R_or − R_ol + r_r − r_l = s + r_r − r_l.
and R_or are the coordinates of the centers of the left and right eyes, R_l and R_r are the coordinates of points in both receptive fields, and r_l and r_r are the coordinates of the same points with respect to the left and right receptive field centers. For a misalignment s between receptive field centers, R_ol + s = R_or. Therefore, R_r − R_l = R_or − R_ol + r_r − r_l = s + r_r − r_l (see Fig. 8). Using translational invariance, it is easy to see that

Q_ll = E[d(r_l)d(r_l′)] = Q(r − r′)
Q_rr = E[d(r_r)d(r_r′)] = E[d(r + s)d(r′ + s)] = Q(r − r′)
Q_lr = E[d(r_l)d(r_r′)] = E[d(r)d(r′ + s)] = Q(r − r′ − s)
Q_rl = E[d(r_r)d(r_l′)] = E[d(r + s)d(r′)] = Q(r − r′ + s)

where E denotes an average with respect to the environment and where, occasionally, for simplicity, we replace r_l by r. Since Q(r − r′) = E[d(r_l)d(r_l′)], we have Q(r − r′) = Q(r′ − r). We now introduce a two-eye parity operator P, which inverts the coordinates as well as the two eyes:

P: r_l ⇒ (−r_r),  r_r ⇒ (−r_l),  s ⇒ (−s)

(A.3)

It follows that under P, R_r − R_l = s + r_r − r_l ⇒ −s + r_l − r_r = −(R_r − R_l). The two-eye parity operator can also be written in matrix form in terms of the one-eye parity operator P̄:

P = | 0  P̄ |
    | P̄  0 |

(A.4)

The effect P has on the two-eye receptive fields m is

P [m_l(r_l); m_r(r_r)] = [m_r(−r_l); m_l(−r_r)]

Any correlation function that is invariant to a two-eye parity transformation P has eigenfunctions mᵀ(r) = [m_l(r_l), m_r(r_r)], which are also eigenfunctions of P. This imposes symmetry constraints on the resulting receptive fields that force them to be binocular. Any correlation function⁷ of the form

Q = | Q(r − r′)       Q̄(r − r′ − s) |
    | Q̄(r − r′ + s)   Q(r − r′)     |

(A.5)

is invariant to the two-eye parity transform P (that is, PQP = Q), as long as Q(x) = Q(−x) and Q̄(x) = Q̄(−x). Thus the eigenfunctions of Q are also eigenfunctions of P. The eigenvalue is ±1, since P² = 1. Therefore we deduce that
Pm = ±m

Thus

m_r(r) = ±m_l(−r)

This means that the receptive fields for the two eyes are inverted versions of each other up to a sign. Therefore for this learning rule the receptive fields are always perfectly binocular.

⁷This class includes the type of correlation function described in equation A.2, as well as the type postulated by Miller et al. (1989), in which s = 0 and Q̄ = ηQ. There, monocularity for nonoriented receptive fields is attained by choosing η < 0 and restricting weights to be positive.
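The parity argument above can be checked numerically on a discretized one-dimensional version of the problem (our own construction; the grid size, shift, and gaussian correlation function are illustrative choices). We build the two-eye correlation matrix with the block structure of equation A.5 from an even one-eye correlation function, build the two-eye parity operator P (swap the eyes and flip each coordinate), and verify that P² = 1 and PQP = Q, the conditions from which the binocularity of the eigenfunctions follows.

```python
import math

N, s = 7, 2                            # grid points per eye, shift in grid units
f = lambda x: math.exp(-x * x / 4.0)   # even one-eye correlation: Q(x) = Q(-x)

# Two-eye correlation matrix with blocks as in equation A.5 (discretized 1D):
#   [ Q(i-j)    Q(i-j-s) ]
#   [ Q(i-j+s)  Q(i-j)   ]
Q = [[0.0] * (2 * N) for _ in range(2 * N)]
for i in range(N):
    for j in range(N):
        Q[i][j]     = f(i - j)         # left-left
        Q[N+i][N+j] = f(i - j)         # right-right
        Q[i][N+j]   = f(i - j - s)     # left-right
        Q[N+i][j]   = f(i - j + s)     # right-left

# Two-eye parity operator: swap the eye blocks and flip each coordinate.
P = [[0.0] * (2 * N) for _ in range(2 * N)]
for i in range(N):
    P[i][N + (N - 1 - i)] = 1.0
    P[N + i][N - 1 - i]   = 1.0

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

PQP = matmul(matmul(P, Q), P)
ident = matmul(P, P)
print(max(abs(PQP[i][j] - Q[i][j]) for i in range(2 * N) for j in range(2 * N)))
# Since P^2 = 1 and PQP = Q, eigenvectors of Q can be chosen with Pm = +/- m,
# i.e., m_r(r) = +/- m_l(-r): the receptive fields are perfectly binocular.
```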
Acknowledgments The authors thank the members of the Institute for Brain and Neural Systems for many fruitful conversations. This research was supported by the Charles A. Dana Foundation, the Office of Naval Research, and the National Science Foundation.
References

Albus, K. 1975. Predominance of monocularly driven cells in the projection area of the central visual field in cat's striate cortex. Brain Res. 89, 341-347.
Atick, J. J., and Redlich, A. N. 1992. What does the retina know about natural scenes? Neural Comp. 4, 196-210.
Baddeley, R., and Hancock, P. 1991. A statistical analysis of natural images matches psychophysically derived orientation tuning curves. Proc. R. Soc. B 246, 219-223.
Berns, G. S., Dayan, P., and Sejnowski, T. J. 1993. A correlational model for the development of disparity selectivity in visual cortex that depends on prenatal and postnatal phases. Proc. Natl. Acad. Sci. U.S.A. 90, 8277-8281.
Bienenstock, E. L., Cooper, L. N., and Munro, P. W. 1982. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2, 32-48.
Blakemore, C., and van Sluyters, R. R. 1974. Reversal of the physiological effects of monocular deprivation in kittens: Further evidence for a sensitive period. J. Physiol. London 248, 663-716.
Blakemore, C., and van Sluyters, R. R. 1975. Innate and environmental factors in the development of the kitten's visual cortex. J. Physiol. 248, 663-716.
Clothiaux, E. E., Cooper, L. N., and Bear, M. F. 1991. Synaptic plasticity in visual cortex: Comparison of theory with experiment. J. Neurophysiol. 66, 1785-1804.
Dayan, P., and Goodhill, G. 1992. Perturbing Hebbian rules. In Advances in Neural Information Processing Systems 4. Morgan Kaufmann, San Mateo, CA.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
Enroth-Cugell, C., and Robson, J. 1966. The contrast sensitivity of retinal ganglion cells of the cat. J. Physiol. 187, 517-522.
Erwin, E., and Miller, K. D. 1995. Modeling joint development of ocular dominance and orientation maps in primary visual cortex. In Proceedings of Computation and Neural Systems.
Erwin, E., Obermayer, K., and Schulten, K. 1995. Models of orientation and ocular dominance in visual cortex. Neural Comp. 7(3), 425-468.
Field, D. J. 1987. Relations between the statistics of natural images and the response properties of cortical cells. J. Optical Soc. Am. 4, 2379-2394.
Field, D. J. 1989. What the statistics of natural images tell us about visual coding. SPIE 1077, 269-276.
Frégnac, Y., and Imbert, M. 1984. Development of neuronal selectivity in primary visual cortex of cat. Physiol. Rev. 64, 325-434.
Frégnac, Y., Thorpe, S., and Bienenstock, E. L. 1992. Cellular analogs of visual cortical epigenesis. I. Plasticity of orientation selectivity. J. Neurosci. 12(4), 1280-1300.
Hancock, P. J., Baddeley, R. J., and Smith, L. S. 1992. The principal components of natural images. Network 3, 61-70.
Hubel, D. H., and Wiesel, T. N. 1962. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. 160, 106-154.
Imbert, M., and Buisseret, P. 1975. Receptive field characteristics and plastic properties of visual cortical cells in kittens reared with or without visual experience. Exp. Brain Res. 22, 25-36.
Intrator, N. 1992. Feature extraction using an unsupervised neural network. Neural Comp. 4, 98-107.
Intrator, N., and Cooper, L. N. 1992. Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks 5, 3-17.
Intrator, N., and Cooper, L. N. 1995. Information theory and visual plasticity. In The Handbook of Brain Theory and Neural Networks, M. Arbib, ed., pp. 484-487. MIT Press, Cambridge, MA.
Intrator, N., Gold, J. I., Bülthoff, H. H., and Edelman, S. 1991. Three-dimensional object recognition using an unsupervised neural network: Understanding the distinguishing features. In Proceedings of the 8th Israeli Conference on AICV, Y. Feldman and A. Bruckstein, eds., pp. 113-123. Elsevier, Amsterdam.
Jones, J. P., and Palmer, L. A. 1987. The two-dimensional spatial structure of simple receptive fields in cat striate cortex. J. Neurophysiol. 58(6), 1187-1258.
Joshua, D. E., and Bishop, P. O. 1970. Binocular single vision and depth discrimination: Receptive field disparities for central and peripheral vision and binocular interaction on peripheral single units in cat striate cortex. Exp. Brain Res. 10, 389-416.
Kandel, E. R., and Schwartz, J. H., eds. 1991. Principles of Neural Science. Elsevier, New York.
Law, C. C., and Cooper, L. N. 1994. Formation of receptive fields in realistic visual environments according to the BCM theory. Proc. Natl. Acad. Sci. U.S.A. 91, 7797-7801.
LeVay, S., and Voigt, T. 1988. Ocular dominance and disparity coding in cat visual cortex. Visual Neurosci. 1, 395-414.
Li, Z., and Atick, J. J. 1994. Efficient stereo coding in the multiscale representation. Network 5, 157-174.
Linsker, R. 1986. From basic network principles to neural architecture (series). Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512, 8390-8394, 8779-8783.
Liu, Y., and Shouval, H. 1994. Localized principal components of natural images - an analytic solution. Network 5(2), 317-325.
Miller, K. D. 1994. A model for the development of simple cell receptive fields and the ordered arrangement of orientation columns through activity-dependent competition between on- and off-center inputs. J. Neurosci. 14, 409-441.
1040
H. Shouval et al.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1989. Ocular dominance column development: Analysis and simulation. Science 245, 605-615.
Mioche, L., and Singer, W. 1989. Chronic recording from single sites of kitten striate cortex during experience-dependent modification of synaptic receptive-field properties. J. Neurophysiol. 62, 185-197.
Movshon, J. A., and van Sluyters, R. C. 1981. Visual neural development. Annu. Rev. Psychol. 32, 477-522.
Nass, M. N., and Cooper, L. N. 1975. A theory for the development of feature detecting cells in visual cortex. Biol. Cybern. 19, 1-18.
Nikara, T., Bishop, P. O., and Pettigrew, J. D. 1968. Analysis of retinal correspondence by studying single units in cat striate cortex. Exp. Brain Res. 6, 353-372.
Oja, E. 1982. A simplified neuron model as a principal component analyzer. J. Math. Biol. 15, 267-273.
Orban, G. A. 1984. Neuronal Operations in the Visual Cortex. Springer-Verlag, Berlin.
Perez, R., Glass, L., and Shlaer, R. J. 1975. Development of specificity in the cat visual cortex. J. Math. Biol. 1, 275.
Pettigrew, J. D. 1974. The effect of visual experience on the development of stimulus specificity by kitten cortical neurons. J. Physiol. 237, 49-74.
Ruderman, D. L., and Bialek, W. 1994. Statistics of natural images: Scaling in the woods. In Advances in Neural Information Processing Systems 6, J. D. Cowan and J. Alspector, eds. Morgan Kaufmann, San Mateo, CA.
Sejnowski, T. J. 1977. Storing covariance with nonlinearly interacting neurons. J. Math. Biol. 4, 303-321.
Shouval, H., and Liu, Y. 1996. Principal component neurons in a realistic visual environment. Network, in press.
van Sluyters, R. 1977. Artificial strabismus in the kitten. Invest. Ophthalmol. Vis. Sci. Suppl. 16, 40.
van Sluyters, R. C., and Levitt, F. B. 1980. Experimental strabismus in the kitten. J. Neurophysiol. 43, 689-699.
von der Malsburg, C. 1973. Self-organization of orientation sensitivity cells in the striate cortex. Kybernetik 14, 85-100.
Wiesel, T. N., and Hubel, D. H. 1963. Single-cell responses in striate cortex of kittens deprived of vision in one eye. J. Neurophysiol. 26, 1003-1017.
Received April 24, 1995; accepted January 8, 1996.
This article has been cited by: 2. A. Bazzani , D. Remondini , N. Intrator , G. C. Castellani . 2003. The Effect of Noise on a Class of Energy-Based Learning RulesThe Effect of Noise on a Class of Energy-Based Learning Rules. Neural Computation 15:7, 1621-1640. [Abstract] [PDF] [PDF Plus] 3. Brian Blais , Leon N. Cooper , Harel Shouval . 2000. Formation of Direction Selectivity in Natural Scene EnvironmentsFormation of Direction Selectivity in Natural Scene Environments. Neural Computation 12:5, 1057-1066. [Abstract] [PDF] [PDF Plus]
Communicated by Francoise Fogelman-Soulie
A Novel Optimizing Network Architecture with Applications Anand Rangarajan Department of Diagnostic Radiology, Yale University, New Haven, CT 06520-8042 USA
Steven Gold Department of Computer Science, Yale University, New Haven, CT 06520-8285 USA
Eric Mjolsness Department of Computer Science and Engineering, University of California San Diego (UCSD), La Jolla, CA 92093-0214 USA
We present a novel optimizing network architecture with applications in vision, learning, pattern recognition, and combinatorial optimization. This architecture is constructed by combining the following techniques: (1) deterministic annealing, (2) self-amplification, (3) algebraic transformations, (4) clocked objectives, and (5) softassign. Deterministic annealing in conjunction with self-amplification avoids poor local minima and ensures that a vertex of the hypercube is reached. Algebraic transformations and clocked objectives help partition the relaxation into distinct phases. The problems considered have doubly stochastic matrix constraints or minor variations thereof. We introduce a new technique, softassign, which is used to satisfy this constraint. Experimental results on different problems are presented and discussed.

1 Introduction
Optimizing networks have been an important part of neural computation since the seminal work of Hopfield and Tank (Hopfield and Tank 1985). The attractive features of these networks (intrinsic parallelism, continuous descent inside a hypercube, ease in programming, and mapping onto analog VLSI) raised tremendous hopes of finding good solutions to many "hard" combinatorial optimization problems. The results (for both speed and accuracy) have been mixed. This can be attributed to a number of factors, viz., slow convergence of gradient descent algorithms, inadequate problem mappings, and poor constraint satisfaction. In contrast, we have achieved considerable success with a new optimizing network architecture for problems in vision, learning, pattern recognition, and combinatorial optimization. This architecture is constructed by combining the following techniques: (1) deterministic annealing, (2) self-amplification, (3) algebraic transformations, (4) clocked objectives, and (5) softassign. Deterministic annealing ensures gradual progress toward a vertex of the hypercube (in combinatorial problems) and avoids poor local minima. Self-amplification in conjunction with annealing ensures that a vertex of the hypercube is reached. With the application of algebraic transformations and clocked objectives, the relaxation gets partitioned into distinct phases, highly reminiscent of the Expectation-Maximization (EM) algorithm. All the problems considered have permutation matrix constraints or minor variations thereof. The permutation matrix constraints get modified to doubly stochastic matrix constraints with the application of deterministic annealing. A new technique, softassign, is used to satisfy doubly stochastic matrix constraints at each temperature setting. First, previous work most closely related to ours is chronologically traced in Section 2. This helps us set up the derivation of our network architecture in Section 3, carried out with graph isomorphism as an example. The application of the network architecture to problem examples in vision, learning, pattern recognition, and combinatorial optimization is demonstrated in Section 4. The problems considered are (1) graph isomorphism and weighted graph matching (pattern recognition), (2) the traveling salesman problem (combinatorial optimization), (3) 2D and 3D point matching or pose estimation with unknown correspondence (vision), and (4) clustering with domain-specific distance measures (unsupervised learning).

Neural Computation 8, 1041-1060 (1996) © 1996 Massachusetts Institute of Technology

2 Relationship to Previous Work
In this section, we begin with the traveling salesman problem (TSP) energy function first formulated by Hopfield and Tank (1985) and then briefly, chronologically trace various developments that led to our formulation. In Hopfield and Tank (1985), the TSP was formulated as follows:

min_M E_tsp(M) = Σ_{a=1}^{N} Σ_{i=1}^{N} Σ_{j=1}^{N} d_ij M_ai M_{(a⊕1)j}   (2.1)

where d_ij is the distance between city i and city j with a total of N cities. (The notation a ⊕ 1 is used to indicate that subscripts are defined modulo N, i.e., M_{(N+1)j} = M_{1j}.) In equation 2.1, M is a permutation matrix. (A permutation matrix is a square zero-one matrix with rows and columns summing to one.) Permutation matrix constraints are
1. Σ_{a=1}^{N} M_ai = 1 (column constraint), 2. Σ_{i=1}^{N} M_ai = 1 (row constraint), and 3. M_ai ∈ {0,1} (integrality constraint).
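To make equation 2.1 concrete, here is a small sketch (hypothetical names; NumPy assumed) that evaluates E_tsp for a candidate permutation matrix M, where M_ai = 1 means city i is visited at stop a; for a valid tour the energy is exactly the tour length:

```python
import numpy as np

def tsp_energy(M, d):
    """E_tsp(M) = sum_{a,i,j} d_ij * M_ai * M_{(a+1 mod N) j} (equation 2.1)."""
    M_next = np.roll(M, -1, axis=0)  # row a of M_next is row a+1 (mod N) of M
    return float(np.einsum('ai,ij,aj->', M, d, M_next))

# three cities forming a 3-4-5 right triangle
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
M = np.eye(3)  # visit city a at stop a
energy = tsp_energy(M, d)  # tour length 3 + 5 + 4 = 12
```

For a permutation matrix the double product selects exactly one distance per tour leg, so the triple sum reduces to the length of the closed tour.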
A permutation matrix naturally expresses the TSP constraints; each city is visited exactly once (Σ_{a=1}^{N} M_ai = 1) and exactly one city is visited on each day of the tour (Σ_{i=1}^{N} M_ai = 1). When the integrality constraint is relaxed, the permutation matrix constraints get modified to doubly stochastic matrix constraints. (A doubly stochastic matrix is a square positive matrix with rows and columns summing to one.) Doubly stochastic matrix constraints are 1. Σ_{a=1}^{N} M_ai = 1 (column constraint), 2. Σ_{i=1}^{N} M_ai = 1 (row constraint), and 3. M_ai > 0 (positivity constraint).
In the original TSP energy function (Hopfield and Tank 1985), the doubly stochastic matrix constraints were enforced using soft penalties [penalty functions with fixed parameters as opposed to traditional penalty functions (Luenberger 1984)] and a barrier function. The energy function used can be written as E = E_tsp + E_con (2.2), with E_tsp defined as in equation 2.1 and E_con containing penalty terms with fixed parameters A, B, and C together with a barrier term Σ_{a,i} φ(M_ai), where φ is a [somewhat nontraditional (Luenberger 1984)] barrier function such as

φ(M_ai) = U_ai M_ai − log[1 + exp(U_ai)] = M_ai log(M_ai) + (1 − M_ai) log(1 − M_ai)   (2.3)

The barrier function φ ensures that the M_ai are confined inside the unit hypercube and this is tantamount to using the sigmoid nonlinearity (M_ai = 1/[1 + exp(−U_ai)]). In Hopfield and Tank (1985), all the parameters A, B, C, β were set to fixed values. A lot of theoretical and experimental work (Wilson and Pawley 1988; Kamgar-Parsi and Kamgar-Parsi 1990; Aiyer et al. 1990) went into searching for valid parameter spaces. The overall conclusion was that it was impossible to guarantee that the network dynamics corresponding to gradient descent on the TSP energy function converged to a valid solution, namely a permutation matrix. Different energy functions in the same vein (soft penalties) (Mjolsness 1987) did not change the overall conclusions reached by Wilson and Pawley (1988). For example, the presence of invalid solutions was reported in Mjolsness (1987) for the constraint energy function

E_con = (A/2) Σ_a (Σ_i M_ai − 1)² + (B/2) Σ_i (Σ_a M_ai − 1)² − (γ/2) Σ_{a,i} M_ai²   (2.4)
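The two expressions for the barrier function in equation 2.3 can be checked numerically; the sketch below (hypothetical names) verifies that U M − log(1 + e^U) equals the entropy form M log M + (1 − M) log(1 − M) when M is the sigmoid of U:

```python
import math

def barrier_dual(U):
    """phi in the U variable: U*M - log(1 + exp(U)), with M = sigmoid(U)."""
    M = 1.0 / (1.0 + math.exp(-U))  # sigmoid nonlinearity
    return U * M - math.log(1.0 + math.exp(U))

def barrier_entropy(U):
    """phi in the M variable: M log M + (1 - M) log(1 - M)."""
    M = 1.0 / (1.0 + math.exp(-U))
    return M * math.log(M) + (1.0 - M) * math.log(1.0 - M)

# the two forms agree for any U
checks = [(barrier_dual(U), barrier_entropy(U)) for U in (-2.0, -0.5, 0.3, 1.7)]
```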
The energy function above (equation 2.4) has explicit row and column penalty functions and a self-amplification term [−(γ/2) Σ_{a,i} M_ai²]. (The self-amplification term is explained in greater detail in Section 3.) While the two energy functions (equations 2.2 and 2.4) are very similar, they cannot be derived from one another even for special settings of the parameters. Their similarity stems from the choice of soft penalty functions for the constraints. The penalty functions in equation 2.4 express the row and column doubly stochastic matrix constraints (winner-take-alls). In Peterson and Soderberg (1989), Van den Bout and Miller (1989), Geiger and Yuille (1991), Simić (1990), and Waugh and Westervelt (1993), after noting the similarity of equation 2.2 to mean field techniques in spin glasses [indicated in Hopfield and Tank (1985)], one of the constraints was enforced as a hard constraint using the Potts glass (Kanter and Sompolinsky 1987):
with the variable U playing a similar role as in equation 2.3. The column constraint has been dropped from the energy function in equation 2.4 and a new barrier function has been added which explicitly and exactly enforces the constraint Σ_a M_ai = 1 using the softmax nonlinearity (M_ai = exp(U_ai)/Σ_b exp(U_bi)) (Bridle 1990). Also, annealing on the parameter β (the inverse temperature) was used [again indicated in Hopfield and Tank (1985)]. This combination of deterministic annealing, self-amplification, softmax, and a penalty term performed significantly better than the earlier Hopfield-Tank network on problems like TSP, graph partitioning (Peterson and Soderberg 1989; Van den Bout and Miller 1990), and graph isomorphism (Simić 1991). The next step was taken in Kosowsky and Yuille (1994) to strictly enforce the row constraint Σ_i M_ai = 1 using a Lagrange parameter μ:
In the above energy function, the column constraint is enforced exactly using the softmax and the row constraint is enforced strictly using a Lagrange parameter μ. Both the row and column constraints are satisfied at each setting of the inverse temperature β albeit in different ways. While this looks asymmetric, that is only apparently the case (Yuille and Kosowsky 1994). To see this, differentiate equation 2.6 with respect to U
and set the result to zero. We get

E_con = Σ_a μ_a (Σ_i M_ai − 1) + (1/β) Σ_{a,i} M_ai log(M_ai) + (1/β) Σ_i log Σ_a exp(U_ai)   (2.7)

This objective function is still not quite symmetric. However, replacing (1/β) log Σ_a exp(U_ai) by ν_i + c (ν_i is a new variable replacing U_ai) we get

E_con = Σ_a μ_a (Σ_i M_ai − 1) + Σ_i ν_i (Σ_a M_ai − 1) + (1/β) Σ_{a,i} M_ai [log(M_ai) − 1] − (γ/2) Σ_{a,i} M_ai²   (2.8)

where we have set c = −1/β. This result was first shown in Yuille and Kosowsky (1994). The above constraint energy (equation 2.8) combines deterministic annealing (via variation of β), self-amplification (via the γ term), and constraint satisfaction (via the Lagrange parameters μ and ν) while keeping all entries of M nonnegative [via the x log(x) barrier function]. It plays a crucial role in our network architecture as described in Section 3. Note that it is also quite general: the constraint energy can be applied to any problem with permutation matrix constraints (or minor modifications thereof) such as TSP, graph isomorphism, point matching, and graph partitioning. Writing down the constraint energy function is not the whole story. The actual dynamics used to perform energy minimization and constraint satisfaction is crucial to the success of the optimizing network in terms of speed, accuracy, parallelizability, and ease of implementation. Networks arising from the application of projected gradient descent (Yuille and Kosowsky 1994) or subspace methods (Gee and Prager 1994) have the advantage of proofs of convergence to a fixed point. Such networks could eventually become viable candidates for optimization when implemented in analog VLSI, but they are typically too slow when implemented on digital computers to be competitive with traditional algorithms. In this paper, we derive a discrete-time neural network architecture from the constraint energy function equation 2.8. The discrete algorithms (resulting from the application of the architecture to specific combinatorial optimization problems) can be easily implemented on digital computers and we demonstrate the performance on several different problems:
graph isomorphism and matching, TSP, point matching, and clustering with smart distance measures.

3 Deriving the Network Architecture
We now describe the five techniques used in deriving the network architecture: (1) deterministic annealing, (2) self-amplification, (3) algebraic transformations, (4) clocked objectives, and (5) softassign. In the process, we also derive the corresponding discrete-time neural network algorithms using graph isomorphism (Mjolsness et al. 1989; Simić 1991) as an example. The same network architecture is subsequently used in all applications. We formulate the graph isomorphism problem as follows: given the adjacency matrices G and g of two undirected graphs G(V, E) and g(v, e),

min_M E_gi(M) = (1/2) Σ_{a=1}^{N} Σ_{i=1}^{N} (Σ_{b=1}^{N} G_ab M_bi − Σ_{j=1}^{N} M_aj g_ji)²   (3.1)

subject to

Σ_{i=1}^{N} M_ai = 1 ∀a,  Σ_{a=1}^{N} M_ai = 1 ∀i,  M_ai ∈ {0,1}   (3.2)

where G_ab and g_ij are in {0,1} with a link being present (absent) between nodes a and b in graph G(V, E) if the entry G_ab in the adjacency matrix G is one (zero). A similar condition holds for the adjacency matrix g corresponding to graph g(v, e). Consequently, the adjacency matrices G and g are symmetric, with all-zero diagonal entries. Both graphs are assumed to have N nodes. In equation 3.2, M is a permutation matrix (exactly the same requirement as in TSP) with only one "1" in each row and column indicating that each node a ∈ {1, ..., N} in graph G(V, E) matches to one and only one node i ∈ {1, ..., N} in graph g(v, e) and vice versa. Following the mean-field line of development summarized in Section 2, the energy function for graph isomorphism (expressed in the form Cost + Constraints) can be written as

E = E_gi + E_con   (3.3)

where E_con is the constraint energy of equation 2.8. As detailed in Section 2, this form of the energy function has an x log(x) barrier function and two Lagrange parameters μ and ν enforcing the doubly stochastic matrix constraints along with a self-amplification term with a free parameter γ. In the remainder of this section, we describe the corresponding network architecture.
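As a sanity check on the formulation (a sketch with hypothetical names, assuming the quadratic mismatch form of equation 3.1), the energy vanishes exactly when M is the permutation relating the two graphs:

```python
import numpy as np

def gi_energy(M, G, g):
    """E_gi(M) = 0.5 * ||G M - M g||_F^2 (equation 3.1)."""
    R = G @ M - M @ g
    return 0.5 * float(np.sum(R * R))

rng = np.random.default_rng(0)
N = 6
upper = np.triu((rng.random((N, N)) < 0.4).astype(float), 1)
G = upper + upper.T                # symmetric 0/1 adjacency, zero diagonal
P = np.eye(N)[rng.permutation(N)]  # a random permutation matrix
g = P.T @ G @ P                    # isomorphic copy of G

zero_energy = gi_energy(P, G, g)   # exact isomorphism -> energy 0
```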
3.1 Deterministic Annealing. The x log(x) barrier function in equation 3.3 keeps the entries in M nonnegative. It can also be seen to arise in a principled manner from statistical physics (Yuille and Kosowsky 1994), and we have already (in Section 2) briefly indicated its relationship to Potts glass (softmax) methods. The barrier function parameter β is similar to the inverse temperature in simulated and mean-field annealing methods and is varied according to an annealing schedule. Deterministic annealing ensures gradual progress toward a vertex of the hypercube. Varying the annealing parameter also provides some control on the nonconvexity of the objective function. At low values of β, the objective function is nearly convex and easily minimized. While we have yet to describe our descent strategies, they are deterministic and performed within the annealing procedure.
3.2 Self-Amplification. Deterministic annealing by itself cannot guarantee that the network will converge to a valid solution. However, self-amplification (von der Malsburg 1990) in conjunction with annealing will converge for the constraints in equation 3.2 as shown in Yuille and Kosowsky (1994). A popular choice is the self-amplification term of equation 2.8 (Mjolsness 1987; Peterson and Soderberg 1989; Rangarajan and Mjolsness 1994):

−(γ/2) Σ_{a,i} M_ai²   (3.4)

Another closely related self-amplification term is of the form x(1 − x) (Koch et al. 1986), which is functionally equivalent to equation 3.4 for the problems considered here. Self-amplification in conjunction with annealing ensures that a vertex of the hypercube is found. In our work, the self-amplification parameter γ is usually held fixed, but in Gee and Prager (1994), the authors mention the use of annealing the self-amplification parameter, which they term hysteretic annealing. In graph isomorphism and TSP, the choice of γ plays an important role in governing the behavior of phase transitions and bifurcations (local and global). Some effort has gone into analysis of bifurcation behavior in TSP and graph partitioning in the context of self-amplification used within Potts glass (softmax) approaches (Peterson and Soderberg 1989; Van den Bout and Miller 1990).

3.3 Algebraic Transformations. An algebraic transformation (Mjolsness and Garrett 1990), essentially a Legendre transformation (Elfadel 1995), transforms a minimization problem into a saddle-point problem. The advantage of this operation is two-fold: (1) it cuts network costs in terms of connections and (2) the transformed objectives are easier to
extremize. Consider the following transformation:

X²/2 = max_A [A X − A²/2]   (3.5)
X is typically an error measure but in principle can be any expression. The right side of equation 3.5 is an objective function to be maximized with respect to A. The maximization is trivial; A at its fixed point equals the expression X. However, the algebraic transformation makes the energy function linear in X, which (as we shall see) turns out to be quite useful. Using the transformation (equation 3.5) we transform the graph isomorphism objective to

E(M, A, σ, μ, ν) = Σ_{a,i} A_ai (Σ_b G_ab M_bi − Σ_j M_aj g_ji) − (1/2) Σ_{a,i} A_ai² − γ Σ_{a,i} σ_ai M_ai + (γ/2) Σ_{a,i} σ_ai² + Σ_a μ_a (Σ_i M_ai − 1) + Σ_i ν_i (Σ_a M_ai − 1) + (1/β) Σ_{a,i} M_ai [log(M_ai) − 1]   (3.6)
The objective becomes linear in M [except for the x log(x) barrier function] following the application of the transformation in equation 3.5 to the graph isomorphism and self-amplification terms. The algebraic transformation has made it possible to solve for M directly in equation 3.6. Two extra variables A and σ have been introduced, which can now be separately controlled. This has come at the expense of finding a saddle point: minimization with respect to M and σ, and maximization with respect to A, μ, and ν.

3.4 Clocked Objectives. After performing the algebraic transformations, closed-form solutions can be obtained for A, M, and σ by differentiating equation 3.6 with respect to the appropriate variable and solving for it.
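The fixed-point property of the transformation in equation 3.5 (the maximum over A is attained at A = X and equals X²/2) can be checked numerically with a small sketch (hypothetical names):

```python
import numpy as np

def transformed(x, a):
    """Right-hand side of equation 3.5: A*X - A^2/2."""
    return a * x - 0.5 * a ** 2

x = 1.7
a_grid = np.linspace(-5.0, 5.0, 10001)  # candidate values of A
vals = transformed(x, a_grid)
best_a = a_grid[np.argmax(vals)]        # maximizer: A = X
best_val = vals.max()                   # maximum: X^2 / 2
```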
A_ai = Σ_b G_ab M_bi − Σ_j M_aj g_ji   (3.7)

M_ai = exp[β(Q_ai − μ_a − ν_i)]   (3.8)

and

σ_ai = M_ai   (3.9)

These closed-form solutions are to be used in an iterative scheme that cycles between the updates of the variables A, M, μ, ν, and σ (all mutually interdependent). The control structure for performing the network dynamics may be specified by a clocked objective function (Mjolsness and Miranker 1993):   (3.10)

where the notation E⟨x, y⟩ denotes a clocked objective: E(x, y) is optimized with respect to x keeping y fixed in phase 1 and vice versa in phase 2. The two phases are iterated when necessary and the angle bracket notation can be nested to indicate nested loops. The superscript notation (·)^a indicates an analytic solution within a phase. The notation β↑ is used to indicate that the inverse temperature is increased (according to a prespecified annealing schedule) after the update of σ. With this notation, the clocked objective states that closed-form solutions of A, σ, (μ, M), and (ν, M) are used by the network: The network dynamics for graph isomorphism begins by setting A to its closed-form solution followed by an M update, which contains the (as yet undetermined) Lagrange parameters μ and ν. The exponential form of the closed-form solution for M keeps all its entries positive. Positivity of each entry of M followed by the correct setting of the Lagrange parameters ensures that a doubly stochastic matrix is obtained at each temperature. After A, M, and the Lagrange parameters converge, σ is updated and β increased. Clocked objectives allow analytic solutions within each phase, highly reminiscent of the EM algorithm (Jordan and Jacobs 1994). We have left unspecified the manner in which the Lagrange parameters are updated; this is the topic of the next section.
3.5 Softassign. The clocked objective in equation 3.10 contains two phases where the Lagrange parameters μ and ν corresponding to the row and column constraints have to be set. Gradient ascent and descent on the Lagrange parameters (μ, ν) and on M, respectively, in equation 2.8 may result in a very inefficient algorithm. Gradient projection methods (Yuille and Kosowsky 1994), subspace and orthogonal projection methods (Gee et al. 1993; Gee and Prager 1994), and Lagrangian relaxation methods (Rangarajan and Mjolsness 1994) suffer from the same problems when implemented on digital computers. The principal difficulty these methods have is the efficient satisfaction of all three doubly stochastic matrix constraints (see Section 2). For example, in Gee and Prager (1994) and Wolfe et al. (1994), orthogonal projection followed by scaling or clipping is iteratively employed to satisfy the constraints.
Fortunately, doubly stochastic matrix constraints can be satisfied in an efficient manner via a remarkable theorem due to Sinkhorn (1964): a doubly stochastic matrix can be obtained from any positive square matrix by the simple process of alternating row and column normalizations. (Recall that in the previous section on clocked objectives, the exponential form of M in equation 3.8 ensures positivity.) Each normalization is a projective scaling transformation (Strang 1986).

M_ai ← M_ai / Σ_{i=1}^{N} M_ai ,   a ∈ {1, ..., N}   (3.11)
The row and column constraints are satisfied by iterating equation 3.11. At first, Sinkhorn's theorem may appear to be unrelated to the constraint energy function in equation 3.3. However, this is not the case. Iterated row and column normalization can be directly related to solving for the Lagrange parameters (μ and ν) in equation 3.3. To see this, examine the solution for M above in the clocked objective:

M_ai = exp[β(Q_ai − μ_a − ν_i)]   (3.12)

where

Q_ai = −Σ_b G_ba A_bi + Σ_j A_aj g_ji + γ σ_ai   (3.13)

The solution contains the two (still undetermined) Lagrange parameters μ and ν. The clocked objective in equation 3.10 contains a phase where relaxation proceeds on the pair (M, μ) and then on (M, ν). Assume an odd-even sequence of updating where the (k + 1)th update of the Lagrange parameter μ is associated with the (2k + 1)th update of M and the kth update of the Lagrange parameter ν is associated with the 2kth update of M. Now,

M_ai^(2k+1) = exp[β(Q_ai − μ_a^(k+1) − ν_i^(k))]   (3.14)

and

M_ai^(2k) = exp[β(Q_ai − μ_a^(k) − ν_i^(k))]   (3.15)

Taking ratios, we get

M_ai^(2k+1) / M_ai^(2k) = exp[−β(μ_a^(k+1) − μ_a^(k))]   (3.16)
Setting the derivative of the energy function in equation 3.6 with respect to μ_a to zero (∂E/∂μ_a = 0), we solve for the row constraint:

Σ_i M_ai^(2k+1) = 1   (3.17)

From equations 3.15, 3.16, and 3.17, we get

M_ai^(2k+1) = M_ai^(2k) / Σ_j M_aj^(2k)   (3.18)

We have shown that the clocked phase (M, μ) can be replaced by row normalization of M. A similar relationship obtains for the phase (M, ν) and column normalization. Note that Q remains constant during the row and column normalizations. We have demonstrated a straightforward connection between our constraint energy function in equation 3.3 and Sinkhorn's theorem: solving for the Lagrange parameters in equation 3.6 is identical to iterated row and column normalization. Henceforth, we refer to this important procedure as iterative projective scaling since essentially a projective scaling operation (equation 3.11) is iterated until convergence is achieved. Iterative projective scaling coupled with the exponential form of M satisfies all three doubly stochastic matrix constraints. The x log(x) barrier function and the Lagrange parameters μ and ν in the constraint energy function (equation 3.3) have been translated into an operation that first makes all entries in M positive (exponential form of M) followed by iterated row and column normalization. Due to the importance of both these factors (positivity and iterative projective scaling) and due to the similarity to the softmax nonlinearity (which enforces either the row or the column constraint but not both), this operation is termed softassign. The simplest optimization problem with two-way row and column permutation matrix constraints is the assignment problem (Luenberger 1984). Softassign satisfies two-way assignment constraints as opposed to softmax, which merely satisfies one-way winner-take-all constraints. Softassign is depicted in Figure 1a. The graph isomorphism algorithm is summarized in Figure 1b and in the form of a pseudocode below.
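As a standalone sketch (hypothetical names; NumPy assumed), the softassign operation, exponentiation for positivity followed by iterated row and column normalization, can be written as:

```python
import numpy as np

def softassign(Q, beta, n_iters=100, tol=1e-9):
    """Exponentiate beta*Q for positivity, then alternate row and column
    normalizations (Sinkhorn) until a doubly stochastic matrix is reached."""
    M = np.exp(beta * (Q - Q.max()))  # subtract max for numerical stability
    for _ in range(n_iters):
        M_prev = M.copy()
        M /= M.sum(axis=1, keepdims=True)  # row normalization
        M /= M.sum(axis=0, keepdims=True)  # column normalization
        if np.abs(M - M_prev).max() < tol:
            break
    return M

rng = np.random.default_rng(1)
M = softassign(rng.random((5, 5)), beta=3.0)
```

By Sinkhorn's theorem, the returned matrix has positive entries with every row and column summing to one.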
Pseudocode for graph isomorphism

Initialize β to β₀, M_ai to 1/N + r, σ_ai to M_ai
Begin A: Do A until β ≥ β_f
  Begin B: Do B until all M_ai converge or number of iterations > I₀
    A_ai ← Σ_{b=1}^{N} G_ab M_bi − Σ_{j=1}^{N} M_aj g_ji
    Q_ai ← −Σ_{b=1}^{N} G_ba A_bi + Σ_{j=1}^{N} A_aj g_ji + γ σ_ai
Table 1: Definitions of Additional Symbols Used in the Graph Isomorphism Algorithm

β₀ : Initial value of β
β_r : Rate of increase of β
β_f : Final value of β
r : Random number uniform in [0,1]
I₀ : Maximum number of iterations at each β
I₁ : Maximum number of iterations for softassign
    M_ai ← exp(β Q_ai)
    Begin C: Do C until all M_ai converge or number of iterations > I₁
      Update M_ai by normalizing the rows:
        M_ai ← M_ai / Σ_{j=1}^{N} M_aj
      Update M_ai by normalizing the columns:
        M_ai ← M_ai / Σ_{b=1}^{N} M_bi
    End C
    σ_ai ← M_ai
  End B
  β ← β_r β
End A
Definitions of additional symbols used can be found in Table 1. In Kosowsky and Yuille (1994), softassign is used within deterministic annealing to find the global minimum in the assignment problem. And as we shall demonstrate, softassign is invaluable as a tool for constraint satisfaction in more difficult problems like parametric assignment (point matching) and quadratic assignment (graph isomorphism and TSP) problems. To summarize, deterministic annealing creates a sequence of objective functions that approaches the original objective function (as β is increased). Self-amplification in conjunction with annealing ensures that a vertex of the hypercube is reached (for proper choices of the γ parameter). Algebraic transformations in conjunction with clocked objectives help partition the relaxation into separate phases within which analytic solutions can be found. Doubly stochastic matrix constraints are satisfied by softassign. With the clocked objectives, analytic solutions, and softassign in place, we have our network architecture. Figure 1 depicts the network architecture for graph isomorphism. Note that by adopting closed-form solutions and the softassign within our clocked objectives and by eschewing gradient descent methods (with associated line searches and/or gradient projections), we obtain discrete-time, parallel updating neural networks. While, at present, we do not have proofs of convergence to fixed points (or limit cycles) for these networks, we report wide-ranging success in using them to solve problems in vision, unsupervised learning, pattern recognition, and combinatorial optimization.

Figure 1: (a) Softassign. Given any square matrix {Q_ai}, softassign returns a doubly stochastic matrix. (b) The network architecture for graph isomorphism. The pseudocode for the graph isomorphism algorithm explains the algorithm in greater detail.
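To illustrate softassign inside deterministic annealing on the assignment problem, the simplest problem with two-way constraints, here is a minimal sketch (hypothetical names and parameter values); for small instances it recovers the optimum found by brute force:

```python
import numpy as np
from itertools import permutations

def softassign(Q, beta, n_iters=200):
    M = np.exp(beta * (Q - Q.max()))
    for _ in range(n_iters):
        M /= M.sum(axis=1, keepdims=True)  # row normalization
        M /= M.sum(axis=0, keepdims=True)  # column normalization
    return M

def anneal_assignment(B, beta0=1.0, beta_rate=1.2, beta_final=200.0):
    """Maximize sum_ai B_ai M_ai over permutation matrices by increasing
    beta and applying softassign at each temperature."""
    beta, M = beta0, None
    while beta < beta_final:
        M = softassign(B, beta)
        beta *= beta_rate
    return M

rng = np.random.default_rng(2)
B = rng.random((4, 4))
M = anneal_assignment(B)
assignment = M.argmax(axis=1)  # near-permutation at high beta

# brute-force optimum for comparison
best_score = max(sum(B[a, p[a]] for a in range(4)) for p in permutations(range(4)))
score = sum(B[a, assignment[a]] for a in range(4))
```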
4 Problem Examples

We now apply the same methodology used in deriving the discrete-time graph isomorphism network to several problems in vision, learning, pattern recognition, and combinatorial optimization. None of the networks used penalty functions, gradient descent with line search parameters, or packaged constraint solvers. The relevant free parameters in all the problems are (1) the annealing schedule for β, (2) choice of the self-amplification parameter γ (when applicable), (3) convergence criterion at each temperature, (4) convergence criterion for softassign, and (5) overall convergence of the network. In all experiments, Silicon Graphics workstations with R4000 and R4400 processors were used.
Figure 2: (a) One hundred node graph isomorphism at connectivities of 1, 3, 5, and 10-50% in steps of 10; (b) 50 and 100 node graph matching.

4.1 Graph Isomorphism and Matching. We have already examined the graph isomorphism problem and derived the corresponding network (Fig. 1). To test graph isomorphism, 100 node graphs were generated with varying connectivities (1, 3, 5, and 10 to 50% in steps of 10). Figure 2a depicts graph isomorphism for 100 nodes with 1000 test instances at each connectivity. Each isomorphism instance takes about 80 sec. The figure shows the percentage of correct isomorphisms obtained for different connectivities. The network essentially performs perfectly for connectivities of 5% or more. In contrast, Simić's deterministic annealing network (Simić 1991) could not reliably find isomorphisms for all connectivities less than 30% in 75 node random graphs. The major difference between the two networks is our use of softassign versus Simić's use of softmax and a penalty function for the doubly stochastic matrix constraints. A second difference is our use of a discrete-time network versus Simić's use of gradient descent. Elsewhere, we have reported closely related (slower but more accurate) Lagrangian relaxation networks (employing gradient descent) for 100 node graph isomorphism and matching (Rangarajan and Mjolsness 1994) and these also compare favorably with Simić's results. In addition to graph isomorphism, we also tested graph matching (von der Malsburg 1988) for the restricted case of equal numbers of nodes and links in the two graphs. The network used for isomorphism is also applicable to matching. To test graph matching, 100 node graphs were generated with link weights in [0, 1]. The distorted graph g(v, e) was generated by randomly permuting the nodes and adding uniform noise (at several different standard deviations) to the links. Figure 2b depicts graph matching for 100 and 50 nodes (no missing or extra nodes) with 200 and 1000 test instances, respectively, at each standard deviation.
Figure 3: Histogram plot of tour lengths in the 100 city Euclidean TSP problem.

The 50 node and 100 node graph matching optimizations take about 10 and 80 sec, respectively. The results are markedly superior to three nonneural methods of graph matching reported in the literature, namely, linear programming (Almohamad and Duffuaa 1993), polynomial transform (Almohamad 1991), and eigendecomposition (Umeyama 1988) methods. The same architecture performs very well on inexact graph matching and graph partitioning problems and this is reported along with favorable comparisons to relaxation labeling and softmax, respectively, in Gold and Rangarajan (1996a,b).

4.2 TSP. We have already described the TSP objective in equation 2.1 and listed some of the problems in the original Hopfield-Tank network (Hopfield and Tank 1985) and in its successors. In our work, we begin with the combination of the TSP objective and the constraint energy of equation 2.8. The self-amplification term in equation 2.8 is important for obtaining a permutation matrix and for controlling chaos. As usual, softassign is used to satisfy the doubly stochastic matrix constraints. The resulting clocked objective, with

Q_ai = −β [Σ_{j=1}^{N} d_ij (M_{(a⊕1)j} + M_{(a⊖1)j}) − γ M_ai]

is somewhat akin to the one used in Peterson and Soderberg (1989) with the crucial difference being the use of softassign instead of softmax (and a penalty term) for the doubly stochastic matrix constraints. The resulting algorithm is very easy to implement: iteratively set Q followed by softassign at each temperature.
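Assuming the Q update displayed above (hypothetical names; NumPy assumed), one sweep of the TSP dynamics amounts to computing Q from the current M and then applying softassign:

```python
import numpy as np

def tsp_Q(M, d, beta, gamma):
    """Q_ai = -beta * [ sum_j d_ij (M_{(a+1)j} + M_{(a-1)j}) - gamma * M_ai ],
    with stop indices a+1, a-1 taken modulo N."""
    M_next = np.roll(M, -1, axis=0)  # row a -> row a+1 (mod N)
    M_prev = np.roll(M, 1, axis=0)   # row a -> row a-1 (mod N)
    return -beta * ((M_next + M_prev) @ d.T - gamma * M)

N = 5
rng = np.random.default_rng(4)
pts = rng.random((N, 2))
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
M0 = np.full((N, N), 1.0 / N)  # start from the uniform doubly stochastic matrix
Q = tsp_Q(M0, d, beta=1.0, gamma=0.9)
```

At the uniform starting point every tour position sees the same neighbor distribution, so all rows of Q coincide; annealing breaks this symmetry.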
We ran 2000 100-city TSPs with points uniformly generated in the 2D unit square. The asymptotic expected length of an optimal tour for cities distributed in the unit square is given by L(n) = K√n where n is the number of cities and 0.765 ≤ K ≤ 0.765 + 4/n (Golden and Stewart 1985). This gives us the interval [7.65, 8.05] for the 100-city TSP. A histogram of the tour lengths is displayed in Figure 3. From the histogram, we observe that 98% of the tour lengths fall in the interval [8, 11]. No heuristics were used in either pre- or post-processing. A typical 100-node TSP run takes about 3 min. These results still do not compete with conventional TSP algorithms (and the elastic net) but they provide an improvement over the Hopfield-Tank and Potts glass (softmax) neural network approaches to TSP in terms of constraint satisfaction, number of free parameters, speed of convergence, convergence to valid solutions, and accuracy of solution. More details on the TSP formulation and experiments can be found in Gold and Rangarajan (1996b).
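The interval quoted above follows directly from the Golden and Stewart bound; a quick sketch (hypothetical names):

```python
import math

def optimal_tour_interval(n):
    """L(n) = K * sqrt(n) with 0.765 <= K <= 0.765 + 4/n
    (Golden and Stewart 1985), cities uniform in the unit square."""
    lo = 0.765 * math.sqrt(n)
    hi = (0.765 + 4.0 / n) * math.sqrt(n)
    return lo, hi

lo, hi = optimal_tour_interval(100)  # -> (7.65, 8.05) for the 100-city TSP
```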
4.3 Point Matching. The point matching problem arises in the field of computer vision as pose estimation with unknown correspondence (Mjolsness and Garrett 1990; Gold et al. 1995; Gee et al. 1993). The problem is formulated as the minimization of a distance measure with respect to the unknown spatial transformation relating the two point sets and the unknown point-to-point correspondences:
$$\min_{M, T, t} D(x, y, M, T, t) = \sum_{j=1}^{N} \sum_{k=1}^{n} M_{jk}\left(\|x_j - t - T y_k\|^2 - \alpha\right) + g(T) \eqno(4.1)$$

subject to

$$\sum_{k=1}^{n} M_{jk} \le 1 \;\forall j, \qquad \sum_{j=1}^{N} M_{jk} \le 1 \;\forall k, \qquad \text{and } M_{jk} \in \{0, 1\} \eqno(4.2)$$
where $x$ and $y$ are two 2D or 3D point sets of size $N$ and $n$, respectively, and $(T, t)$ is a set of analog variables (rotation, translation, scale, shear). $g(T)$ is a regularization of the parameters in $T$. $M$ is a binary match matrix indicating the correspondence between points in the two point sets, similar to the matrix $M$ in graph isomorphism, which indicates corresponding nodes. Unlike graph isomorphism, however, the two point sets are of unequal sizes ($N$ and $n$), resulting in outliers: points in either set that have no corresponding points in the other set. The $\alpha$ term ($\alpha > 0$) biases the objective away from null matches. Since the objective is linear in $M$, no algebraic transformations are necessary. With $M$ held fixed we solve for $(T, t)$. With $(T, t)$ held fixed, we employ softassign for the doubly stochastic matrix constraints. The softassign operation is modified slightly to account for the inequality
Novel Optimizing Network Architecture
constraints of equation 4.2. The resulting clocked objective is given as equation 4.3. More details on the point matching formulation and experiments can be found in Gold et al. (1995).

4.4 Clustering with Smart Distances. The point matching objective of equation 4.1 can be used as a distance measure inside an unsupervised learning objective function. The goal is to obtain point set prototypes and the cluster memberships of each input point set. Point sets with identical cluster memberships can differ by unknown spatial transformations and noise. Clustering with smart distances is formulated as
$$\min_{m, y, M, T, t} \sum_{s=1}^{S} \sum_{k=1}^{K} m_{sk}\, D(x^s, y^k, M^{sk}, T^{sk}, t^{sk}), \quad \text{subject to } \sum_{k=1}^{K} m_{sk} = 1, \text{ and } m_{sk} \in \{0, 1\} \eqno(4.4)$$
$D(x^s, y^k, M^{sk}, T^{sk}, t^{sk})$ is the point matching distance measure of equation 4.1. Here $x^s$, $s = 1, \ldots, S$ are the $S$ input point sets, $y^k$, $k = 1, \ldots, K$ are the $K$ prototype point sets (cluster centers), and $m$ is the cluster membership matrix. $M^{sk}$ and $(T^{sk}, t^{sk})$ are the unknown correspondence and spatial transformation parameters between input point set $s$ and prototype point set $k$. The clocked objective function analogous to equation 3.10 and equation 4.3 is
where $\lambda$ is a Lagrange parameter associated with the clustering constraint (equation 4.4). More details on the clustering formulation and experiments can be found in Gold et al. (1996).

5 Conclusions
We have constructed an optimizing network architecture that generates discrete-time neural networks. These networks perform well on a variety of problems in vision, learning, pattern recognition, and combinatorial optimization. In the problems considered, we have repeatedly encountered a set of variables arranged as a matrix with permutation matrix constraints (or minor variations thereof). This is no accident. The softassign is designed to handle just this kind of constraint and is mainly responsible for the speed and accuracy of the resulting optimizing networks. While softassign has been previously used to solve the assignment problem (Kosowsky and Yuille 1994), its effectiveness in parametric
(point matching) and quadratic assignment problems (graph matching,
TSP) has not been demonstrated until now. In point matching, an efficient algorithm is obtained by solving in closed form for the spatial transformation parameters, followed by softassign, at each temperature. Likewise, in quadratic assignment (graph isomorphism and TSP), softassign eliminates the need for penalty functions and for gradient or orthogonal projection methods. Other important elements of the architecture are algebraic transformations and clocked objectives that partition the relaxation into separate phases, reminiscent of the EM algorithm. The remaining elements of the architecture are deterministic annealing and self-amplification, which provide control over convexity and achieve convergence at moderately low temperatures, respectively. The network architecture has been used to construct networks for large-scale (million variable) nonlinear optimization problems [see Gold et al. (1996)]. We believe that this work renews the promise of optimizing neural networks.

Acknowledgments
We acknowledge the help of the reviewers (one in particular) in substantially improving the paper from an earlier version. This work is supported by AFOSR Grant F49620-92-J-0465, ONR/ARPA Grant N00014-92-J-4048, and the Neuroengineering and Neuroscience Center (NNC), Yale University. We have been aided by discussions with Chien-Ping Lu, Suguna Pappu, Manisha Ranade, and Alan Yuille.

References
Aiyer, S. V. B., Niranjan, M., and Fallside, F. 1990. A theoretical investigation into the performance of the Hopfield model. IEEE Trans. Neural Networks 1(2), 204-215.
Almohamad, H. A. 1991. A polynomial transform for matching pairs of weighted graphs. J. Appl. Math. Modelling 15(4).
Almohamad, H. A., and Duffuaa, S. O. 1993. A linear programming approach for the weighted graph matching problem. IEEE Trans. Patt. Anal. Mach. Intell. 15(5), 522-525.
Bridle, J. S. 1990. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 211-217. Morgan Kaufmann, San Mateo, CA.
Elfadel, I. 1995. Convex potentials and their conjugates in analog mean-field optimization. Neural Comp. 7(5), 1079-1104.
Gee, A. H., and Prager, R. W. 1994. Polyhedral combinatorics and neural networks. Neural Comp. 6(1), 161-180.
Gee, A. H., Aiyer, S. V. B., and Prager, R. W. 1993. An analytical framework for optimizing neural networks. Neural Networks 6(1), 79-97.
Novel Optimizing Network Architecture
1059
Geiger, D., and Yuille, A. L. 1991. A common framework for image segmentation. Int. J. Computer Vision 6(3), 227-243.
Gold, S., and Rangarajan, A. 1996a. A graduated assignment algorithm for graph matching. IEEE Trans. Patt. Anal. Mach. Intell. (in press).
Gold, S., and Rangarajan, A. 1996b. Softmax to softassign: Neural network algorithms for combinatorial optimization. J. Artificial Neural Networks, special issue on neural networks for optimization (in press).
Gold, S., Lu, C. P., Rangarajan, A., Pappu, S., and Mjolsness, E. 1995. New algorithms for 2-D and 3-D point matching: Pose estimation and correspondence. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. Touretzky, and J. Alspector, eds., pp. 957-964. MIT Press, Cambridge, MA.
Gold, S., Rangarajan, A., and Mjolsness, E. 1996. Learning with preknowledge: Clustering with point- and graph-matching distance measures. Neural Comp. 8(4), 787-804.
Golden, B. L., and Stewart, W. R. 1985. Empirical analysis of heuristics, Chap. 7. In The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization, E. L. Lawler, J. K. Lenstra, A. H. G. R. Kan, and D. B. Shmoys, eds., pp. 207-249. John Wiley, New York.
Hopfield, J. J., and Tank, D. 1985. "Neural" computation of decisions in optimization problems. Biol. Cybern. 52, 141-152.
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Comp. 6(2), 181-214.
Kamgar-Parsi, B., and Kamgar-Parsi, B. 1990. On problem solving with Hopfield networks. Biol. Cybern. 62, 415-423.
Kanter, I., and Sompolinsky, H. 1987. Graph optimisation problems and the Potts glass. J. Phys. A 20, L673-L677.
Koch, C., Marroquin, J., and Yuille, A. L. 1986. Analog "neuronal" networks in early vision. Proc. Natl. Acad. Sci. U.S.A. 83, 4263-4267.
Kosowsky, J. J., and Yuille, A. L. 1994. The invisible hand algorithm: Solving the assignment problem with statistical physics. Neural Networks 7(3), 477-490.
Luenberger, D. 1984.
Linear and Nonlinear Programming. Addison-Wesley, Reading, MA.
Mjolsness, E. 1987. Control of attention in neural networks. In IEEE International Conference on Neural Networks (ICNN), Vol. 2, pp. 567-574. IEEE Press, New York.
Mjolsness, E., and Garrett, C. 1990. Algebraic transformations of objective functions. Neural Networks 3, 651-669.
Mjolsness, E., and Miranker, W. 1993. Greedy Lagrangians for neural networks. Tech. Rep. YALEU/DCS/TR-945, Department of Computer Science, Yale University.
Mjolsness, E., Gindi, G., and Anandan, P. 1989. Optimization in model matching and perceptual organization. Neural Comp. 1(2), 218-229.
Peterson, C., and Soderberg, B. 1989. A new method for mapping optimization problems onto neural networks. Inter. J. Neural Syst. 1(1), 3-22.
Rangarajan, A., and Mjolsness, E. 1994. A Lagrangian relaxation network for
graph matching. In IEEE International Conference on Neural Networks (ICNN), Vol. 7, pp. 4629-4634. IEEE Press, New York.
Simic, P. D. 1990. Statistical mechanics as the underlying theory of 'elastic' and 'neural' optimisations. Network 1, 89-103.
Simic, P. D. 1991. Constrained nets for graph matching and other quadratic assignment problems. Neural Comp. 3, 268-281.
Sinkhorn, R. 1964. A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Statist. 35, 876-879.
Strang, G. 1986. Introduction to Applied Mathematics, p. 688. Wellesley-Cambridge Press, Wellesley, MA.
Umeyama, S. 1988. An eigendecomposition approach to weighted graph matching problems. IEEE Trans. Patt. Anal. Mach. Intell. 10(5), 695-703.
Van den Bout, D. E., and Miller, T. K. 1989. Improving the performance of the Hopfield-Tank neural network through normalization and annealing. Biol. Cybern. 62, 129-139.
Van den Bout, D. E., and Miller, T. K. 1990. Graph partitioning using annealed neural networks. IEEE Trans. Neural Networks 1(2), 192-203.
von der Malsburg, C. 1988. Pattern recognition by labeled graph matching. Neural Networks 1, 141-148.
von der Malsburg, C. 1990. Network self-organization. In An Introduction to Neural and Electronic Networks, S. F. Zornetzer, J. L. Davis, and C. Lau, eds., pp. 421-432. Academic Press, San Diego, CA.
Waugh, F. R., and Westervelt, R. M. 1993. Analog neural networks with local competition. I. Dynamics and stability. Phys. Rev. E 47(6), 4524-4536.
Wilson, G. V., and Pawley, G. S. 1988. On the stability of the traveling salesman problem algorithm of Hopfield and Tank. Biol. Cybern. 58, 63-70.
Wolfe, W. J., Parry, M. H., and MacMillan, J. M. 1994. Hopfield-style neural networks and the TSP. In IEEE International Conference on Neural Networks (ICNN), Vol. 7, pp. 4577-4582. IEEE Press, New York.
Yuille, A. L., and Kosowsky, J. J. 1994.
Statistical physics algorithms that converge. Neural Comp. 6(3), 341-356.
Received August 17, 1994; accepted November 2, 1995.
Communicated by Todd Leen
Online Steepest Descent Yields Weights with Nonnormal Limiting Distribution

Sayandev Mukherjee
Terrence L. Fine
School of Electrical Engineering, Cornell University, Ithaca, NY 14853 USA

We study the asymptotic properties of the sequence of iterates of weight-vector estimates obtained by training a feedforward neural network with a basic gradient-descent method using a fixed learning rate and no batch-processing. Earlier results based on stochastic approximation techniques (Kuan and Hornik 1991; Finnoff 1993; Bucklew et al. 1993) have established the existence of a gaussian limiting distribution for the weights, but they apply only in the limiting case of a zero learning rate. We here prove, from an exact analysis of the one-dimensional case and constant learning rate, weak convergence to a distribution that is not gaussian in general. We also run simulations to compare and contrast the results of our analysis with those of stochastic approximation.

1 Introduction
The wide applicability of neural networks to problems in pattern classification and signal processing has been due to the development of efficient and powerful gradient-descent algorithms for the supervised training of feedforward neural networks with differentiable node functions. A basic version uses a fixed learning rate and updates all weights after each training input is presented (stochastic or online mode) rather than after the entire training set has been presented (batch mode). We shall make the customary (Kuan and Hornik 1991; Finnoff 1993; Bucklew et al. 1993) assumption of an infinite i.i.d. training set $\mathcal{T} = \{(X_n, Y_n), n \ge 1\} \subset R^d \times R^p$. Consider the training of a neural network [with output $\eta(x, w)$ for input $x$ and weight vector $w$] by means of the stochastic BPA, with a fixed learning rate $\mu$. Then, writing $\{W_n^\mu\}$ for the sequence of estimates to emphasize the dependence on $\mu$, the algorithm updates as

$$W_{n+1}^\mu = W_n^\mu - \mu \nabla_w E(X_{n+1}, Y_{n+1}, W_n^\mu),$$

where $E(x, y, w) = |y - \eta(x, w)|^2/2$ is the error on a single training pair.
Neural Computation 8, 1075-1084 (1996) © 1996 Massachusetts Institute of Technology
The properties of this algorithm as exhibited by the sequence of iterates are not yet well understood. Stochastic approximation techniques (Bucklew et al. 1993; Finnoff 1993; Kuan and Hornik 1991; White 1989) study the limiting behavior of the stochastic process that is the piecewise-constant or piecewise-linear interpolation of the sequence of weight-vector iterates (assuming infinitely many i.i.d. training inputs) in the limit of zero learning rate, and it can be shown (Bucklew et al. 1993; Finnoff 1993) that the fluctuation between the paths and their limit, suitably normalized, tends to a gaussian diffusion process. However, for a fixed nonzero learning rate, these methods do not even tell us whether the sequence of iterates converges in distribution, let alone the form of this limiting distribution, though it is often claimed that the same result holds if the learning rate is small. This paper considers a fixed nonzero learning rate and studies the sequence of weight-vector iterates as a discrete-time, continuous state-space Markov process. An exact analysis (in Section 2) using the Aperiodic Ergodic Theorem (Meyn and Tweedie 1993) for the positive recurrence of a Markov process shows that in the simplest case of a single sigmoidal node with one parameter, trained using the stochastic form of the basic gradient-descent training algorithm, the sequence of iterates of the parameter converges in distribution to one that is in general nongaussian, thereby qualifying the oft-stated claims in the literature (see, e.g., Bucklew et al. 1993; White 1989). We wish to emphasize that though our result on the existence of a nongaussian limiting distribution has been proved for the simplest case of a single sigmoidal node, it implies that in general, even if the limiting distribution exists, it is nongaussian.
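The online update studied here can be sketched concretely. The following minimal illustration follows the one-node data model of Section 2.2 (a sigmoid node, inputs uniform on $\{-1, +1\}$, additive gaussian noise, $w^0 = 3$); the step counts and seed are our own choices, and this is not the authors' simulation code:

```python
import math
import random

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def online_sgd(w0=3.0, mu=0.1, sigma2=0.1, n_steps=10000, seed=0):
    """Online gradient descent on a one-parameter sigmoid node eta(x, w) =
    sigmoid(w * x), data model Y = sigmoid(w0 * X) + Z with X uniform on
    {-1, +1} and Z ~ N(0, sigma2). Fixed learning rate mu, one update per
    sample. Returns the sequence of weight iterates."""
    rng = random.Random(seed)
    w = 0.0
    iterates = []
    for _ in range(n_steps):
        x = rng.choice((-1.0, 1.0))
        y = sigmoid(w0 * x) + rng.gauss(0.0, math.sqrt(sigma2))
        pred = sigmoid(w * x)
        # gradient of (y - eta(x, w))^2 / 2 with respect to w
        grad = -(y - pred) * pred * (1.0 - pred) * x
        w -= mu * grad
        iterates.append(w)
    return iterates
```

Because $\mu$ is fixed, the iterates keep fluctuating around $w^0$ rather than converging pointwise; it is the distribution of these fluctuations that the paper characterizes.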
In the final section, we give the results of simulations to experimentally test our conclusions and study the extent of the discrepancy between them and those of stochastic approximation.

2 Application of the Aperiodic Ergodic Theorem

2.1 Introduction. The approaches of stochastic approximation yield differential equations for the interpolated process obtained from the weights in the limit of a zero learning rate. The assumption required for these results to be applicable in practice is that the learning rate be small enough for the differential equation to be a good approximation to the actual behavior [see Kuan and Hornik (1991) for details]. However, this is only possible if the ODE has a unique, globally attractive equilibrium, which is usually not the case. Thus the large-time asymptotics of an algorithm and its corresponding ODE could be very different indeed if the learning rate is constant but nonzero. An alternative approach is to keep $\mu$ fixed, and deal with the sequence of updates $W_n$ directly, instead of with the interpolated process. Note that
for an i.i.d. sequence $\{(X_n, Y_n)\}$, this defines a time-homogeneous Markov chain. This Markov chain is discrete in time but has a continuous state space ($R^m$, where $m$ is the dimension of the weight vector). For a fixed $\mu$, the limiting behavior of the algorithm as $n \to \infty$ is determined by the corresponding properties of the Markov chain. The advantage of this approach is that we are no longer restricted to the limiting case of zero $\mu$. However, the problem is analytically difficult because of the complicated theory of Markov chains of this kind. In Appendix A, we give a very brief and incomplete summary of important definitions and theorems used by us here. Meyn and Tweedie (1993) contains an exhaustive account.

2.2 Analysis of a Single Node Network: Existence of a Limiting Distribution. A single sigmoidal node $\eta(x, w) = \eta(w^T x) = 1/[1 + e^{-w^T x}]$ is trained by the stochastic version of the BPA. The infinite sequence of i.i.d. training samples $\{(X_n, Y_n)\}$ is assumed to come from the following simple model:
$$Y_n = \eta(w^0 X_n) + Z_n, \qquad X_n \text{ i.i.d., } P\{X_n = 1\} = \tfrac{1}{2} = P\{X_n = -1\}, \qquad Z_n \text{ i.i.d. } N(0, \sigma^2)$$
and $Z_n$ independent of $X_n$ for each $n = 1, 2, \ldots$. A single node net is optimal for this data model. The stochastic BP weight update formula becomes

$$\tilde{W}_{n+1} = \tilde{W}_n - \mathrm{sgn}(\tilde{W}_n)\,\mu c_n \Delta_n + \mu c_n \tilde{Z}_{n+1}, \qquad n = 0, 1, \ldots \eqno(2.1)$$
where $\tilde{Z}_{n+1} = Z_{n+1} X_{n+1}$ has the same distribution as $Z_n$, viz. $N(0, \sigma^2)$, $\tilde{W}_n = W_n - w^0$, $c_n = \eta'(W_n)$, $\Delta_n = |\eta(W_n) - \eta(w^0)|$, $n = 0, 1, \ldots$, and we use the fact that $\eta(-w) = 1 - \eta(w)$, $\eta'(-w) = \eta'(w)$, and $X = +1$ or $-1$. The central result of this paper is

Theorem 1. The sequence of iterates $\{W_n\}$ of the online gradient descent training algorithm (equation 2.1) converges in distribution as $n \to \infty$.
The proof of this result uses the Aperiodic Ergodic Theorem from the theory of Markov chains and is given in Appendix B.

Remark. The same procedure could be used to establish convergence in distribution of the iterates $\{W_n\}$ in the general case, with inputs $x \in R^d$ and $\mathcal{W} = R^m$. The catch is the intractability of proving the drift condition (equation A.1) for some suitable choice of $g$ (see Proposition A.4) in this general case. Alternative criteria (e.g., Pflug 1986; Kushner and Huang 1981) cannot be used either, since they rely upon the assumption that
where $f(w) = \nabla_w E|\eta(X, w) - Y|^2/2$ and $w^0$ is the (assumed unique) minimum of the true error surface $E|\eta(X, w) - Y|^2/2$; this assumption does not even hold in the simple one-dimensional case $\eta(x, w) = 1/[1 + \exp(-wx)]$ considered above. An exact proof of convergence in distribution in the general case using the theory of Markov chains therefore remains to be established. Convergence can be shown, however, if one employs a certain linearization approximation (see Mukherjee and Fine 1995).

2.2.1 The Nonnormality of the Limiting Distribution. An attempt has been made by Orr and Leen (1993; Orr 1995) to determine the limiting distribution analytically by means of the Chapman-Kolmogorov equation. Their approach involves obtaining a discrete-time Kramers-Moyal expansion for the density $p_{\tilde{W}_n}(w)$ from the Chapman-Kolmogorov equation and finding a perturbative solution of this Kramers-Moyal expansion for the equilibrium (i.e., limiting) distribution. For the LMS algorithm [i.e., assuming a single linear neuron, $\eta(x, w) = w^T x$], they are able to show that the limiting distribution is not gaussian, but approaches a gaussian as $\mu \to 0$. This supports our assertion. However, it is analytically impossible to either establish the existence of a limiting distribution or calculate the form of this distribution for the general case where the learning algorithm is not LMS, using this approach. Thus it would seem that for the case of a nonlinear neuron that we consider here, an approach based on analogy to physics by means of the Chapman-Kolmogorov equation is not successful. We shall, therefore, not attempt to solve the Chapman-Kolmogorov equation directly for the limiting density (which we have proved to exist for the problem under consideration), but will try to gain information about it. Note that the transition density is
$$p(w' \mid w) = \frac{1}{\sqrt{2\pi}\,\mu c \sigma} \exp\left(-\frac{\left[w' - w + \mathrm{sgn}(w)\,\mu c \Delta\right]^2}{2 \mu^2 c^2 \sigma^2}\right) \eqno(2.2)$$

with $c = \eta'(w + w^0)$ and $\Delta = |\eta(w + w^0) - \eta(w^0)|$ evaluated at the current state $w$.
The Chapman-Kolmogorov equation is

$$p_{\tilde{W}_{n+1}}(w) = \int p(w \mid w')\, p_{\tilde{W}_n}(w')\, dw'$$

Thus if the limiting distribution of $\{\tilde{W}_n\}$ has density $p_{\tilde{W}}(w)$, it has to satisfy the equation

$$p_{\tilde{W}}(w) = \int p(w \mid w')\, p_{\tilde{W}}(w')\, dw' \eqno(2.3)$$
That the solution of equation 2.3 cannot be gaussian may be deduced
from either of the following arguments:

1. The form of equation 2.2 shows that no gaussian density can satisfy equation 2.3.
2. Substituting for the form of the transition density, and evaluating $E\tilde{W} = \int_{-\infty}^{\infty} w\, p_{\tilde{W}}(w)\, dw$, yields the result that if we make the reasonable assumption that $E\tilde{W} = 0$ (as asserted by stochastic approximation methods), then we should have

$$0 = \int_{-\infty}^{\infty} \eta'(w' + w^0)\left[\eta(w' + w^0) - \eta(w^0)\right] p_{\tilde{W}}(w')\, dw' \eqno(2.4)$$
It is clear that equation 2.4 does not hold if $p_{\tilde{W}}$ is even. Therefore the limiting distribution cannot be a zero-mean gaussian, as asserted by stochastic approximation analyses.

2.3 Reconciling Nonnormality and Stochastic Approximation. The results of stochastic approximation analysis give a gaussian distribution for $W$ in the limit as $\mu \to 0$ (Bucklew et al. 1993; Finnoff 1993). On the other hand, our results establish that the gaussian distribution result is not valid for nonzero $\mu$ in general. However, the two conclusions may be reconciled if the limiting distribution for small nonzero $\mu$, though nongaussian, is nevertheless very close to gaussian. This is supported by the fact that the first four moments of a suitably normalized random variable with the limiting distribution approach those of a gaussian as $\mu \to 0$ (Mukherjee and Fine 1995). To test this idea, simulations were done on a simple one-dimensional training problem for 8 cases: $\mu = 0.1, 0.2, 0.3, 0.5$ and $\sigma^2 = 0.1, 0.5$ for each value of $\mu$, with $w^0$ fixed at 3. For each of the 8 cases, either 5 or 10 runs were made, with lengths (for the given values of $\mu$) of 810,000, 500,000, 300,000, and 200,000, respectively. Each run gave a pair of sequences $\{W_n\}$ obtained by starting off at $W_0 = 0$ and training the network independently twice. Each resulting sequence $\{W_n\}$ was then downsampled at a large enough rate that the true autocorrelation of the downsampled sequence was less than 0.05, followed by deleting the first 10% of the samples of this downsampled sequence, so as to remove any dependence on initial conditions that might persist. This was done to ensure that the elements of the resulting downsampled sequences could be assumed independent for the various hypothesis tests that were to follow.
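The comparison machinery used below (downsample, then a two-sample Kolmogorov-Smirnov test) can be sketched as follows. This is our own minimal implementation, not the authors' code; the constant 1.358 is the standard asymptotic two-sample KS critical value at level 0.05, and downsampling is shown only schematically (e.g., `w[::200]`):

```python
import math

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the two empirical distribution functions."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def same_distribution(a, b, crit_coeff=1.358):
    """Accept the null (same distribution) at level ~0.95 using the
    asymptotic critical value c(alpha) * sqrt((n + m) / (n * m))."""
    n, m = len(a), len(b)
    return ks_statistic(a, b) <= crit_coeff * math.sqrt((n + m) / (n * m))
```

Applied to the two independently trained, downsampled sequences of a run, `same_distribution` corresponds to the level-0.95 KS comparison in item 1 below.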
1. For each run of each case, the empirical distribution functions of the two downsampled sequences thus generated were compared by means of the Kolmogorov-Smirnov test (Bickel and Doksum 1977) at level 0.95, with the null hypothesis being that both sequences had the same actual cumulative distribution function (assumed continuous). This test was passed with ease on all trials, thereby supporting the claim that a limiting distribution existed and was attained by such a training algorithm.

2. For each run of each case, a skewness test and a kurtosis test (both based on moments; see, e.g., Bickel and Doksum 1977) were done at level 0.95 to test for normality. Curiously, the sequences generated failed both tests for the $(\mu, \sigma)$ pair (0.1, 0.1) but passed them both for the pairs (0.1, 0.5), (0.3, 0.1), (0.5, 0.1), and (0.5, 0.5). For the pair (0.2, 0.5), the skewness test was passed and the kurtosis test failed, and for the pairs (0.2, 0.1) and (0.3, 0.5), the skewness test was failed and the kurtosis test passed.

3. All trials cleared the Kolmogorov tests (Bickel and Doksum 1977) for normality at level 0.95, both when the normal distribution was taken to have the sample mean and variance (computed on the downsampled sequence), and when the normal distribution function had the asymptotic values of mean (zero) and variance ($\mu\sigma^2/2$) given by stochastic approximation analyses.

Hence we may conclude:

1. $\{W_n\}$ converges in distribution.
2. For small values of $\mu$, the deviation from normality is so small that the gaussian distribution may be taken as a good approximation to the limiting distribution.
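The moment-based skewness and kurtosis tests referred to in item 2 can be sketched as below. This is a generic illustration of such tests (sample skewness and excess kurtosis compared against their approximate null standard errors $\sqrt{6/n}$ and $\sqrt{24/n}$ under normality), not a reconstruction of the authors' exact procedure:

```python
import math

def skewness_kurtosis(xs):
    """Sample skewness g1 and excess kurtosis g2, computed from central
    moments. Under normality, g1 and g2 have approximate null standard
    errors sqrt(6/n) and sqrt(24/n) for large n."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    g1 = m3 / m2 ** 1.5
    g2 = m4 / m2 ** 2 - 3.0
    return g1, g2

def passes_normality(xs, z=1.96):
    """Two-sided moment tests at level ~0.95: both |g1| and |g2| must lie
    within z null standard errors of zero."""
    n = len(xs)
    g1, g2 = skewness_kurtosis(xs)
    return abs(g1) <= z * math.sqrt(6.0 / n) and abs(g2) <= z * math.sqrt(24.0 / n)
```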
In other words, though stochastic approximation analysis states that $p_{\tilde{W}}$ is gaussian only in the limit of vanishing $\mu$ (see, e.g., Bucklew et al. 1993), our simulation shows that this is a good approximation for small values of $\mu$ as well.

Appendix A: Markov Chains - Some Definitions and Theorems

Let $W = \{W_n\}_{n=0}^{\infty}$ be a Markov chain taking values in the state space $\mathcal{W} = R^m$, and let $\mathcal{B}$ denote the Borel $\sigma$-algebra on $\mathcal{W}$. We denote the $n$-step transition probability $P\{W_n \in A \mid W_0 = w\}$ by $P^n(w, A)$, for all $A \in \mathcal{B}$, and will use $P_w$ for the probability conditioned on $W_0 = w$. The first time of return to $A \in \mathcal{B}$ is defined by $\tau_A = \min\{n \ge 1: W_n \in A\}$, and $L(w, A) = P_w(\tau_A < \infty)$.
Definition A.1 (Meyn and Tweedie 1993, p. 87). $W = \{W_n\}$ is called $\phi$-irreducible if there exists a measure $\phi$ on $\mathcal{B}$ such that

$$(\forall A \in \mathcal{B})\left(\phi(A) > 0 \Rightarrow (\forall w \in \mathcal{W})\, L(w, A) > 0\right)$$
Proposition A.1 (Meyn and Tweedie 1993, Prop. 4.2.2). If $W$ is $\phi$-irreducible for some measure $\phi$, then there exists a probability measure $\psi$ on $\mathcal{B}$ such that

1. $W$ is $\psi$-irreducible;
2. for any other measure $\phi'$, the chain $W$ is $\phi'$-irreducible if and only if $\psi \succ \phi'$ (i.e., $\psi$ dominates $\phi'$);
3. if $\psi(A) = 0$, then $\psi\{w: L(w, A) > 0\} = 0$;
4. the probability measure $\psi$ is equivalent to

$$\psi'(A) = \int_{\mathcal{W}} \phi'(dw) \sum_{n=0}^{\infty} P^n(w, A)\, 2^{-(n+1)}$$

for any finite irreducibility measure $\phi'$. $\psi$ is called a maximal irreducibility measure for $W$, and we will simply say $W$ is $\psi$-irreducible. We also denote by $\mathcal{B}^+$ the sets $A$ with $\psi(A) > 0$.
Definition A.2 (Meyn and Tweedie 1993, p. 106). A set $C \in \mathcal{B}$ is called a small set if there exists an $m > 0$ and a nontrivial measure $\nu_m$ on $\mathcal{B}$, such that for all $w \in C$, $B \in \mathcal{B}$, $P^m(w, B) \ge \nu_m(B)$. When this holds, we say that $C$ is $\nu_m$-small.

Definition A.3 (Meyn and Tweedie 1993, p. 118). Suppose that $W$ is a $\psi$-irreducible Markov chain. When there exists a $\nu_1$-small set $A$ with $\nu_1(A) > 0$, then the chain is called strongly aperiodic.

Definition A.4 (Meyn and Tweedie 1993, p. 121). Let $a = \{a(n)\}$ be a distribution, or probability measure, on $\{0, 1, \ldots\}$. We will call a set $C \in \mathcal{B}$ $\nu_a$-petite if the sampled chain $W_a$, given by the probability transition kernel
$$K_a(w, A) = \sum_{n=0}^{\infty} P^n(w, A)\, a(n), \qquad w \in \mathcal{W},\; A \in \mathcal{B}$$
satisfies $(\forall w \in C)(\forall B \in \mathcal{B})\, K_a(w, B) \ge \nu_a(B)$, where $\nu_a$ is a nontrivial measure on $\mathcal{B}$.

Proposition A.2 (Meyn and Tweedie 1993, Prop. 5.5.3). If $C \in \mathcal{B}$ is $\nu_m$-small, then $C$ is $\nu_{\delta_m}$-petite.

Definition A.5 (Meyn and Tweedie 1993, p. 200). The set $A \in \mathcal{B}$ is called Harris recurrent if for all $w \in A$, $P_w\{W \in A \text{ i.o.}\} = 1$. A chain $W$ is called Harris (recurrent) if it is $\psi$-irreducible and every set in $\mathcal{B}^+$ is Harris recurrent.

Definition A.6 (Meyn and Tweedie 1993, p. 229). A $\sigma$-finite measure $\pi$ on $\mathcal{B}$ with the property that for all $A \in \mathcal{B}$, $\pi(A) = \int_{\mathcal{W}} \pi(dw)\, P(w, A)$, will be called invariant.

Proposition A.3 (Meyn and Tweedie 1993, p. 230). If $W$ is Harris recurrent, then it admits a unique (up to constant multiples) invariant measure $\pi$.

Definition A.7. A measure $\mu$ is called regular if for any set $E \in \mathcal{B}$,

$$\mu(E) = \inf\{\mu(O): E \subseteq O,\, O \text{ open}\} = \sup\{\mu(C): C \subseteq E,\, C \text{ compact}\}$$
Definition A.8 (Meyn and Tweedie 1993, p. 311). If $\mu$ is a signed measure
on $\mathcal{B}$, then the total variation norm $\|\mu\|$ is defined as

$$\|\mu\| = \sup_{A \in \mathcal{B}} \mu(A) - \inf_{A \in \mathcal{B}} \mu(A)$$
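As a concrete illustration of invariant measures and total variation convergence, consider a toy two-state chain (our own example, not from the paper; the code uses the common half-sum convention for the distance between probability vectors, which is half the Meyn-Tweedie sup-minus-inf norm of the difference):

```python
def tv_distance(p, q):
    """Total variation distance between two discrete distributions:
    sup_A |p(A) - q(A)| = (1/2) * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def n_step(P, p0, n):
    """n-step distribution of a finite Markov chain: p0 * P^n."""
    p = list(p0)
    for _ in range(n):
        p = [sum(p[i] * P[i][j] for i in range(len(p))) for j in range(len(P[0]))]
    return p

# Toy aperiodic, irreducible two-state chain; its unique invariant
# distribution is pi = (0.6, 0.4) for these transition probabilities.
P = [[0.8, 0.2],
     [0.3, 0.7]]
pi = [0.6, 0.4]
```

Starting from any initial distribution, the $n$-step distribution converges to $\pi$ in total variation, which is the finite-state analog of the conclusion of the Aperiodic Ergodic Theorem below.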
The main theorem we wish to use to establish convergence in distribution of the iterates $\{W_n\}$ will be the following:
Proposition A.4 [Aperiodic Ergodic Theorem (Meyn and Tweedie 1993, Thm. 13.0.1)]. Suppose that $W$ is an aperiodic Harris recurrent chain, with invariant measure $\pi$. The following are equivalent:

1. The unique invariant measure $\pi$ is finite (rather than merely $\sigma$-finite), so that it may be normalized to a stationary probability measure.
2. There exist some petite set $A$, some $b < \infty$, and a nonnegative function $g$ finite at some $w_0 \in \mathcal{W} = R^m$, satisfying

$$\Delta g(w) = \int P(w, dv)\, g(v) - g(w) \le -1 + b\,1_A(w), \qquad w \in \mathcal{W}. \eqno(A.1)$$

Either of these conditions is equivalent to the existence of a unique invariant probability measure $\pi$ such that for every initial condition $w \in \mathcal{W}$,

$$\lim_{n \to \infty} \|P^n(w, \cdot) - \pi\| = 2 \lim_{n \to \infty} \sup_{B \in \mathcal{B}} |P^n(w, B) - \pi(B)| = 0$$

and moreover the analogous convergence holds for any regular initial distributions $\lambda$, $\mu$.
Since $\mathcal{W} = R^m$ is topological, it is also the case that total variation convergence implies weak convergence of the measures in question.

Appendix B: Proof of Theorem 1
From equation 2.1, it is clear that the sequence $\{\tilde{W}_n,\, n = 1, 2, \ldots\}$ is a Markov chain with one-step transition law given by $P(\tilde{w}, A) = P\{\tilde{W}_{n+1} \in A \mid \tilde{W}_n = \tilde{w}\}$, which is gaussian (since $\tilde{Z}_{n+1}$ is the only random variable left) and continuous in $\tilde{w}$, since $\eta$ and $\eta'$ are continuous in $\tilde{w}$. Since $P(\tilde{w}, A) > 0$ for any $A$ for which $\lambda(A) > 0$, $\{\tilde{W}_n\}$ is $\lambda$-irreducible, where $\lambda$ is Lebesgue measure on $R$.
Online Steepest Descent

Proposition B.1. For any $\delta > 0$, the set $A = [-\delta, \delta]$ is $\nu_1$-small, i.e., there exists a nontrivial measure $\nu_1$ on $\mathcal{B}$ such that for all $w \in A$ and $B \in \mathcal{B}$, $P(w, B) \ge \nu_1(B)$. Therefore $\{W_n\}$ is strongly aperiodic.

Proof. For any $w$, we may write $P(w, dv) = f_w(v)\,dv$, where $f_w(v)$ is gaussian. In the present situation, $c_\delta = \inf_{|w| \le \delta} \inf_{|v| \le \delta} f_w(v) > 0$, and $A$ is $\nu_1$-small, where $\nu_1(B) = c_\delta \int_B \mathbf{1}_A(v)\,dv$. $\square$

Note that for any $h(\cdot) \in C(\mathcal{W})$, the set of continuous real-valued functions on the reals,
$$\int_{\mathcal{W}} P(w, dv)\, h(v) = \int_{\mathcal{W}} f_w(v)\, h(v)\, dv \in C(\mathcal{W})$$
so $\{W_n\}$ is a Feller chain, as per the definition in Meyn and Tweedie (1993, p. 128). Let $\psi$ be the maximal irreducibility measure for $\{W_n\}$. Since $\{W_n\}$ is $\lambda$-irreducible, it is clear from the maximality of $\psi$ that $\psi \succ \lambda$, and so $\psi$ must have a support with non-empty interior. Then all compact (i.e., closed and bounded) subsets of $\mathcal{W} = \mathbb{R}$ are petite (Meyn and Tweedie 1993, Prop. 6.2.8). In particular, $A = [-\delta, \delta]$ is petite, but this also follows directly from the fact that $A$ is small. Let $g: \mathbb{R} \to \mathbb{R}_+$ be defined by $v \mapsto \exp(|v|)$. Then, for any $n < \infty$, the sublevel set $C_g(n) = \{v: g(v) \le n\} = [-\log n, \log n]$ is petite, as it is a compact subset of $\mathbb{R}$. From the definition in Meyn and Tweedie (1993, p. 191), this means that $g(\cdot) = \exp(|\cdot|)$ is unbounded off petite sets for $\{W_n\}$. We next show

Proposition B.2 (Mukherjee 1994). There exist a $\delta > 0$ and some $b > 0$ such that, with $g(\cdot) = \exp(|\cdot|)$ as above, the drift condition
$$\Delta g(v) = \int_{u \in \mathcal{W}} P(v, du)\, g(u) - g(v) \le -1 + b\,\mathbf{1}_A(v) \eqno(\mathrm{B.1})$$
is satisfied for the petite set $A = [-\delta, \delta]$.
Proof. Note that this is equivalent to establishing that there exist some $\delta > 0$, $\epsilon > 0$, $Q < \infty$ such that
$$\Delta g(v) \le -\epsilon\, g(v) \quad \text{for all } |v| > \delta \eqno(\mathrm{B.2})$$
$$\int_{u \in \mathcal{W}} P(v, du)\, g(u) \le Q \quad \text{for all } |v| \le \delta \eqno(\mathrm{B.3})$$
where $g(\cdot) = \exp(|\cdot|)$. The proof of equation B.2 is by straightforward but tedious calculation, and details may be found in Mukherjee (1994). Equation B.3 follows directly, because for all $v \in [-\delta, \delta]$,
$$\int_{u \in \mathcal{W}} P(v, du)\, g(u) = \int_{-\infty}^{\infty} \exp\left(|v - \operatorname{sgn}(v)\,\mu c\,\Delta + \sigma z|\right) p_Z(z)\, dz$$
which is bounded for given $\delta$, $\mu$, $\sigma$, since $Z$ is gaussian. $\square$
The above pieces are combined to prove

Proposition B.3. $\{W_n\}$ is Harris recurrent.

Proof. Apply Meyn and Tweedie (1993, Thm. 9.1.8) to the $\psi$-irreducible chain $\{W_n\}$, petite set $A = [-\delta, \delta]$, and function $g(\cdot) = \exp(|\cdot|)$, which is unbounded off petite sets and such that equation B.2 holds. $\square$

We have shown that $\{W_n\}$ is strongly aperiodic and Harris recurrent, and also established the drift condition equation B.1. The Aperiodic Ergodic Theorem then gives us Theorem 1.
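For intuition, a drift condition of the form B.1 can be checked numerically on a simple stand-in chain. The sketch below is our own illustration: the linear gaussian recursion $W_{n+1} = (1-\mu)W_n + \sigma Z_{n+1}$ and the values of $\mu$ and $\sigma$ are hypothetical and not taken from the paper's update equation 2.1 (which is not reproduced in this excerpt). It Monte-Carlo-estimates $\Delta g(v)$ with $g(v) = \exp(|v|)$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 0.5          # illustrative contraction rate and noise level
g = lambda v: np.exp(np.abs(v))

def drift(v, n=200_000):
    """Monte Carlo estimate of Delta g(v) = E[g(W') | W = v] - g(v)."""
    w_next = (1 - mu) * v + sigma * rng.standard_normal(n)
    return g(w_next).mean() - g(v)

for v in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(f"v = {v:5.1f}   drift ~ {drift(v):9.2f}")
```

With these (hypothetical) parameters the estimated drift is strongly negative away from the origin and merely bounded near it, which is exactly the shape required by equation B.1.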
Acknowledgments
We thank the referees for their valuable suggestions. This work was supported in part by the National Science Foundation.

References
Bickel, P., and Doksum, K. 1977. Mathematical Statistics: Basic Ideas and Selected Topics. Holden-Day, San Francisco, CA.

Bucklew, J. A., Kurtz, T. G., and Sethares, W. A. 1993. Weak convergence and local stability properties of fixed step size recursive algorithms. IEEE Trans. Inform. Theory 39, 966-978.

Finnoff, W. 1993. Diffusion approximations for the constant learning rate backpropagation algorithm and resistance to local minima. In Advances in Neural Information Processing Systems 5, C. L. Giles, S. J. Hanson, and J. D. Cowan, eds., pp. 459-466. Morgan Kaufmann, San Mateo, CA. (Later published in Neural Comp. 6(2), 285-295, 1994.)

Kuan, C.-M., and Hornik, K. 1991. Convergence of learning algorithms with constant learning rates. IEEE Trans. Neural Networks 2, 484-488.

Kushner, H. J., and Huang, H. 1981. Asymptotic properties of stochastic approximation with constant coefficients. SIAM J. Control Optimization 19, 87-105.

Meyn, S. P., and Tweedie, R. L. 1993. Markov Chains and Stochastic Stability. Springer-Verlag, Berlin.

Mukherjee, S. 1994. Asymptotics of gradient-based neural network training algorithms. Master's thesis, Cornell University, Ithaca, NY.

Mukherjee, S., and Fine, T. L. 1995. Asymptotics of gradient-based neural network training algorithms. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and T. K. Leen, eds., pp. 335-342. MIT Press, Cambridge, MA.

Orr, G. B. 1995. Dynamics and algorithms for stochastic search. Ph.D. dissertation, Oregon Graduate Institute of Science and Technology, Portland, OR.

Orr, G. B., and Leen, T. K. 1993. Probability densities in stochastic learning: II. Transients and basin hopping times. In Advances in Neural Information Processing Systems 5, C. L. Giles, S. J. Hanson, and J. D. Cowan, eds., pp. 507-514. Morgan Kaufmann, San Mateo, CA.

Pflug, G. Ch. 1986. Stochastic minimization with constant step size: Asymptotic laws. SIAM J. Control Optimization 24, 655-666.

White, H. 1989. Some asymptotic results for learning in single hidden-layer feedforward network models. J. Am. Statist. Assoc. 84, 1003-1013.
Received July 27, 1995; accepted November 22, 1995.
Communicated by John Wyatt and Eric Mjolsness
Gradient Projection Network: Analog Solver for Linearly Constrained Nonlinear Programming Kiichi Urahama Department of Visual Communication Design, Kyushu Institute of Design, Shiobaru, Fukuoka, 815 Japan
An analog approach is presented for solving nonlinear programming problems with linear constraint conditions. The present method is based on a transformation of variables with exponential functions, which enables every trajectory to pass through the interior of the feasible region along a gradient direction projected onto the feasible space. Convergence of its trajectory to the solution of optimization problems is guaranteed, and it is shown that the present scheme is an extension of the affine scaling method for linear programming to nonlinear programs under a slight modification of the Riemannian metric. An analog electronic circuit is also presented that implements the proposed scheme in real time.

1 Introduction

The Hopfield network (Hopfield and Tank 1985) is an analog system useful for solving mathematical programming problems, including linear programs and combinatorial optimization problems, in real time. The penalty function method employed in the Hopfield network enables it to easily treat complex constraint conditions. This ease of design facilitates its wide application to various mathematical programming problems. Penalty parameters are, however, extremely hard to set so as to ensure that obtained solutions satisfy the constraint conditions strictly. This difficulty in parameter setting is attributed to the poor conditioning of the penalty method, which has been well known to mathematical programming practitioners. In the mathematical programming field, better-conditioned methods, e.g., the successive quadratic programming scheme based on the Lagrange multiplier technique and the Newton method, are widely used instead of the poorly conditioned penalty method. These highly effective algorithms, however, require involved digital computation, and their analog implementation is very hard. A distinctive feature of the Hopfield network is its implementability in analog hardware, which can produce a solution in real time.
Thus it is possible to develop a highly reliable analog optimization method if we employ a solution scheme implementable with analog hardware as an alternative to the

Neural Computation 8, 1061-1073 (1996) © 1996 Massachusetts Institute of Technology
penalty method used in the Hopfield network. Many analog algorithms have been presented, from a classical one by Arrow et al. (1958) to the neural scheme of Platt and Barr (1987), who developed the differential multiplier method. Another candidate is the barrier function method, which is a kind of interior point method that always gives legal solutions, in contrast to the penalty method, which belongs to the class of exterior point methods. Every trajectory of an interior point method always stays within the feasible region, hence the final solution is guaranteed to be legal. On the contrary, trajectories of exterior point methods can pass through the exterior of the feasible region and sometimes arrive at an illegal solution. The recent success in the application of the barrier function method to linear programming by Karmarkar (1984) is an epoch-making demonstration of the effectiveness of interior point methods. Our concern in the present paper lies in the implementability of interior point methods with analog hardware. Some special neural algorithms classified as interior point methods have been presented for problems where the variables are constrained to be probability vectors, i.e., each element is restricted between zero and one and the elements sum to one; this is called winner-take-all behavior in the neural network field (Peterson and Soderberg 1989; Urahama and Ueno 1993; Urahama 1994a). This paper addresses more general nonlinear programming problems with linear constraints, including the winner-take-all as a special case. An analog interior point method is presented for solving these problems, and an analog electronic circuit is devised for executing the present scheme in real time. The present solution system, called a gradient projection network, is an extension of the algorithm for assignment problems presented by Yuille and Kosowsky (1994) to general nonlinear programming problems.

2 Transformation of Variables with Exponential Functions
Let us consider nonlinear programming problems formulated by
$$\min f(x) \quad \text{subj. to } Ax = b,\ x \ge 0 \eqno(2.1)$$
where the inequality for the vector $x$ represents elementwise inequalities. This class of problems described by equation 2.1 includes linear programs and continuous relaxation problems of 0-1 integer programs. Instead of solving equation 2.1 directly, let us consider a scheme to obtain a solution of equation 2.1 in the limit $\alpha \downarrow 0$ of the problem
$$\min f(x) + \alpha \sum_{i=1}^{n} x_i (\ln x_i - 1) \quad \text{subj. to } Ax = b,\ x \ge 0 \eqno(2.2)$$
where the additional entropy term is familiar in neural network fields, the objective function $f$ is called an energy, and the augmented objective function in equation 2.2 is called a free energy. The entropic term is necessary for the stable functioning of the analog circuit, which will be presented at the end of this section, and it enables us to incorporate deterministic annealing procedures. Note that equation 2.2 differs from barrier functions, where the coefficient of the entropy term is nonpositive, while $\alpha$ in equation 2.2 is nonnegative. Let us now remove the last nonnegativity constraint in equation 2.2 and deal with the reduced problem with only the equality constraint $Ax = b$. It will be shown later that this reduction is possible because every trajectory of the devised analog system preserves the nonnegativity of $x$. A solution of this reduced problem is a weak saddle point of the Lagrange function
$$L = f + \alpha \sum_{i=1}^{n} x_i (\ln x_i - 1) + \lambda^T (Ax - b) \eqno(2.3)$$
where the weak saddle means that the function takes its minimum with respect to $x$ and is flat with respect to the Lagrange multipliers $\lambda = [\lambda_1, \ldots, \lambda_m]$ at that point. This saddle point is a solution of the following two equations:
$$\frac{\partial f}{\partial x_i} + \alpha \ln x_i + (A^T \lambda)_i = 0 \eqno(2.4)$$
$$Ax = b \eqno(2.5)$$
Let us transform the variables $(x, \lambda)$ to $(y, \lambda)$ through the introduction of a new variable
$$y_i = \ln x_i + \frac{(A^T \lambda)_i}{\alpha} \eqno(2.6)$$
by which equation 2.4 is simplified into the form
$$\alpha y_i + \frac{\partial f}{\partial x_i} = 0 \eqno(2.7)$$
The inverse of the transformation equation 2.6 reads
$$x_i = \exp\left(y_i - \frac{(A^T \lambda)_i}{\alpha}\right) \eqno(2.8)$$
Assuming the differential equation we will give for $y(t)$ does not exhibit finite escape time, equation 2.8 ensures nonnegativity of $x$. Thus the nonnegativity condition in equation 2.2 is recovered from the reduced problem without it. A solution of equation 2.7 can be obtained as a stable equilibrium state of the differential equation
$$\dot{y}_i = -\alpha y_i - \frac{\partial f}{\partial x_i} \eqno(2.9)$$
where $\dot{y}$ denotes $dy/dt$. The equality (equation 2.5) must be satisfied at any instant in the time course of equation 2.9, including the initial condition. This can be attained by instantaneous adjustment of $\lambda$ at every time on the trajectory of equation 2.9. Thus the algebraic-differential equations 2.5, 2.8, and 2.9 constitute an analog solution system for the problem (equation 2.2) that gives a solution of equation 2.1 at sufficiently small $\alpha$. Every trajectory of this algebraic-differential system passes through the interior of the feasible region and eventually arrives at a locally optimal solution of equation 2.2. Let us prove this convergence theoretically by showing that the Lagrange function (equation 2.3) is a Liapunov function for the dynamic system (equation 2.9) with equations 2.5 and 2.8. This proof utilizes the following observation:

Lemma 1. The inequality $\dot{y}^T \dot{x} \ge 0$ always holds along any trajectory of equation 2.9 with equations 2.5 and 2.8.

Proof. Differentiation of equation 2.8 with time gives
$$\dot{x} = D(\dot{y} - A^T \dot{\lambda}/\alpha) \eqno(2.10)$$
where $D = \mathrm{diag}(x_1, \ldots, x_n)$. Substituting this expression into the differentiation of equation 2.5 with time,
$$A\dot{x} = 0, \eqno(2.11)$$
we get the expression for $\dot{\lambda}$ as
$$\dot{\lambda} = \alpha (ADA^T)^{-1} AD \dot{y} \eqno(2.12)$$
Substitution of equation 2.12 into equation 2.10 yields
$$\dot{x} = D[I - A^T (ADA^T)^{-1} AD] \dot{y} \eqno(2.13)$$
from which we get
$$\dot{y}^T \dot{x} = \dot{y}^T D[I - A^T (ADA^T)^{-1} AD] \dot{y} = (D^{1/2} \dot{y})^T [I - B^T (BB^T)^{-1} B] D^{1/2} \dot{y} \eqno(2.14)$$
where $B = AD^{1/2}$. The matrix $P = I - B^T (BB^T)^{-1} B$ in equation 2.14 is a projection matrix, hence $P = P^T P$, whose substitution into equation 2.14 leads to
$$\dot{y}^T \dot{x} = (P D^{1/2} \dot{y})^T (P D^{1/2} \dot{y}) \ge 0 \eqno(2.15)$$
This concludes the proof. $\square$
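The projection argument in Lemma 1 is easy to verify numerically. The sketch below is our own illustration (A, D, and $\dot{y}$ are randomly generated, not from the paper): it checks that $P = I - B^T(BB^T)^{-1}B$ is a symmetric idempotent and that $\dot{y}^T\dot{x}$ computed from equation 2.13 equals $\|P D^{1/2}\dot{y}\|^2 \ge 0$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 7, 3
A = rng.standard_normal((m, n))
D = np.diag(rng.random(n) + 0.1)                    # D = diag(x) with x > 0
Dh = np.sqrt(D)                                     # D^{1/2} (elementwise on the diagonal)
B = A @ Dh
P = np.eye(n) - B.T @ np.linalg.solve(B @ B.T, B)   # projection onto ker(B)

ydot = rng.standard_normal(n)
# equation 2.13: xdot = D [I - A^T (A D A^T)^{-1} A D] ydot
xdot = D @ (np.eye(n) - A.T @ np.linalg.solve(A @ D @ A.T, A @ D)) @ ydot

print("P symmetric and idempotent:", np.allclose(P, P.T) and np.allclose(P @ P, P))
print("ydot^T xdot >= 0          :", ydot @ xdot >= 0)
print("equals |P D^{1/2} ydot|^2 :", np.allclose(ydot @ xdot, np.linalg.norm(P @ Dh @ ydot) ** 2))
```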
Equation 2.11 is verified again from equation 2.13 as
$$A\dot{x} = AD[I - A^T (ADA^T)^{-1} AD] \dot{y} = [AD - AD] \dot{y} = 0 \eqno(2.16)$$
The following main theorem can be derived from this lemma:
Theorem 1. The Lagrange function $L$ in equation 2.3 decreases monotonically with time along an arbitrary trajectory of equation 2.9 with equations 2.5 and 2.8.

Proof. Equation 2.9 can be rewritten in the simple form
$$\dot{y} = -\frac{\partial L}{\partial x} \eqno(2.17)$$
and equation 2.5 reads
$$\frac{\partial L}{\partial \lambda} = 0 \eqno(2.18)$$
Substituting these expressions into the time derivative of the Lagrange function $L$,
$$\dot{L} = \left(\frac{\partial L}{\partial x}\right)^T \dot{x} + \left(\frac{\partial L}{\partial \lambda}\right)^T \dot{\lambda} \eqno(2.19)$$
we obtain
$$\dot{L} = -\dot{y}^T \dot{x} \eqno(2.20)$$
from which, together with Lemma 1, it is concluded that $\dot{L} \le 0$. $\square$

When $\alpha = 0$, equation 2.9 reduces to $\dot{y}_i = -\partial f/\partial x_i$ and then equation 2.13 reads
$$\dot{x} = -D[I - A^T (ADA^T)^{-1} AD] \nabla f \eqno(2.21)$$
where $\nabla f$ is the gradient vector $[\partial f/\partial x_1, \ldots, \partial f/\partial x_n]$. This equation (2.21) is an analog solution system for the original problem (equation 2.1), for which Theorem 1 reduces to

Corollary 1. When $\alpha = 0$, the objective function $f$ decreases monotonically with time along any trajectory of equation 2.21.

Proof. This corollary follows immediately from equation 2.21, since the matrix $D[I - A^T (ADA^T)^{-1} AD]$ is positive semidefinite, as shown in equation 2.14. $\square$

This corollary ensures that every trajectory of equation 2.21 converges to a locally optimal solution of equation 2.1. Various experiments show, unfortunately, that the local optimum obtained by direct solution of equation 2.1 is not as good, with rather large values of the objective function, for highly complex problems such as NP-hard combinatorial optimization problems. For such hard problems a deterministic annealing procedure is effective for obtaining a good approximate solution. Deterministic annealing refers to a process of continuous tracking of the solution of equation 2.2 with a gradual decrease of the value of $\alpha$ to zero, where a solution of equation 2.1 is obtained. The following statement describes a fundamental property of this procedure:
The solution $x(\alpha)$ of equation 2.2 satisfies the stationarity condition
$$\frac{\partial L}{\partial x} = 0 \eqno(2.22)$$
whose differentiation with respect to $\alpha$ gives
$$\frac{\partial^2 L}{\partial x^2} \frac{dx}{d\alpha} + \ln x + A^T \frac{d\lambda}{d\alpha} = 0 \eqno(2.23)$$
from which we get
$$\frac{dx}{d\alpha} = -\left(\frac{\partial^2 L}{\partial x^2}\right)^{-1} \left(\ln x + A^T \frac{d\lambda}{d\alpha}\right) \eqno(2.24)$$
From equation 2.22 we obtain $\ln x = -(\partial f/\partial x + A^T \lambda)/\alpha$, whose substitution into equation 2.24 yields
$$\frac{dx}{d\alpha} = H \left[\frac{1}{\alpha}\left(\frac{\partial f}{\partial x} + A^T \lambda\right) - A^T \frac{d\lambda}{d\alpha}\right] \eqno(2.25)$$
where
$$H = \left(\frac{\partial^2 L}{\partial x^2}\right)^{-1} \eqno(2.26)$$
Since $A\,dx/d\alpha = 0$, which is derived from equation 2.5, the following equation holds:
$$\frac{d\lambda}{d\alpha} = \frac{1}{\alpha} (AHA^T)^{-1} AH \left(\frac{\partial f}{\partial x} + A^T \lambda\right) \eqno(2.27)$$
Combining equations 2.25 and 2.27, we get the final expression
$$\frac{dx}{d\alpha} = \frac{1}{\alpha}\, H [I - A^T (AHA^T)^{-1} AH] \left(\frac{\partial f}{\partial x} + A^T \lambda\right) \eqno(2.28)$$
The matrix $\partial^2 L/\partial x^2$ is always positive semidefinite because the solution $x$ is a local minimum of the Lagrange function $L$ at any value of $\alpha$. Thus it is concluded that $df/d\alpha \ge 0$ along the annealing path: the energy $f$ of the tracked solution does not increase as $\alpha$ is lowered toward zero.
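For the simplest case, a linear objective on the probability simplex, the annealed problem 2.2 has a closed-form solution, so the limit $\alpha \downarrow 0$ can be inspected directly. The sketch below is our own toy example (the cost vector c is arbitrary); it shows the tracked solution collapsing onto the minimizing vertex as $\alpha$ decreases:

```python
import numpy as np

c = np.array([0.3, 0.1, 0.5])   # illustrative linear objective: min c @ x on the simplex

def anneal_solution(alpha):
    """Closed-form minimizer of c@x + alpha*sum(x*(log x - 1)) s.t. sum(x)=1, x>=0.
    Stationarity c_i + alpha*log(x_i) + lam = 0 gives a softmax in -c/alpha."""
    z = np.exp(-c / alpha)
    return z / z.sum()

for alpha in [1.0, 0.1, 0.01]:
    x = anneal_solution(alpha)
    print(alpha, np.round(x, 4), "f =", round(float(c @ x), 4))
```

As $\alpha$ decreases the softmax sharpens; at $\alpha = 0.01$ the solution is numerically the vertex selecting the smallest cost entry, with objective value 0.1.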
Figure 1: Circuit for solving equation 2.2.
The circuit in Figure 1 contains voltage summation/subtraction circuits that output the voltage $y_i - \sum_{j=1}^{m} a_{ji} \lambda_j$ computed from the input voltages $\lambda_1, \ldots, \lambda_m, y_i$, as shown in Figure 2. This circuit block can easily be devised using only one operational amplifier. The variable $x_i$ is given by the drain current of MOS transistors, which varies exponentially with the gate voltage of the transistors in the subthreshold range. Its functional form coincides with the expression of equation 2.8. These drain currents are then multiplied by $a_{ji}$ and injected into the central horizontal lines through current mirrors. The voltage of the $j$th common line corresponds to $\lambda_j$, and Kirchhoff's current conservation law at this line reads $\sum_{i=1}^{n} a_{ji} x_i = b_j$, which coincides with equation 2.5. Thus the voltage $\lambda_j$ of this line is always instantaneously adjusted to the value ensuring this equality. Equation 2.9 holds for the voltage $y_i$ of the grounded capacitors. The resistance of the grounded resistors in parallel with the capacitors is $1/\alpha$, i.e., their conductance is $\alpha$. The current-controlled current sources, denoted by rhombuses with an arrow, whose current is $\partial f/\partial x_i$, reduce to constant current sources if the objective function $f$ is linear, i.e., for a linear program, and can be implemented with current mirrors when the function $f$ is quadratic, as reported previously (Urahama and Ueno 1993). Their implementation becomes hard if the order of $f$ is greater than two, but fortunately its order is at most two for almost all problems. Thus this circuit is shown to execute the present solution method.
Figure 2: Voltage summation/subtraction circuit computing $y_i - \sum_{j=1}^{m} a_{ji} \lambda_j$.

Let us now examine the stability of this circuit. If we take into account the parasitic capacitors inevitably produced in fabrication of the circuit, the equation for the voltage of the horizontal lines changes from the algebraic (i.e., DC) equation 2.5 into the differential equation
$$c \dot{\lambda}_j = \sum_{i=1}^{n} a_{ji} x_i - b_j \eqno(2.29)$$
where $c$ is the total parasitic capacitance attached to the line. The value of $c$ is much smaller than that of the grounded capacitors in Figure 1, hence equations 2.9 and 2.29 constitute a singular perturbation system. The relaxation time for $\lambda$ in equation 2.29 is negligibly short compared with that for $y$ in equation 2.9. The voltage $y$ is therefore almost constant through a time course of equation 2.29. Thus equation 2.29 is the gradient system
$$c \dot{\lambda}_j = \frac{\partial L}{\partial \lambda_j} \eqno(2.30)$$
which ensures stable convergence of $\lambda$ to an equilibrium state where the right-hand side of equation 2.29 becomes zero, i.e., equation 2.5 holds. Note the difference in sign between equation 2.17 and equation 2.30. This discrepancy comes from the property that the solution is a saddle point of the Lagrange function $L$: a minimum with respect to $x$ and a weak maximum with respect to $\lambda$. Therefore equation 2.17 is a gradient descent system while equation 2.30 is a gradient ascent system, and these dynamics lead to correct convergence to the saddle point. Thus the stable and correct functioning of the circuit in Figure 1 is verified. Some experimental results for this circuit have already been reported for a winner-take-all problem (Urahama and Ueno 1993) and linear assignment problems (Urahama 1994a,b). These reports, however, dealt with only small-scale (3 × 3) problems. Here we simulated assignment problems of larger size, up to 30 × 30. Computational time was proportional to $n^2$, with $n$ the number of variables, for linear cases, while it was proportional to $n^3$ for quadratic assignment problems.
Let us now note the role of the resistors in Figure 1, which implement the first term on the right-hand side of equation 2.9. This term comes from the second term of the objective function in equation 2.2, which was extraneously added to equation 2.1. Equation 2.1 is recovered if we set $\alpha = 0$, which corresponds to infinite resistance, i.e., the removal of the resistors in Figure 1; but then some voltages $y_i$ grow without bound and eventually break down the correct functioning of the circuit. These resistors are necessary to keep the voltages at finite values to prevent breakdown of the circuit. The resistance must be large enough to keep the deviation of the solution from the true solution of equation 2.1 sufficiently small. This variation in the solution induced by the addition of resistors has also been examined for assignment problems (Urahama 1994b). On the other hand, if we vary the value of the resistances gradually, we can perform deterministic annealing in hardware, which gives a good approximate solution, as stated in Theorem 2. The effectiveness of the annealing has been exemplified for some NP-hard combinatorial optimization problems (Urahama 1992). These roles of the resistors also hold for the Hopfield network, whose circuit contains the same stabilizing resistors, and the annealing is effective for obtaining good solutions.

3 Interior Point Method
It is obvious that the present solution system belongs to the class of interior point methods. Various interior point methods for linear programming originating from Karmarkar's algorithm can be derived from the discretization of a differential equation with a variable timestep integration formula. Many interesting differential geometric properties have been investigated for the solution path of interior point methods on the basis of this differential equation (Bayer and Lagarias 1989; Karmarkar 1990). This section is devoted to a similar derivation of equation 2.21. Let us first derive the projected gradient $\nabla f|_P$, which is the vector obtained by projecting the gradient $\nabla f$ onto the feasible space $P = \{x \mid Ax = 0\}$ in a Riemannian space with the metric $ds^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} g_{ij}\, dx_i\, dx_j$ defined by a symmetric matrix $G = [g_{ij}]$. The target vector $\nabla f|_P$ is the vector $v$ that maximizes $(\nabla f)^T v$ among vectors on the ellipse $v^T G v = \epsilon^2$ in the feasible space $Av = 0$. This vector is given by the following nonlinear program:
$$\min -(\nabla f)^T v \quad \text{subj. to } v^T G v = \epsilon^2,\ Av = 0$$
whose solution is the saddle point of the Lagrange function
$$L = -(\nabla f)^T v + \lambda (v^T G v - \epsilon^2) + \mu^T A v$$
and satisfies the following three equations:
$$-\nabla f + 2\lambda G v + A^T \mu = 0 \eqno(3.3)$$
$$v^T G v - \epsilon^2 = 0 \eqno(3.4)$$
$$Av = 0 \eqno(3.5)$$
Equation 3.3 is solved as
$$v = G^{-1}(\nabla f - A^T \mu)/2\lambda \eqno(3.6)$$
Substituting this expression into equation 3.5 and solving for $\mu$, we get
$$\mu = (AG^{-1}A^T)^{-1} A G^{-1} \nabla f \eqno(3.7)$$
whose substitution into equation 3.6 yields the desired expression for the projected gradient vector
$$v = G^{-1}[I - A^T (AG^{-1}A^T)^{-1} A G^{-1}] \nabla f \eqno(3.8)$$
where $2\lambda$ in the denominator is neglected because only the direction of the vector is relevant. It can be proved in a manner similar to Lemma 1 that $(\nabla f)^T v \ge 0$, which states that the vector in equation 3.8 is directed along an upward slope of the function $f$. Hence we can arrive at a local minimum by proceeding in the direction opposite to $v$. Expression 3.8 is derived for general Riemannian spaces; if we set the matrix $G$ to the particular diagonal form $G = \mathrm{diag}(x_1^{-1}, \ldots, x_n^{-1}) = D^{-1}$, then equation 3.8 coincides with the right-hand side of equation 2.21. Thus we know the vector field (equation 2.21) is a projected gradient in a special Riemannian space with the metric defined by the diagonal matrix $D$. Furthermore, if the function $f$ is linear, $f = c^T x$, hence $\nabla f = c$, and $G = \mathrm{diag}(x_1^{-2}, \ldots, x_n^{-2})$, then this vector field coincides with that of the affine scaling interior point method for linear programming (Bayer and Lagarias 1989). This observation leads us to the conclusion that the present method (equation 2.21) is an extension of the affine scaling interior point method to nonlinear programs with a slight modification of the Riemannian metric.

4 Transformation of Variables with Logistic Functions
This section addresses 0-1 integer programs, which can formulate almost all combinatorial optimization problems:
$$\min f(x) \quad \text{subj. to } Ax = b,\ x_i \in \{0, 1\} \eqno(4.1)$$
whose continuous relaxation problem reads
$$\min f(x) \quad \text{subj. to } Ax = b,\ x_i \in [0, 1] \eqno(4.2)$$
The Hopfield network solves equation 4.2 with the aid of the penalty method, which is one of the exterior point methods. Here we are concerned with an interior point method as an alternative to those exterior point schemes. Although equation 4.2 can be solved indirectly by using the method presented in the previous sections after the reduction of equation 4.2 to the form of equation 2.1 by introducing slack variables $\bar{x}_i$ as
$$\min f(x) \quad \text{subj. to } Ax = b,\ x_i + \bar{x}_i = 1 \eqno(4.3)$$
we devise here a direct solution method for equation 4.2 without the introduction of such slack variables. The first step is, as it was for the problem (equation 2.1), the addition of an entropic term to the objective function $f(x)$:
$$\min f(x) + \alpha \sum_{i=1}^{n} [x_i \ln x_i + (1 - x_i) \ln(1 - x_i)] \quad \text{subj. to } Ax = b,\ x_i \in [0, 1] \eqno(4.4)$$
whose solution is a saddle point of the Lagrange function
$$L = f + \alpha \sum_{i=1}^{n} [x_i \ln x_i + (1 - x_i) \ln(1 - x_i)] + \lambda^T (Ax - b) \eqno(4.5)$$
The solution satisfies the two equations
$$\frac{\partial f}{\partial x_i} + \alpha \ln \frac{x_i}{1 - x_i} + (A^T \lambda)_i = 0 \eqno(4.6)$$
$$Ax = b \eqno(4.7)$$
Introduction of a new variable defined by
$$y_i = \ln \frac{x_i}{1 - x_i} + \frac{(A^T \lambda)_i}{\alpha} \eqno(4.8)$$
simplifies equation 4.6 into
$$\alpha y_i + \frac{\partial f}{\partial x_i} = 0 \eqno(4.9)$$
The inverse of the transformation equation 4.8 reads
Figure 3: Circuit for equation 4.10.
$$x_i = \frac{1}{1 + \exp[-y_i + (A^T \lambda)_i / \alpha]} \eqno(4.10)$$
called the logistic function, which ensures that $x_i$ always lies between 0 and 1. Equation 4.9 is the same as equation 2.7, hence it can be solved with equation 2.9. Equation 2.13 is also derived from equations 4.7 and 4.10 for this case, where the matrix $D$ is $\mathrm{diag}[x_1(1-x_1), \ldots, x_n(1-x_n)]$ instead of the $\mathrm{diag}(x_1, \ldots, x_n)$ of the problem (equation 2.2). Thus all of the mathematical descriptions in the previous sections also apply to the present scheme, with only the replacement of the matrix $D$ by $\mathrm{diag}[x_1(1-x_1), \ldots, x_n(1-x_n)]$. The solution of the problem (equation 4.4) can be obtained with the circuit in Figure 1 by replacing the grounded MOS transistors with the two transistors connected as shown in Figure 3; the remaining parts are unchanged. This modification realizes the conversion from equation 2.8 to equation 4.10. This analog circuit is a gradient projection network for the problem (equation 4.2). The logistic function (equation 4.10) is familiar in the neural network field, and its simplest form has been used in the Hopfield network. The present network adds to the Hopfield network the ability to handle linear constraints.

5 Conclusion

An analog solution method has been presented for nonlinear programming problems with linear constraints, and this method has been implemented with current-mode MOS circuits. Convergence of the present scheme has been guaranteed theoretically, and this method has been shown to be an extension of the affine scaling interior point method for linear programs to nonlinear programs under a slight modification of
the Riemannian metric. Analysis of the differential geometric properties of the solution path in Riemannian spaces is under study.

References

Arrow, K. J., Hurwicz, L., and Uzawa, H. 1958. Studies in Linear and Nonlinear Programming. Stanford University Press, Stanford, CA.

Bayer, D. A., and Lagarias, J. C. 1989. The nonlinear geometry of linear programming I, II. Trans. Am. Math. Soc. 314, 499-580.

Hopfield, J., and Tank, D. 1985. Neural computations of decisions in optimization problems. Biol. Cybern. 52, 141-152.

Karmarkar, N. 1984. A new polynomial-time algorithm for linear programming. Combinatorica 4, 373-395.

Karmarkar, N. 1990. Riemannian geometry underlying interior-point methods for linear programming. Contemp. Math. 114, 51-75.

Peterson, C., and Soderberg, B. 1989. A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst. 1, 3-22.

Platt, J. C., and Barr, A. H. 1987. Constrained differential optimization. In Neural Information Processing Systems, D. Z. Anderson, ed., pp. 612-621. American Institute of Physics, New York.

Urahama, K. 1992. Deterministic annealing in neural networks for combinatorial optimization. Proc. Int. Symp. Neural Inf. Process. 94-97.

Urahama, K. 1994a. Analog method for solving combinatorial optimization problems. IEICE Trans. Fundamentals E77-A, 302-308.

Urahama, K. 1994b. Analog circuit for solving assignment problems. IEEE Trans. Circuits Syst. I 41, 426-429.

Urahama, K., and Ueno, S. 1993. A gradient system solution to Potts mean field equations and its electronic implementation. Int. J. Neural Syst. 4, 27-34.

Yuille, A. L., and Kosowsky, J. J. 1994. Statistical physics algorithms that converge. Neural Comp. 6, 341-356.
Received November 7, 1994; accepted December 19, 1995
Communicated by Pekka Orponen
A Numerical Study on Learning Curves in Stochastic Multilayer Feedforward Networks

K.-R. Müller*
Department of Mathematical Engineering and Inf. Physics, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan

M. Finke
Institut für Logik, University of Karlsruhe, 76228 Karlsruhe, Germany

N. Murata
Department of Mathematical Engineering and Inf. Physics, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan

K. Schulten
Beckman Institute, University of Illinois, 405 North Mathews Ave., Urbana, IL, USA

S. Amari
Department of Mathematical Engineering and Inf. Physics, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan
Lab. f. Inf. Representation, RIKEN, Wako-shi, Saitama 351-02, Japan

The universal asymptotic scaling laws proposed by Amari et al. are studied in large-scale simulations using a CM5. Small stochastic multilayer feedforward networks trained with backpropagation are investigated. In the range of a large number of training patterns $t$, the asymptotic generalization error scales as $1/t$ as predicted. For a medium range of $t$ a faster $1/t^2$ scaling is observed. This effect is explained by using higher order corrections of the likelihood expansion. It is shown for small $t$ that the scaling law changes drastically when the network undergoes a transition from strong overfitting to effective learning.

1 Introduction
Recently a growing interest in learning curves, i.e., scaling laws for the asymptotic behavior of the learning and generalization ability of neural networks, has emerged (Amari and Murata 1994; Barkai et al. 1992; Baum and Haussler 1989; Haussler et al. 1994; Murata et al. 1993; Opper et al. 1990; Opper and Kinzel 1995; Saad and Solla 1995a,b; Seung et al. 1992; Sompolinsky et al. 1990). Clearly, as soon as learning is applied, we observe the characteristics and the performance of the learning algorithms

*Permanent address: GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany.
Neural Computation 8, 1085-1106 (1996) © 1996 Massachusetts Institute of Technology
in terms of generalization and training error. Therefore, it is important to study the bounds on how fast we can learn as a function of the number of parameters in general. The large-scale simulations presented in this paper address the question of scaling laws for training and generalization errors in small multilayer feedforward networks with up to 256 parameters, trained on a finite number of up to 32,768 training patterns. We address the teacher-student situation, i.e., given a teacher network, a student network of the same architecture learns from the examples generated by the teacher. So far a number of groups have used statistical mechanics and the replica trick to find the scaling properties of the generalization ability, first for simple perceptron systems, and recently for tree-like networks with hidden units (for reviews see Heskes and Kappen 1991; Opper and Kinzel 1995; Saad and Solla 1995a,b; Seung et al. 1992; Watkin et al. 1993). A further approach for estimating asymptotic learning curves is the computational one, where the VC dimension is used to measure the complexity of a given problem (Baum and Haussler 1989; Haussler et al. 1994; Opper and Haussler 1991). We would like to adopt the viewpoint of information geometry, which provides an alternative method for estimating the asymptotic behavior of learning based on an asymptotic expansion of the likelihood of the estimating machines, always assuming a maximum likelihood estimator (Amari and Murata 1993; Murata et al. 1994). In this paper we examine whether the well-known universal asymptotic scaling laws found by Amari et al. can be observed in a simulation of a finite continuous network and a finite number of continuous training patterns. According to this theory the scaling law

    ε_g = H_0 + m/2t    (1.1)
holds for general stochastic machines with smooth outputs (Amari and Murata 1993; Murata et al. 1993). The quantity ε_g denotes the averaged likelihood (generalization ability), m is the number of parameters of the model (biases + weights), and t is the number of training examples presented to the network. Emphasis is placed on evaluating whether these asymptotic results have an impact on the practical user of neural networks. Also the question of where the asymptotics start is addressed. A further point of interest is to obtain insights into the dynamics of the hidden units during the learning process. In our simulations we use standard multilayer continuous feedforward networks, trained with backpropagation and a conjugate gradient descent on the Kullback-Leibler divergence. The next section describes the model investigated. The technical details of our simulations are given in Section 4, and higher order corrections to equation 1.1 are presented in Section 5. Section 6 discusses the numerical scaling results, and finally a conclusion is given.
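As a concrete illustration, equation 1.1 and the parameter count used later for the 8-8-4 network, m = (N+1)H + (H+1)M, can be evaluated directly. This is a minimal sketch (the function names and the choice H_0 = 0 are ours, not from the paper):

```python
# Sketch: evaluate the asymptotic scaling law e_g = H_0 + m/(2t) (equation 1.1)
# for the 8-8-4 example network. H0 denotes the irreducible error term.

def num_params(N, H, M):
    """Number of free parameters m = (N+1)H + (H+1)M (weights plus biases)."""
    return (N + 1) * H + (H + 1) * M

def asymptotic_error(H0, m, t):
    """Predicted generalization error for t training examples (equation 1.1)."""
    return H0 + m / (2.0 * t)

m = num_params(8, 8, 4)          # 108 parameters for the 8-8-4 net
for t in (1000, 2000, 4000):
    print(t, asymptotic_error(H0=0.0, m=m, t=t))
```

With H_0 = 0 the excess error halves each time t doubles, which is exactly the 1/t signature that Section 6 looks for in the asymptotic range.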
2 The Model
We use standard feedforward classifier networks with N inputs, H sigmoid hidden units, and M softmax outputs (classes). The output activity O_l of the lth output unit is calculated via the softmax squashing function

    O_l = exp(h_l) / Σ_k exp(h_k),    (2.1)

where h_l = Σ_j w^O_{lj} s_j + θ^O_l is the local field potential. Each output O_l codes the a posteriori probability of an input pattern being in class C_l; O_0 denotes a zero class for normalization purposes. The m network parameters consist of biases θ = (θ^H, θ^O) and weights w = (w^H, w^O). When x = (x_1, ..., x_N) is input, the activity s = (s_1, ..., s_H) is computed as

    s_i = 1 / (1 + exp(-Σ_j w^H_{ij} x_j - θ^H_i)).    (2.2)
The input layer is connected to the hidden layer via w^H, the hidden layer is connected to the output layer via w^O, but no short-cut connections are present. The network approximates the probability distribution of the outputs (Finke and Müller 1994). Therefore, each randomly generated teacher w_T represents by construction a multinomial probability distribution q(C_l | x, w_T) = Prob{x ∈ C_l} over the classes C_l (l = 1 ... M) given a random input x. We use the same network architecture for teacher and student. Thus, we assume that the model is faithful, i.e., the teacher distribution can be exactly represented by a student: q(C_l | x) = p(C_l | x, w_T). A training and test set of the form S = {(x^p, c^p) | p = 1 ... t} is generated randomly by drawing samples of x from a uniform distribution and forward propagating x^p through the teacher network. Then, according to the teacher's outputs q(C_l | x^p), one output unit is set to one stochastically and all others are set to zero, leading to the target vector y^p = (0, ..., 1, ..., 0). A student network w then tries to approximate the teacher given the example set S. For training the student network w we use a backpropagation algorithm with conjugate gradient descent to minimize our objective function, the Kullback-Leibler divergence

    E = ∫ dx q(x) Σ_l q(C_l | x) ln [ q(C_l | x) / p(C_l | x, w) ].    (2.3)
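The teacher-student data generation described above can be sketched in a few lines. This is a hedged toy illustration (random small weights, no zero class, and variable names of our choosing, not the paper's code):

```python
import math
import random

random.seed(0)

def forward(x, wH, thH, wO, thO):
    """Sigmoid hidden layer (cf. equation 2.2) followed by softmax outputs (cf. equation 2.1)."""
    s = [1.0 / (1.0 + math.exp(-(sum(wij * xj for wij, xj in zip(wi, x)) + th)))
         for wi, th in zip(wH, thH)]
    h = [sum(wlj * sj for wlj, sj in zip(wl, s)) + th for wl, th in zip(wO, thO)]
    z = [math.exp(hl) for hl in h]
    total = sum(z)
    return [zl / total for zl in z]          # a posteriori probabilities q(C_l | x)

def sample_target(probs):
    """Set one output to 1 stochastically according to the teacher's probabilities."""
    r, acc = random.random(), 0.0
    for l, p in enumerate(probs):
        acc += p
        if r < acc:
            return l
    return len(probs) - 1

# Tiny teacher (N=2, H=2, M=2) with normally distributed weights, as in step 1 of Section 4.
N, H, M = 2, 2, 2
wH  = [[random.gauss(0, 1) for _ in range(N)] for _ in range(H)]
thH = [random.gauss(0, 1) for _ in range(H)]
wO  = [[random.gauss(0, 1) for _ in range(H)] for _ in range(M)]
thO = [random.gauss(0, 1) for _ in range(M)]

data = []
for _ in range(5):                            # t = 5 training patterns
    x = [random.uniform(-1, 1) for _ in range(N)]
    data.append((x, sample_target(forward(x, wH, thH, wO, thO))))

# Empirical negative log likelihood of a student (here: the teacher itself), cf. equation 2.4.
E_t = -sum(math.log(forward(x, wH, thH, wO, thO)[c]) for x, c in data) / len(data)
print(E_t)
```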
Here q(C_l | x) denotes the class conditionals with respect to the outputs of the teacher and p(C_l | x, w) are the class posteriors as approximated by
the student network. The Kullback-Leibler divergence is the natural objective function to measure the degree of coincidence of the teacher and student distributions q and p. To measure the Kullback-Leibler (KL) divergence one has to know the stochastic source underlying the data set, which can be decomposed into the input-generating part q(x) and the output probability distribution q(C_l | x). In practical applications there is typically no such knowledge. So in our training procedure only the log likelihood

    E_t = -(1/t) Σ_{p=1}^{t} log p(c^p | x^p, w)    (2.4)

will be available, using the empirical joint distribution

    q*(x, C_l) = (1/t) Σ_{p=1}^{t} 1{x = x^p and C_l = c^p}

to evaluate equation 2.3; c^p refers to the correct class label associated with x^p.¹
Our results based on training with equation 2.4 have practical importance, since, as mentioned above, in general practical problems only the empirical distribution is known. On the test set we use a better approximation to the KL divergence by sampling equation 2.3, for which all necessary ingredients are known:

    E_test = (1/#test) Σ_p Σ_l [ q(C_l | x^p) ln q(C_l | x^p) - q(C_l | x^p) ln p(C_l | x^p, w) ].    (2.5)
So given a random uniformly distributed input, we can use the a posteriori probabilities q(C_l | x^p), which are exactly the output values given by the teacher networks on the presentation of an input vector x^p.

3 Order Parameters
For the committee machine several authors have observed a phase transition, where the generalization error first scales as N/t in a so-called symmetric phase, whereas for more patterns a transition takes place and the system scales as NH/t in the symmetry-broken phase (Barkai et al. 1992; Schwarze and Hertz 1993; Seung et al. 1992; Kang et al. 1993; Saad and Solla 1995a,b). Below the transition all hidden units learn uncorrelated to each other and to all the teacher hidden units (see Fig. 1a).

¹We use equation 2.4 instead of equation 2.3 because minimizing the KL divergence and minimizing -∫ dx q(x) Σ_l q(C_l | x) ln p(C_l | x, w) differ only by a constant and are therefore equivalent. In the learning situation, only the set of training examples is available, so we have to use equation 2.4.
Numerical Study on Learning Curves
1089
Figure 1: Schematic picture of the weight vectors of the student and the teacher (a) before and (b) after the transition from uncorrelated to correlated learning.
Above the transition every student hidden unit decides for one teacher hidden unit and is maximally uncorrelated with the other teacher hidden units (see Fig. 1b). We would like to determine whether this transition also occurs in continuous multilayer feedforward networks trained with continuous patterns. We therefore define a set of order parameters that allows a more careful inspection of the correlations between student and teacher than the Kullback-Leibler divergence.
3.1 Angle-Based Order Parameters. In the committee machine the overlap and the self-overlap describe the dynamics of the hidden units during learning, where we use the abbreviation w^H_i = (w^H_{i1}, ..., w^H_{iN}). To have only one parameter we consider all permutations σ of the hidden units in the multilayer perceptron case, to make the overlap independent of the actual permutation. In our case the weights have to be normalized, since our system is not binary. Let w^H_{T,i} and w^H_{S,σ(i)} be the vectors of all weights from the input layer into hidden unit i of teacher and student, respectively, and let w^O_{T,i} and w^O_{S,σ(i)} denote the weight vectors from hidden unit i to all output units. Based on this notation we define two measures
for the correlation of the weight vectors,

    r_H = max_σ (1/H) Σ_i (w^H_{T,i} · w^H_{S,σ(i)}) / (|w^H_{T,i}| |w^H_{S,σ(i)}|),    (3.1)

and analogously r_O for the hidden-to-output weights, where max_σ is the maximum over all possible permutations σ of the hidden units. In other words, we consider the overlap of the hidden units given a permutation such that the weights of the hidden units of the teacher and the student are maximally correlated. A transition from uncorrelated to correlated learning, as mentioned above, would be detected as a change of the angles between teacher and student vectors.

3.2 Length-Based Order Parameters. The order parameters introduced in the last section essentially measure the angle between teacher and student machine. Now we have to take into account that we do not deal with binary weights, which are nicely normalized, but with students who can change the lengths of their parameter vectors quite drastically in the dynamics of the learning process. We therefore introduce a new set of order parameters based on the ratio between the teacher and student weights,

    ratio H = (1/H) Σ_i |w^H_{S,σ(i)}| / |w^H_{T,i}|,    (3.2)

and analogously ratio O for the hidden-to-output weights.
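The permutation-maximized overlap of equation 3.1 can be sketched directly, since the number of hidden units is small enough to enumerate all permutations. This is a toy illustration with our own variable names, not the paper's code:

```python
import itertools
import math

def cosine(u, v):
    """Cosine of the angle between two weight vectors."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    return dot / (math.sqrt(sum(ui * ui for ui in u)) *
                  math.sqrt(sum(vi * vi for vi in v)))

def r_H(teacher, student):
    """Mean cosine overlap, maximized over all permutations of the student's hidden units."""
    H = len(teacher)
    return max(
        sum(cosine(teacher[i], student[sigma[i]]) for i in range(H)) / H
        for sigma in itertools.permutations(range(H))
    )

teacher = [[1.0, 0.0], [0.0, 1.0]]
print(r_H(teacher, teacher))                    # identical nets: maximal overlap
print(r_H(teacher, [[0.0, 2.0], [3.0, 0.0]]))   # same directions, but permuted and rescaled
```

Because the cosine ignores vector lengths and the maximum runs over permutations, the second call also yields maximal overlap; this is exactly why the length-based ratio of equation 3.2 is needed as a separate order parameter.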
3.3 Correlation-Based Order Parameters. Since the hidden units implement functions, we additionally measure the functional L² norm, which corresponds to the correlations between the hidden unit activities:

    Act H_ij = (1/#test) Σ_p [ s_{T,i}(x^p) - s_{S,j}(x^p) ]².    (3.3)

The sum is taken over the test set, and s_{T,i} denotes the activity of the ith hidden unit of the teacher while s_{S,j} is the student's activity at hidden unit j (cf. equation 2.2). A value of Act H_ij ~ 0 corresponds to a maximal correlation between student and teacher. This parameter gives a very clear picture of the dynamics of the functional distance between teacher and student hidden units during learning.
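A sketch of the functional distance of equation 3.3, using made-up activity values rather than simulation data:

```python
def act_H(s_teacher, s_student):
    """Mean squared L2 distance between teacher and student hidden unit
    activities over the test set (cf. equation 3.3); ~0 means maximal correlation."""
    return sum((a - b) ** 2 for a, b in zip(s_teacher, s_student)) / len(s_teacher)

s_T = [0.1, 0.9, 0.5, 0.3]    # activities s_Ti(x^p) of one teacher hidden unit
s_S = [0.1, 0.9, 0.5, 0.3]    # a perfectly matching student unit
print(act_H(s_T, s_S))                      # 0: maximal correlation
print(act_H(s_T, [0.9, 0.1, 0.5, 0.7]))     # nonzero: functionally distant unit
```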
3.4 Output Measure. As a last order parameter we consider the extremality of the output activities,

    Ext = (1/#set) Σ_p (1/M) Σ_{l=1}^{M} min{ [1 - O_l(x^p)]², [O_l(x^p) - 0]² }.    (3.4)

The sum over p is taken either over the training or the test set and we normalize by the cardinality of the respective set (denoted by #set). The quantity Ext measures how strongly the network fits the extreme values of the targets, so if the network outputs are close to either 0 or 1 we obtain Ext ~ 0. In this sense Ext is a measure of overfitting, assuming a smooth posterior q(C_l | x, w_T) of the teacher. As Ext takes nonzero values the student network starts to provide a smooth estimate of the a posteriori distribution of the teacher.
4 The Simulation
The simulations were performed on a parallel computer (CM5). Every curve in the figures takes about 3-5 hr of computing time on a 128 or 256 partition of the CM5. This setting enabled us to do the statistics for a single teacher over 128-512 samples (different training sets). The exact conditions under which our simulations were performed are as follows:

1. A teacher network w_T is chosen at random, where weights and biases are normally distributed with zero mean and variance 1.

2. Then a random training set of size t and a test set of fixed size 100,000 are drawn by choosing x^p from a uniform distribution of appropriate width. The output distribution q(C_l | x, w_T) is generated by the previously chosen teacher w_T and the 1-out-of-M class target vectors y^p are generated stochastically.

3. A student w is initialized randomly or as the teacher configuration w_T. Conjugate gradient learning with linesearch on the log likelihood (equation 2.4) is applied. Once the student has reached a local minimum of the training error (equation 2.4) we assess the different order parameters of equations 3.1-3.4.

4. Furthermore the generalization ability of the student is measured on the test set via equation 2.5.

5 Higher Order Corrections
To obtain the asymptotic theory for the learning curve of the student networks w we have to expand the likelihood function (KL divergence) around the teacher w_T, following Amari (1985), Amari and Murata (1993), Murata et al. (1994), and Akahira and Takeuchi (1981). We now give the results for the higher order corrections to the asymptotic expansion, yielding a
refined scaling law, consisting not only of equation 1.1 but also of higher order terms responsible for the deviations seen in the simulation:

    ε_g = H_0 + m/2t + A/t² + higher order terms.    (5.1)
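The relative weight of the 1/t² correction in equation 5.1 shrinks as t grows. A small numeric sketch, with A chosen arbitrarily since the true prefactor is model dependent:

```python
def excess_error(m, A, t):
    """Excess generalization error of equation 5.1 (without H_0): m/2t + A/t^2."""
    return m / (2.0 * t) + A / t ** 2

m, A = 108, 5000.0   # A is a made-up value; the actual prefactor is model dependent
for t in (500, 2000, 8000):
    first, second = m / (2.0 * t), A / t ** 2
    print(t, second / first)   # fraction of the 1/t term contributed by the correction
```

The printed ratio equals 2A/(mt) and therefore falls off linearly in t, which is why the correction dominates in the medium range and fades in the asymptotic range.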
The 1/t² correction has a prefactor A, which is very complicated and unfortunately strongly model dependent. The first m/2t term is model independent. The variance of the first order term in ε_g has the form σ = (m/2t²)^{1/2}. The complete correction term A is discussed in the Appendix.

6 Results
In our simulations we can distinguish between three ranges of t, which will be described subsequently. First we summarize the general picture and then we relate this picture to the numerics.
1. Small t: in this range we observe strong overfitting, which induces diverging weights and generalization error, whereas the simulations typically show a finite generalization error due to finite numerical precision and the flatness of the error surface.

2. Medium t: a 1/t² scaling is observed. So far, neither the statistical physics predictions nor statistical considerations have addressed the scaling of learning curves in a medium range of t. We propose necessary higher order corrections that have to be taken into account to explain the phenomena.

3. Large t (asymptotic range): the asymptotics underlying equation 1.1 are observed in the range of a large number of patterns.

6.1 Few Examples: Overfitting. In the following we will first give a theoretical explanation of the small t range and then report on our experimental findings.

6.1.1 Why Overfitting? Theoretical Considerations. For small t we are below storage capacity. A network is considered to operate below storage capacity if the student can reproduce the correct labeling on the training set with probability 1 and can therefore classify all given training patterns without error. The best and global solution of the learning problem in this case is one output set to 1, all others equal to zero, and diverging weights. If the weights diverge, the generalization error is bound to go to infinity. For a fixed architecture the limit of storage capacity depends on the specific sample. Above storage capacity, as the student cannot classify all training patterns correctly for a given sample, a minimum with finite generalization error and finite weights becomes favorable. In Figure 3
we plot the probability r for finding a finite minimum, computed by averaging over a large number of samples. As we see, for t > 2m all students end up in a finite minimum with probability r = 1. At t ~ m about half of the students are giving perfect classifications (r = 1/2), and therefore diverging generalization error. So r is not only a good parameter to detect the limit of the storage capacity of the classifier, but r < 1 can also be used as an indication for a diverging generalization error. So for t < 2m, the averaged generalization error should always be infinity, according to our theoretical considerations. On single samples we can of course obtain a finite generalization error, if a student cannot classify all training patterns correctly. In the range of small t, an analogy to the transition found for the committee machine by means of statistical mechanics in the thermodynamic limit (Barkai et al. 1992; Schwarze and Hertz 1993; Seung et al. 1992; Kang et al. 1993; Saad and Solla 1995a,b) could be the transition from infinite to finite weights, with respect to KL divergence.
6.1.2 Experimental Results. Plotted in Figure 2a is the Kullback-Leibler divergence found in the simulation for a 108-parameter network (8-8-4).² Obviously the generalization error is not diverging. This result is typical for a practical simulation, which is limited due to finite precision and the flatness of the error surface. For t < m the student overfits strongly, with outputs tending to take the extreme values 0 or 1 to imitate the empirical distribution q*(x, C_l). As one student output tends to 1, the others tend to zero. The value for the extremality parameter, also observed in Figure 3 of the simulation, is in this situation Ext ~ 0 before the bend of the Kullback-Leibler divergence (near t ~ m) and Ext > 0 after the bend. Taking extreme output values is possible only if the student weights increase drastically. Although we cannot see the expected diverging weight values, we observe in Figure 4 that the size of the student weights is very large, until after the transition point it approaches a magnitude similar to the teacher's weights. The measure ratio O shows a nice agreement with the shape of the generalization error, while ratio H approaches its maximum value after the transition near t ~ m. As more examples are learned and the point t = m is passed, we observe a knee in the learning curve and a decrease of the absolute values of the student weights. For larger networks the knee steepens and the change in weight size becomes more prominent. The overfitting is a result of the fact that smooth networks can always fit the data exactly when t < m. In the region of the bend in the KL divergence at m < t < 4m we find a change in the scaling behavior toward a faster scaling law. In this range the outputs start to take nonextreme values and the parameter

²For the 8-8-4 network we compute the number of free parameters as m = (N+1)H + (H+1)M = 108.
Figure 2: Plotted are the simulated generalization values over 1/t for an 8-8-4 network. We compare the start from the teacher w_T and a random initialization (a) for the whole learning curve and (b) for the asymptotic area. Note that in the asymptotic range we find for the random-started simulation higher values for the KL divergence, i.e., the simulation gets stuck earlier in local minima.
Figure 3: Ext measured on training and test set indicates whether the output activities take extreme values, as a function of 1/t (8-8-4 net). A value of zero indicates extreme output values, i.e., 0 or 1. Compared to Ext is the probability r of wrong classification on the training set; for r = 0 only a diverging KL divergence is a valid solution, for r = 1 a finite minimum is more favorable. r is a good parameter to detect the limit of the storage capacity of the classifier.
Ext shows a sharp bend, since more examples are provided to give a smoother estimate of the a posteriori distribution of the teacher. Also the probability r of finding a finite solution tends to 1, and for t > 2m numerical effects do not have to be considered anymore. We determine that the activities and angles of the teacher and student hidden units are still uncorrelated, i.e., the student hidden units do not correlate to specific teacher hidden units. We conclude that overfitting effects dominate the small t region to a large extent. They can be measured through the order parameters ratio O and Ext. The region where the average generalization error actually diverges theoretically can be estimated by r. We would like to emphasize that below storage capacity, numerical effects that act as regularizers depending on implementation details³ will typically be observed and are hard to circumvent.

Figure 4: Ratio of the student and teacher weights of hidden to output units (ratio O) and input to hidden units (ratio H) versus Kullback-Leibler divergence as a function of 1/t (8-8-4 net). Note the strong increase of ratio O at the bend of KL near t ~ m.

6.2 Medium Range: Many Examples. For 4m < t < 30m we find a scaling law of 1/t², which is faster than 1/t. Yet, the exponent is slowly decreasing toward t^{-1} as t grows toward the large t regime. The higher order corrections of equation 5.1 can explain this effect: the farther we are away from the 1/t asymptotics, the more prominent are the correction terms of equation 5.1. Note that the above mentioned value 4m for the onset of the 1/t² asymptotic region is specific to the example (8-8-4) used frequently in this paper, since the parameter A from equation 5.1, which determines the onset, is unfortunately strongly model dependent (see Appendix).

³In our case the linesearch and the bracketing subroutines have tolerance bounds for the gradients with respect to the log likelihood (equation 2.4). These act implicitly as regularizers.
Figure 5: Plotted are the simulated generalization values over 1/t for an 8-8-4 network. For large t an exponent of the scaling law smaller than -1 is observed. Shown are the simulated values minus m/2t. Above t = 3000 we find the scaling predicted in equation 1.1, i.e., the points are on the line ε_g = 0. Below t = 3000 a quadratic interpolation is applied, yielding the necessary higher order corrections to equation 1.1.

In Figure 6 we can see the asymptotic region for a number of different networks as a function of m/t. Clearly the range of the 1/t regime is completely different for different network configurations. To have a better impression of the quality of the t^{-1} and t^{-2} scaling, we subtracted 108/2t from the data points in Figure 5 and clearly see ε_g = 0 for t > 3000, while for t > 400 a t^{-2} fit can be nicely applied. In the following we will use the term correlation synonymously with the functional distance or the angle. In the t^{-2} range, quantitatively the correlations (angle r_H) between teacher and student weights show a transition from a state where the hidden units of the student and the teacher are initially correlated to a certain extent (r_H = 0.63) toward asymptotic alignment (r_H = 1; cf. Fig. 7). Furthermore, if we consider the functional distances Act H_ij in Figure 7, we observe an initial overall similar functional distance between student and teacher hidden units ranging from 0.15 to 0.4. For larger t this distance is decreased to zero for one hidden unit, while the others maintain a similar magnitude ranging from 0.15 to
0.35 as before. This effect would also be a candidate for the transition in the committee machine (Barkai et al. 1992; Schwarze and Hertz 1993; Seung et al. 1992; Kang et al. 1993; Saad and Solla 1995a,b), although it is by no means similarly abrupt and has to be observed in several order parameters (angle, functional distance, and ratio) as proposed above (see also Section 6.1.1). Note that practical applications usually have access to a data size > 5m*, where m* is the number of effective parameters in the network. So under the conditions pointed out in Section 3 we will observe in most practical situations a knee in the learning curve and a faster scaling than 1/t, i.e., the exponent of t is smaller than -1 and higher order correction terms have to be taken into account to explain this effect.

Figure 6: Kullback-Leibler divergence as a function of m/t for different network sizes as indicated in the key. Asymptotically all curves coincide. Furthermore, note the different onset of the 1/t² region for the different network sizes.

Figure 7: Angle between student and teacher weights of input to hidden units (r_H) versus L² functional distance between the activity of student hidden unit 1 and all teacher hidden unit activities versus Kullback-Leibler divergence (KL) as a function of 1/t (8-8-4 net).

6.3 Asymptotic Behavior: Extensively Many Examples. As the asymptotic range is reached slowly, the higher order terms lose their importance and the law stated in equation 1.1 is approached. All networks studied exhibit an m/2t scaling in their asymptotic range.⁴ In Figure 8a and b we show in particular the 8-8-4 result with an interpolated slope of 57 and the 16-10-4 net (212 parameters) with a slope of 104. Clearly the interpolated region of m/2t is reached at higher t (t > 5000) in the larger system. In even larger networks (e.g., 16-12-4) the asymptotic region will shrink and will eventually not be reached for the maximum number of 32,768 patterns considered in our simulation. In this case one always has to rely on higher order corrections of the scaling law (equation 5.1). In Figure 6 we plotted the KL divergence as a function of m/t. For large t all curves coincide with a slope of 1/2.

6.4 Initialization. Most of the figures report on the simulation scenario, where we trained the student network starting from the teacher

⁴E.g., 16-4-4 slope, 47; 16-8-4 slope, 98; 16-10-4 slope, 104; 8-8-4 start from teacher slope, 57; 8-8-4 start from random initialization slope, 56.
Figure 8: Plotted are the simulated generalization values in the asymptotic range for (a) the 8-8-4 network (108 parameters) and (b) the 16-10-4 network (212 parameters). In both cases a clear scaling as 1/t is seen.
configuration w_T. The idea was that since we consider a local neighborhood of the maximum likelihood estimator in the asymptotic case, the teacher would be a good starting condition for training. Figure 2a shows the complete learning curve of an 8-8-4 network, comparing this initialization of the student to a random one. Except for the asymptotic range, both initializations always yield very similar results. From this we conclude that no matter where we start in phase space, the dynamics of learning is always attracted to a local minimum of similar quality as in the case of a start from w_T. The detailed picture of the asymptotic range is given in Figure 2b. Clearly, starting from a random initial state makes the learning converge to a higher local minimum in the generalization error only in the asymptotic range. Nevertheless, since the asymptotic theory is valid in any local minimum close to the teacher, we observe the same asymptotic m/2t scaling for the random initialization as for a start from the teacher (cf. Fig. 2b). Note, however, that the learning speed is increased by 20% when using the teacher as the initial starting point of learning.
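The initialization comparison can be illustrated on a toy maximum likelihood problem. This is a hedged one-parameter sketch (a logistic model trained by plain gradient descent rather than conjugate gradients, with made-up values), not the paper's setup:

```python
import math
import random

random.seed(1)
w_teacher = 1.5   # made-up one-dimensional "teacher configuration"

def p1(w, x):
    """Logistic model p(c = 1 | x, w)."""
    return 1.0 / (1.0 + math.exp(-w * x))

# Teacher-generated data: stochastic 0/1 labels, analogous to Section 2.
data = []
for _ in range(200):
    x = random.uniform(-2, 2)
    data.append((x, 1 if random.random() < p1(w_teacher, x) else 0))

def nll(w):
    """Negative log likelihood (cf. equation 2.4)."""
    eps = 1e-12
    return -sum(c * math.log(p1(w, x) + eps) + (1 - c) * math.log(1 - p1(w, x) + eps)
                for x, c in data) / len(data)

def train(w, steps=200, lr=0.5):
    """Plain gradient descent on the negative log likelihood."""
    for _ in range(steps):
        grad = -sum((c - p1(w, x)) * x for x, c in data) / len(data)
        w -= lr * grad
    return w

w_from_teacher = train(w_teacher)     # start from the teacher configuration
w_from_random  = train(0.0)           # "random" (here: zero) start
print(w_from_teacher, w_from_random)  # both runs end up near the same minimum
```

In this convex one-parameter toy both starts converge to the same maximum likelihood solution; in the multilayer networks of the paper the error surface has many local minima, which is why only the asymptotic range distinguishes the two initializations.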
7 Discussion and Outlook
In our numerical study we observed a rich structure in the learning curves of continuous feedforward networks. For a small number of patterns we find a phase of strong overfitting, where the outputs take extreme values in their estimate of q(C_l | x, w_T) (Fig. 3) and the student can classify all training patterns correctly. We are below the storage capacity of the classifier, so the weights and the generalization ability should theoretically diverge. This is not observed in a typical simulation due to numerical effects of finite precision (inducing an implicit regularization) and the flatness of the error surface. As the number of patterns increases beyond storage capacity, the Kullback-Leibler divergence also theoretically reaches the finite value found in the simulation and the outputs start estimating smoother probabilities. The size of the student weights becomes comparable to the teacher weights. The bend of the learning curve is followed by a region of 1/t² scaling as t is increased. Asymptotically we confirm the m/2t behavior. From our results it seems important to reach the 1/t² phase as fast as possible to learn efficiently without overfitting and to obtain a smooth estimate of the a posteriori distribution. Furthermore, once a smooth estimate is obtained, the network is finally free to learn in a collective manner, i.e., the activity of one student hidden unit becomes highly correlated with one specific teacher hidden unit (Fig. 7). Practical applications usually have access to data sets large enough to enter the 1/t² range. If maximum likelihood training and no early stopping method is used, then according to our results typically both a knee and a faster scaling in the learning curve should be observed. Yet,
the range of the asymptotic 1/t scaling seems to be too far from realistic sizes of data sets available to most practical users of neural nets. We would like to emphasize that we always find a faster scaling than 1/t between the small t overfitting phase and the asymptotic phase. For this reason model selection criteria that are usually based on asymptotic or certain overall assumptions on the smoothness of learning curves are likely to perform weakly, since they capture neither the transition encountered nor the faster scaling observed (see also Kearns et al. 1995). Further investigation is focused on the measurement of scaling laws in a real practical application and on algorithms that use early stopping to avoid overlearning or overfitting effects (Amari et al. 1995, 1996).⁵

Appendix

We now describe the details of the asymptotic theory for the higher order corrections. The conditions for an asymptotic evaluation of ε_g are t large and a realizable teacher machine with parameter w_T. The present framework can be readily extended to unrealizable cases.

A.1. Asymptotic Distribution of the m.l.e. ŵ. Let us normalize the maximum likelihood estimator (m.l.e.) ŵ as

    w̃ = √t (ŵ - w_T).
Then, the error w̃ is asymptotically normally distributed,

    p(w̃; G) = (2π)^{-m/2} |G|^{1/2} exp(-(1/2) w̃ᵀ G w̃),

with mean 0 and variance matrix (g^{ij}), where (g^{ij}) is the inverse of the Fisher information matrix G = (g_{ij}),

    g_{ij} = -E[ ∂² log p(C_l, x; w_T) / ∂w_i ∂w_j ].

The higher-order Edgeworth expansion gives

    p̃(w̃; w_T) = p(w̃; G) {1 + A_t(w̃)},
⁵Further information on related research can be found at http://www.first.gmd.de/persons/Mueller.Klaus-Robert.html.
where A_t(w̃) contains terms such as (1/18) K_{ijk} K_{lmn} w̃^{(i} w̃^{j} w̃^{k} w̃^{l} w̃^{m} w̃^{n)}, etc., where parentheses attached to indices denote symmetrization with respect to the indices inside the parentheses. The Edgeworth expansion of asymptotic distributions of the m.l.e. was given by many researchers in the eighties, e.g., Akahira and Takeuchi (1981) and Amari (1985). Amari gave its geometric interpretation in the framework of curved exponential families. From this, we have the moments of the error in parameter space w̃ = √t (ŵ - w_T):
    E[w̃^i] = -(1/√t) K^i,
    E[w̃^i w̃^j] = g^{ij} + (1/t) A^{ij},
    E[w̃^i w̃^j w̃^k] = (1/√t) A^{ijk},
    E[w̃^i w̃^j w̃^k w̃^l] = 3 g^{(ij} g^{kl)} + (1/t) A^{ijkl},
where the A's are given explicitly in Akahira and Takeuchi (1981) and Amari (1985).

A.2. Expansion of the Kullback-Leibler Divergence. Let ŵ = w_T + Δw. Then, by Taylor expansion, we have
where x hereafter implies the pair (x, C_l) and l(w) = log p(x, w). By expansion, we have

    l(w) = l(w_T + Δw) = l(w_T) + Σ_i (∂l/∂w_i) Δw_i + (1/2) Σ_{i,j} (∂²l/∂w_i ∂w_j) Δw_i Δw_j + ⋯.

Hence we arrive at

    D(w_T, ŵ) ≈ -(1/2) Σ_{i,j} L_{ij} Δw_i Δw_j + ⋯,

where for example L_{ij} is given by

    L_{ij} = E_x[ ∂² log p(C_l, x; w_T) / ∂w_i ∂w_j ].

Therefore the expansion of ε_g is given as

    ε_g = E[D(w_T, ŵ)],
This gives the higher-order correction to the learning curve,

ε_g = ⟨−log p(ζ, x | ŵ)⟩ = E_ŵ E_{(x,ζ)}[ −log p(ζ, x | ŵ) ] = H₀ + m/(2t) + A/t² + higher-order terms

The result is also confirmed by Komaki (1994), where he obtained the Kullback-Leibler divergence with the modification of the predictive distribution by the normal mixture direction. When the normal correction is put equal to 0, his result gives

E[D(w_T, ŵ)] = m/(2t) + A/(4t²) + O(1/t³)

where A is explicitly obtained. It includes the curvature terms, bias gradient terms, geometric and fourth cumulant terms, etc., in agreement with that given by Amari (1985).
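The leading m/(2t) term can be checked numerically in a model where everything is available in closed form. The sketch below is our own illustration, not part of the paper: a one-dimensional Gaussian-mean teacher, for which m = 1 and E[D(w_T, ŵ)] = 1/(2t) exactly.

```python
import numpy as np

# Teacher: x ~ N(w_T, 1); the m.l.e. from t examples is the sample mean, and
# D(w_T, w_hat) = (w_hat - w_T)^2 / 2, so E[D] = 1/(2t), i.e., m/(2t) with m = 1.
rng = np.random.default_rng(0)
w_T, t, trials = 0.3, 50, 20000

samples = rng.normal(w_T, 1.0, size=(trials, t))
w_hat = samples.mean(axis=1)          # one m.l.e. per trial
kl = 0.5 * (w_hat - w_T) ** 2         # KL divergence for a Gaussian mean shift
print(kl.mean(), 1.0 / (2 * t))       # the two numbers should nearly agree
```

With 20,000 trials the Monte Carlo average sits well within a few percent of m/(2t) = 0.01, illustrating the leading term of the expansion.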
Acknowledgments

We would like to thank the participants of the NNSMP and the Snowbird workshop for fruitful and stimulating discussions. K.-R. M. thanks S. Bös, T. Heskes, and A. Herz for valuable discussions and for warm hospitality during his stay at the Beckman Institute in Urbana, Illinois. We further gratefully acknowledge computing time on the CM5 in Urbana (NCSA) and in Bonn. This work was supported by the National Institutes of Health (P41RR05969), and K.-R. M. is supported by the European Communities S & T fellowship under contract FTJ3-004.

References

Akahira, M., and Takeuchi, K. 1981. Asymptotic Efficiency of Statistical Estimators: Concepts and Higher Order Asymptotic Efficiency. Springer, New York.
Amari, S. 1985. Differential Geometrical Methods in Statistics, Lecture Notes in Statistics No. 28. Springer, New York.
Amari, S., and Murata, N. 1993. Statistical theory of learning curves under entropic loss criterion. Neural Comp. 5, 140-153.
Amari, S., Murata, N., Müller, K.-R., Finke, M., and Yang, H. 1995. Asymptotic statistical theory of overtraining and cross-validation. University of Tokyo Tech. Rep. METR 95-06. IEEE Trans. Neural Networks (submitted).
Amari, S., Murata, N., Müller, K.-R., Finke, M., and Yang, H. 1996. Statistical theory of overtraining - Is cross-validation effective? In Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, eds. MIT Press, Cambridge, MA.
Barkai, E., Hansel, D., and Sompolinsky, H. 1992. Broken symmetries in multilayered perceptrons. Phys. Rev. A 45, 4146.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151.
Finke, M., and Müller, K.-R. 1994. Estimating a-posteriori probabilities using stochastic network models. In Proceedings of the 1993 Connectionist Models Summer School, M. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman, and A. S. Weigend, eds., p. 324. Erlbaum, Hillsdale, NJ.
Haussler, D., Kearns, M., Seung, S., and Tishby, N. 1994. Rigorous learning curve bounds from statistical mechanics. In Proc. of COLT, pp. 76-87.
Heskes, T. M., and Kappen, B. 1991. Learning processes in neural networks. Phys. Rev. A 44, 2718.
Kang, K., Oh, J.-H., Kwon, C., and Park, Y. 1993. Generalization in a two-layer network. Phys. Rev. E 48, 4805.
Kearns, M., Mansour, Y., Ng, A. Y., and Ron, D. 1995. An experimental and theoretical comparison of model selection methods. In Proc. of COLT.
Komaki, F. 1994. On Asymptotic Properties of Predictive Distributions, METR 94-21. University of Tokyo, Tokyo, Japan.
Kuhlmann, P., and Müller, K.-R. 1994. J. Phys. A: Math. Gen. 27, 3759-3774.
Müller, K.-R., Finke, M., Murata, N., Schulten, K., and Amari, S. 1995. On large scale simulations for learning curves. In Proceedings of the CTP-PBSRI Workshop on Theoretical Physics: Neural Networks, The Statistical Mechanics Perspective, pp. 73-84. World Scientific, Singapore.
Murata, N., Yoshizawa, S., and Amari, S. 1993. Learning curves, model selection and complexity of neural networks. In NIPS 5, p. 607. Morgan Kaufmann, San Mateo, CA.
Opper, M., and Kinzel, W. 1995. Statistical mechanics of generalization. In Physics of Neural Networks III, E. Domany, J. L. van Hemmen, and K. Schulten, eds. Springer, Heidelberg.
Opper, M., and Haussler, D. 1991. Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise. In Proc. of COLT, pp. 75-87.
Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalize. J. Phys. A: Math. Gen. 23, L581.
Saad, D., and Solla, S. 1995a. On-line learning in soft committee machines. Phys. Rev. E 52, 4225.
Saad, D., and Solla, S. 1995b. Exact solution for on-line learning in multilayer neural networks. Phys. Rev. Lett. 74, 4337.
Schwarze, H., and Hertz, J. 1993. Europhys. Lett. 21, 785.
Seung, S., Sompolinsky, H., and Tishby, N. 1992. Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056.
Sompolinsky, H., Tishby, N., and Seung, S. 1990. Phys. Rev. Lett. 65, 1683.
Watkin, T. L. H., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Mod. Phys. 65, 499.

Received June 23, 1995; accepted December 18, 1995.
This article has been cited by:
Koji Tsuda, Shotaro Akaho, Motoaki Kawanabe, and Klaus-Robert Müller. 2004. Asymptotic properties of the Fisher kernel. Neural Computation 16:1, 115-137.
Siegfried Bös. 1998. Statistical mechanics approach to early stopping and weight decay. Physical Review E 58:1, 833-844.
Communicated by Maxwell Stinchcombe
Rate of Convergence in Density Estimation Using Neural Networks

Dharmendra S. Modha
Elias Masry
Department of Electrical and Computer Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0407 USA

Given N i.i.d. observations {X_i}_{i=1}^N taking values in a compact subset of R^d, such that p* denotes their common probability density function, we estimate p* from an exponential family of densities based on single hidden layer sigmoidal networks using a certain minimum complexity density estimation scheme. Assuming that p* possesses a certain exponential representation, we establish a rate of convergence, independent of the dimension d, for the expected Hellinger distance between the proposed minimum complexity density estimator and the true underlying density p*.

1 Introduction

Let {X_i}_{i=1}^∞ be independent and identically distributed (i.i.d.) random variables taking values in [−1/2, 1/2]^d ⊂ R^d. Let p*: R^d → R_+ denote their common probability density function (density), which is assumed to exist. Given N observations {X_i}_{i=1}^N drawn from {X_i}_{i=1}^∞, we are interested in estimating the probability density function p*. Density estimation is a central problem in important applications such as data compression (Barron and Cover 1991; Rissanen 1989; Rissanen et al. 1992) and pattern recognition (Silverman 1986). Consequently, density estimation has received wide attention; see, for example, the books by Devroye and Györfi (1985), Nadaraya (1989), Scott (1992), Silverman (1986), and Tapia and Thompson (1978). An important method for estimating a compactly supported density is to estimate the logarithm of the density (log-density) by using a basis function representation for the log-density.
The estimation of log-densities has been previously considered by Stone and Koo (1986) and by Stone (1990, 1991, 1994) using splines, by Barron and Sheu (1991) using splines, polynomials, and trigonometric series, and more recently by White (1992) and by Modha and Fainman (1994) using single hidden layer sigmoidal networks.

Neural Computation 8, 1107-1122 (1996) © 1996 Massachusetts Institute of Technology
Single hidden layer sigmoidal networks have been recently used, in connection with the problem of regression estimation, to provide rates of convergence (for the integrated mean squared errors between certain minimum complexity regression estimators and the true regression function) that are independent of the dimension d (Barron 1994). Motivated by this fact, we further examine, in this paper, the density estimation scheme in White (1992) and in Modha and Fainman (1994) to see if it also provides a rate of convergence (in a sense to be made precise) that is independent of the dimension d for a certain class of densities (also to be made precise). We now summarize the contributions of this paper and outline its organization. In Section 2, we make precise a class of densities for which a rate of approximation (in the Kullback-Leibler distance), independent of the dimension d, can be obtained. In Section 3, we introduce a semiparametric family of densities, namely, an exponential family of densities based on single hidden layer sigmoidal networks. The semiparametric family of densities is specifically designed to well-approximate the class of densities introduced in Section 2. In Section 4, we construct a minimum complexity density estimator from the semiparametric family of densities introduced in Section 3 by utilizing the abstract minimum complexity density estimation framework of Barron and Cover (1991) and Barron (1991). Finally, as the main contribution, we establish a rate of convergence for the expected Hellinger distance between the proposed density estimator and the true density p* (Theorem 4.1).

2 A Class of Target Densities
We let ln ≡ log_e and log ≡ log₂.
Assumption 2.1. Assume that p* has compact support B ⊆ [−1/2, 1/2]^d.
The compactness assumption, although stringent, is typical in the literature concerned with log-density estimation (Barron and Sheu 1991; Stone 1990, 1991, 1994). For w = (w₁, …, w_d) and x = (x₁, …, x_d) in R^d, let w · x = Σ_{i=1}^d w_i x_i denote the usual inner product on R^d, and let ‖w‖₁ = Σ_{i=1}^d |w_i| denote a norm on R^d.
Assumption 2.2. Assume that there exists a complex-valued function f on R^d and a corresponding normalizing constant C* > 0 such that for x ∈ B, we have

p*(x) = C* exp[ ∫_{R^d} (e^{iw·x} − 1) f(w) dw ]    (2.1)

and that

∫_{R^d} ‖w‖₁ |f(w)| dw ≤ C′ < ∞    (2.2)

for some known C′ > 0. Set C = max{1, C′}.
Remark 2.1. For x ∈ B, write

f°(x) = ∫_{R^d} (e^{iw·x} − 1) f(w) dw

But, since p* is a density, we may write the normalizing constant C* as

C* = [ ∫_B exp[f°(x)] dx ]^{−1}

Let f* ≡ ln p* denote the log-density of p*. Then, for x ∈ B, equation 2.1 may be equivalently interpreted as

f*(x) = ln p*(x) = ln C* + ∫_{R^d} (e^{iw·x} − 1) f(w) dw
In other words, we assume that the log-density f* has an inverse Fourier transform-type representation on the set B. More explicitly, f° has an extension outside the compact set B such that the extended function possesses a uniformly continuous gradient whose Fourier transform is absolutely integrable (Barron 1993). Assumption 2.2 also implies that the density p* is bounded away from zero and infinity on B (see Lemma 5.3). Assumption 2.2 on the log-density function f* is similar to and is inspired by the assumptions in Jones (1992) and Barron (1993) on the regression function in the context of regression estimation using projection pursuit and neural networks. Close parallels between log-density estimation and regression estimation have also been observed by Stone (1994). Assumption 2.2 determines a class of densities for which an exponential family of densities based on single hidden layer sigmoidal networks (see the next section for a definition) can provide a rate of approximation (in the Kullback-Leibler distance) independent of the dimension d (see Lemma 5.5). However, observe that the constant multiplier in Lemma 5.5 does depend on the dimension d. Assumption 2.2 is satisfied, for example, if we assume that the density p* has bounded and continuous derivatives of total order ⌊d/2⌋ + 2 (Barron 1993). In this case, however, the rate of convergence obtained using the exponential family of neural networks is comparable to that obtained by exponential families of densities based on splines, polynomials, or trigonometric series (Barron and Sheu 1991; Stone 1994). Nonetheless, even in that case, the exponential family of densities based on single hidden layer sigmoidal networks still provides a convenient functional form for parameter estimation in higher dimensions (Modha and Fainman 1994).
Remark 2.2. Roughly speaking, one may interpret the integral ∫_{R^d} (e^{iw·x} − 1) f(w) dw in Assumption 2.2 as an infinite sum. Thus, we may interpret the density p* in Assumption 2.2 as being composed of an infinite product. This idea may justify, in part, the projection pursuit density estimation schemes of Friedman et al. (1984) and Huber (1985).

3 An Exponential Family of Densities Based on Single Hidden Layer Sigmoidal Networks

In this section, we utilize various results of Barron (1994) to construct an exponential family of densities based on single hidden layer sigmoidal networks, which is especially designed to well-approximate the class of densities characterized by Assumption 2.2.

3.1 A Semiparametric Family of Densities. Let φ: R → R denote a sigmoidal function satisfying the following properties.
Assumption 3.1 (Barron 1994). Assume that:

1. φ(u) → 1 as u → ∞ and φ(u) → 0 as u → −∞.

2. |φ(u)| ≤ 1 and |φ(u) − φ(v)| ≤ A₁′ |u − v| for all u, v ∈ R and for some A₁′ > 0. Set A₁ = max(1, A₁′).

3. |φ(u) − 1_{u>0}| ≤ A₂′ / |u|^{A₃} for u ∈ R, u ≠ 0, and for some A₃ > 0 and A₂′ ≥ 0. Set A₂ = max(1, A₂′).
Assumption 3.1 is satisfied, for example, by probability distribution functions such as the Gaussian, double exponential, logistic, and Cauchy. For n ≥ 1, let γ_n = n(d + 2). For 1 ≤ i ≤ n, let a_i ∈ R^d, let b_i ∈ R, and let c_i ∈ R. We define a γ_n-dimensional parameter vector θ as follows:

θ = (a₁, a₂, …, a_n; b₁, b₂, …, b_n; c₁, c₂, …, c_n)    (3.1)
Now, define a single hidden layer sigmoidal network with n "hidden units" f_θ : B → R parameterized by θ as follows:

f_θ(x) = Σ_{i=1}^n c_i φ(a_i · x + b_i)    (3.2)
and define a real-valued normalizing bias weight corresponding to θ as

c_θ = −ln ∫_B exp[f_θ(x)] dx    (3.3)

Now, define an exponential density p_θ : B → R parameterized by θ as follows:

p_θ(x) = exp[ c_θ + f_θ(x) ]    (3.4)

Observe that for all x ∈ B, we have p_θ(x) > 0, and that ∫_B p_θ(x) dx = 1. Thus, p_θ is a density on B, and correspondingly (c_θ + f_θ) is a log-density on B.
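Numerically, c_θ is a d-dimensional integral that can be approximated on a grid. The following sketch is our own illustration with d = 1, n = 2, a logistic φ, and arbitrary made-up parameter values; it confirms that the resulting p_θ integrates to one by construction.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# A single hidden layer network f_theta on B = [-1/2, 1/2] (d = 1, n = 2 units).
a = np.array([3.0, -2.0])   # input weights a_i
b = np.array([0.5, 0.1])    # biases b_i
c = np.array([1.0, -0.7])   # output weights c_i

x = np.linspace(-0.5, 0.5, 2001)   # integration grid over B
dx = x[1] - x[0]
f = sigmoid(np.outer(x, a) + b) @ c        # equation 3.2: sum_i c_i phi(a_i x + b_i)

c_theta = -np.log(np.sum(np.exp(f)) * dx)  # equation 3.3, Riemann-sum integral
p = np.exp(c_theta + f)                    # equation 3.4

print(np.sum(p) * dx)   # ~= 1 by construction
```

Because the same grid sum is used to define c_θ and to integrate p_θ, the normalization holds up to floating-point rounding.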
For τ₁, τ₂ > 0, define a compact subset Θ^{(n)}(τ₁, τ₂) of R^{γ_n} parameterized by τ₁ and τ₂ (equation 3.5). Now, define a constant m_n in terms of the constants A₁, A₂, and A₃ of Assumption 3.1, and define

S^{(n)} = Θ^{(n)}(2C, m_n)    (3.6)
From now on, for each fixed n, we restrict θ to take values in the set S^{(n)}. For θ ∈ S^{(n)}, we have from equations 3.2 and 3.6 and Part (2) of Assumption 3.1 that for x ∈ B,

|f_θ(x)| ≤ 2CA₁    (3.7)

It then follows from equations 3.3 and 3.7 that

|c_θ| ≤ 2CA₁    (3.8)

Now, define the exponential family of densities parameterized by elements of S^{(n)} as

Γ^{(n)} ≡ { p_θ : θ ∈ S^{(n)} }

In other words, Γ^{(n)} denotes an exponential family of densities based on single hidden layer sigmoidal networks with n hidden units, such that the parameters of the single hidden layer sigmoidal networks are constrained to take values in the set S^{(n)}. It follows from Lemma 5.5 (by setting ε_{n,N} = 0) that the set of functions ∪_{n≥1} Γ^{(n)} is dense, in the Kullback-Leibler sense, in the class of all densities satisfying Assumption 2.2. Given N observations {X_i}_{i=1}^N, it is our ultimate objective to postulate an estimator for the density p* from the class of densities Γ ≡ ∪_{n≥1} Γ^{(n)}. In the next subsection, working toward that objective, we construct a sequence of subsets of Γ, namely {Γ_N}_{N≥1}, indexed by the sample size N. Then, for each given N, we will select our estimator p̂_N for p* from Γ_N.
3.2 A Sequence of Sieves. Let

θ′ = (a′₁, a′₂, …, a′_n; b′₁, b′₂, …, b′_n; c′₁, c′₂, …, c′_n) ∈ R^{γ_n}

and let θ be as in equation 3.1; then we define a metric ρ_n on R^{γ_n}.
For each fixed n, given an ε > 0, we define an ε-net of S^{(n)} to mean a finite set T_n ⊂ S^{(n)} such that for every θ ∈ S^{(n)} there exists an element θ′ ∈ T_n satisfying ρ_n(θ′, θ) ≤ ε. For ε > 0, define the grids

E(ε) = { (2εi₁/d, …, 2εi_d/d) : i₁, i₂, …, i_d = 0, 1, −1, 2, −2, … } ⊂ R^d

F(ε) = { iε : i = 0, 1, −1, 2, −2, … } ⊂ R

G_n(ε) = { θ : for j = 1, 2, …, n, c_j ∈ F(2Cε/n), b_j ∈ F(ε), a_j ∈ E(ε) } ⊂ R^{γ_n}

For each fixed n, we let {ε_{n,N}}_{N≥1} be a strictly decreasing sequence of positive real numbers decreasing to zero; that is, as N → ∞ we have ε_{n,N} ↓ 0. Then, for each fixed N, we define a set T_{n,N} as

T_{n,N} = G_n(ε_{n,N}) ∩ S^{(n)}    (3.9)
Lemma 3.1. Let 0 < ε_{n,N} ≤ 1; then T_{n,N} of equation 3.9 is an ε_{n,N}-net of S^{(n)} such that

card(T_{n,N}) ≤ 2^{L_{n,N}}

where card(T_{n,N}) denotes the cardinality of the set T_{n,N}.

Proof. See Lemma 2 in Barron (1994). □
Let {K_N}_{N≥1} denote a sequence of increasing positive integers such that K_N ↑ ∞ as N → ∞. Now, define the class of discretized parameters as

Θ_N ≡ ∪_{n=1}^{K_N} T_{n,N}    (3.10)

For a given N, we will restrict our estimation procedure to search only over the class of densities parameterized by elements of Θ_N, namely,

Γ_N ≡ { p_θ : θ ∈ Θ_N } ⊂ Γ

Γ_N is known as a "sieve." As N → ∞, the sieves are parameterized by progressively larger numbers of parameters (since K_N → ∞). Also, as N → ∞, the sieves are parameterized by progressively finer parameters (since ε_{n,N} ↓ 0). Notice that we have yet to make K_N and ε_{n,N} precise; we will prescribe appropriate values for K_N and ε_{n,N} in Theorem 4.1.
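The discretization that produces the sieve can be sketched numerically. The snippet below is our own illustration: we quantize the output weights c_i onto the grid F(2Cε/n) and the biases b_i onto F(ε), and check that each parameter moves by at most half a grid spacing.

```python
import numpy as np

def snap(values, spacing):
    """Round each value to the nearest point of the grid {i * spacing}."""
    return np.round(np.asarray(values) / spacing) * spacing

C, n, eps = 1.0, 4, 0.05
rng = np.random.default_rng(2)
c = rng.uniform(-2 * C / n, 2 * C / n, size=n)   # output weights (illustrative)
b = rng.uniform(-1.0, 1.0, size=n)               # biases (illustrative)

c_q = snap(c, 2 * C * eps / n)   # c_i on the grid F(2*C*eps/n)
b_q = snap(b, eps)               # b_i on the grid F(eps)

print(np.max(np.abs(c_q - c)), np.max(np.abs(b_q - b)))
```

Shrinking ε refines the grid, which is exactly the sense in which the sieves become "progressively finer" as ε_{n,N} ↓ 0.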
3.3 A Measure of Description Complexity. Now, fix an N ≥ 1. Then, we wish to assign description lengths to the elements of Θ_N. For θ ∈ Θ_N, there exists an index n ∈ {1, 2, …, K_N} such that θ ∈ T_{n,N}; then define

L_N(θ) = L_{n,N} + 2 log(n + 1)    (3.11)

where L_{n,N} is as in Lemma 3.1. Now, observe that

Σ_{θ∈Θ_N} 2^{−L_N(θ)} = Σ_{n=1}^{K_N} 2^{−2 log(n+1)} ( Σ_{θ∈T_{n,N}} 2^{−L_{n,N}} ) ≤(a) Σ_{n=1}^{K_N} (n + 1)^{−2} ≤ 1    (3.12)

where (a) follows from Lemma 3.1. Also, observe that for all θ ∈ Θ_N, we have

L_N(θ) ≥ 2    (3.13)
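Equation 3.12 is a Kraft-style inequality, and step (a) only needs card(T_{n,N}) ≤ 2^{L_{n,N}}. A toy numerical check with made-up cardinalities (purely illustrative):

```python
import math

# Hypothetical per-level cardinalities card(T_{n,N}) for n = 1..5, with code
# lengths L_n = ceil(log2(card)) so that card * 2**(-L_n) <= 1 (step (a)).
cards = [5, 40, 300, 2500, 10**6]
L = [math.ceil(math.log2(card)) for card in cards]

total = sum(
    card * 2 ** (-(Ln + 2 * math.log2(n + 1)))  # sum over theta of 2^{-L_N(theta)}
    for n, (card, Ln) in enumerate(zip(cards, L), start=1)
)
print(total)   # bounded by sum_n (n+1)^{-2} <= 1
```

The extra 2 log(n + 1) bits spent on encoding the index n are what make the bound uniform over all levels of the sieve.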
4 Minimum Complexity Density Estimation

Let p̂_N denote an estimator for the density p* based on the observations {X_i}_{i=1}^N; then the statistical risk of using p̂_N is defined as the expected Hellinger distance

E ∫_B [ √(p̂_N(x)) − √(p*(x)) ]² dx
We now introduce a specific estimator and obtain an upper bound on its statistical risk (Theorem 4.1). Define

θ̂_N ≡ argmin_{θ∈Θ_N} { −(1/N) Σ_{i=1}^N log p_θ(X_i) + λ L_N(θ)/N }    (4.1)

where p_θ ∈ Γ_N, and define the minimum complexity density estimator as

p̂_N ≡ p_{θ̂_N}    (4.2)

where the selection of the parameter λ is discussed in Remark 4.3. The minimum complexity density estimator p̂_N was introduced by Barron and Cover (1991) in an abstract setting (along with a number of examples). In this paper, we have adapted their framework to the semiparametric family of densities introduced in Section 3. Roughly speaking, the estimator p̂_N permits the use of greater complexity models (namely,
a larger number of hidden units), only if the resulting increase in the complexity λL_N(θ)/N is offset by a matching decrease in the empirical likelihood loss −(1/N) Σ_{i=1}^N log p_θ(X_i). In other words, as can be seen from Remark 4.1 below, the minimum complexity criterion allows us to select the number of hidden units (of the neural network based density used to estimate p*) in a data-driven fashion.
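The data-driven selection can be sketched as follows. This is a schematic illustration only: to keep it short we use a histogram family with a crude description length in place of the sigmoidal family and the lengths L_{n,N} of the paper, but the penalized-likelihood criterion mirrors equation 4.1.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
X = rng.beta(2, 5, size=400)   # i.i.d. sample on [0, 1] from an "unknown" p*

def neg_log_lik(bins, data):
    """Empirical loss -(1/N) sum_i log p_theta(X_i) for a histogram density."""
    counts, _ = np.histogram(data, bins=bins, range=(0.0, 1.0))
    heights = np.maximum(counts * bins / len(data), 1e-12)   # density per bin
    idx = np.minimum((data * bins).astype(int), bins - 1)
    return -np.mean(np.log(heights[idx]))

N, lam = len(X), 1.1                      # lambda > 1, as in the theory
candidates = [2, 4, 8, 16, 32, 64, 128]   # candidate model sizes

def criterion(bins):
    L = bins * math.log2(N) + 2 * math.log2(bins + 1)  # crude description length
    return neg_log_lik(bins, X) + lam * L / N

best = min(candidates, key=criterion)
print(best)   # complexity-penalized model size
```

A very fine model drives the empirical loss down but pays a large λL/N penalty, so the minimizer lands at an intermediate complexity.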
Theorem 4.1. Suppose Assumptions 2.1, 2.2, and 3.1 hold. Let λ > 1, √N ≤ K_N ≤ N, and for some r ≥ 1/2 let ε_{n,N} satisfy

1/(2(nN)^r) ≤ ε_{n,N} ≤ 1/√N    (4.3)

Then

E ∫_B [ √(p̂_N(x)) − √(p*(x)) ]² dx = O( (log N / N)^{1/2} )

The proof can be found in Section 5.
Corollary 4.1. Suppose all the conditions in Theorem 4.1 hold; then

E ∫_B |p̂_N(x) − p*(x)| dx = O( (log N / N)^{1/4} )
The proof can be found in Section 5.
Remark 4.1. Equations 4.1 and 4.2 can also be written in the following more intuitive fashion. For 1 ≤ n ≤ K_N, write

θ̂_{n,N} ≡ argmin_{θ∈T_{n,N}} { −(1/N) Σ_{i=1}^N log p_θ(X_i) }    (4.4)

where for a given θ ∈ T_{n,N}, p_θ is defined in equation 3.4. Now, for each fixed regularization constant λ > 1, define

n̂_N ≡ argmin_{1≤n≤K_N} { −(1/N) Σ_{i=1}^N log p_{θ̂_{n,N}}(X_i) + λ [L_{n,N} + 2 log(n + 1)]/N }

where L_{n,N} is as in Lemma 3.1. We can now write θ̂_N = θ̂_{n̂_N,N}. Consequently, the minimum complexity density estimator p̂_N can be written as p̂_N = p_{θ̂_{n̂_N,N}}.
Remark 4.2. In practice, it may be difficult to obtain θ̂_{n,N} as in equation 4.4, since T_{n,N} is a discrete grid of parameters. However, at least heuristically, one may instead use

θ̃_{n,N} ≡ argmin_{θ∈S^{(n)}} { −(1/N) Σ_{i=1}^N log p_θ(X_i) }    (4.5)

where for a given θ ∈ S^{(n)}, p_θ is defined in equation 3.4. A backpropagation algorithm for minimizing equation 4.5 was described in Modha and Fainman (1994).

Remark 4.3. Regarding the selection of the parameter λ, it is seen from the proof of Lemma 5.6 (see equation 5.7), where a bound on the index of resolvability is established, that the parameter λ affects only the constant K₈, which is smallest when λ is chosen to be as small as possible. Since λ > 1, the choice λ = 1 + δ for some small δ > 0 is appropriate from a theoretical standpoint. However, in practical situations, it may be desirable to select λ in a data-driven fashion using, say, delete-one cross-validation; note that in this case establishing convergence results for the corresponding density estimator remains an open problem.
Remark 4.4. Recently, White (1992) established that a certain exponential family of densities based on single hidden layer sigmoidal networks can approximate, in L1 sense, any compactly supported density. Corollary 4.1 complements the approximation results of White by establishing that the minimum complexity density estimator constructed from the exponential family of densities based on single hidden layer sigmoidal networks introduced in Section 3 can indeed learn, from empirical observations, the class of densities made precise by Assumptions 2.1 and 2.2. Remark 4.5. Suppose that the density p' satisfies Assumptions 2.1 and 2.2. Then, we have from Corollary 4.1 that
Noticeably, the exponent of N in the rate of convergence does not depend on the dimension d. Observe that while formulating the minimum complexity density estimator p N we only assumed (see Assumption 2.2) that Je IIw111f(w)ldw < co. It is not known what would be the rate of convergence of p N under Assumption 2.2 with J& //w/lif(w)/dw < co.s > 1. Now, on the other hand, suppose Assumption 2.1 holds and also suppose that the density p* has continuous and bounded partial derivatives of total order s. If we estimate p* using an exponential family of densities based on a traditional basis such as polynomials, p N , then it is known (Barron 1991; Barron and Sheu 1991) that /PN(X)
- p*(x)l dx = Op
(
(4.7) )s""'d'
where 0, denotes the order in probability. Noticeably, the exponent of N in the rate of convergence depends on the dimension d.
Direct comparison of equation 4.6 with equation 4.7 is not possible, since the estimators p̂_N and p̃_N are formulated under different assumptions on the density p*. However, since Assumption 2.2 implies that the density p* has bounded and continuous partial derivatives of total order 1, if we set s = 1 for p̃_N, then roughly p̂_N outperforms p̃_N if d > 2.

5 Derivations
Let g₁, g₂: B → R; then define

‖g₁ − g₂‖ ≡ { ∫_B [g₁(x) − g₂(x)]² dx }^{1/2}

and, for densities p and q on B, define the relative entropy

D_c(p‖q) ≡ ∫_B p(x) log_c [ p(x)/q(x) ] dx

where c = 2 or e. It is well known that D_c(p‖q) ≥ 0 for all densities p and q. Also, D_c(p‖q) = 0 if and only if p = q almost everywhere with respect to the Lebesgue measure. Relative entropy has a number of interesting properties; see, for example, Cover and Thomas (1991), Barron and Sheu (1991), and Barron and Cover (1991).

Proof of Theorem 4.1. Define the index of resolvability (Barron and Cover 1991) corresponding to the minimum complexity density estimator p̂_N as

R_N(p*) ≡ min_{θ∈Θ_N} { D_e(p*‖p_θ) + λ L_N(θ)/N }

where p_θ ∈ Γ_N. The significance of the index of resolvability R_N(p*) stems from the following lemma, which shows that the statistical risk of the minimum complexity density estimator p̂_N is bounded from above by the index of resolvability.

Lemma 5.1. Suppose Assumption 2.1 holds; then for λ > 1,

E ∫_B ( √(p̂_N(x)) − √(p*(x)) )² dx = O[ R_N(p*) ]

Proof. See Barron (1991). □
Remark 5.1. To be sure, for Lemma 5.1 to hold we also need that the description complexities {L_N(θ)}_{θ∈Θ_N} satisfy the Kraft-McMillan inequality (equation 3.12) and the nondegeneracy condition (equation 3.13) (Barron 1991). But these conditions are satisfied simply by our construction.
To complete the proof of Theorem 4.1, we now need only to upper bound the index of resolvability R_N(p*). We now proceed toward that result using a series of lemmas.

Lemma 5.2 (Stone 1991). Let g₁, g₂: B → R be such that for x ∈ B, |g₁(x)| ≤ K and |g₂(x)| ≤ K. For k = 1, 2, define

h_k = −ln ∫_B exp[g_k(x)] dx

Then,

(h₁ − h₂)² ≤ e^{4K} ∫_B [g₁(x) − g₂(x)]² dx

Proof.

(h₁ − h₂)² = { ln ∫_B exp[g₁(x)] dx − ln ∫_B exp[g₂(x)] dx }²

where (a) follows from differentiability of the natural logarithm and from the mean value theorem. Write H₁ = ∫_B exp[g₁(x)] dx and H₂ = ∫_B exp[g₂(x)] dx. Then, H is such that

e^{−K} ≤ min{H₁, H₂} ≤ H ≤ max{H₁, H₂} ≤ e^K    (5.1)

(b) follows from differentiability of the exponential and from the mean value theorem. Also, for x ∈ B, g₀(x) is such that

−K ≤ min{g₁(x), g₂(x)} ≤ g₀(x) ≤ max{g₁(x), g₂(x)} ≤ K    (5.2)
(c) follows from the Cauchy-Schwarz inequality; and (d) follows since we have from equation 5.1 that 1/H ≤ e^K, and we have from equation 5.2 that ∫_B exp[2g₀(x)] dx ≤ e^{2K}. □

Lemma 5.3. Suppose Assumptions 2.1 and 2.2 hold; then for x ∈ B we have |ln p*(x)| ≤ C.
Proof. For x ∈ B, we have

|f°(x)| ≤(a) ∫_{R^d} |w · x| |f(w)| dw ≤(b) (1/2) ∫_{R^d} ‖w‖₁ |f(w)| dw ≤(c) C/2    (5.3)

where (a) follows since for u ∈ R we can write |e^{iu} − 1| = |i ∫₀^u e^{iv} dv| ≤ |u|; (b) follows since for x ∈ B we have |w · x| ≤ Σ_{i=1}^d (1/2)|w_i| = (1/2)‖w‖₁; and (c) follows from Assumption 2.2. Now, from equations 2.2 and 5.3 we have

|ln C*| = | ln ∫_B exp[f°(x)] dx | ≤ C/2    (5.4)

Finally, since |ln p*(x)| = |ln C* + f°(x)|, we have the result from equations 5.3 and 5.4. □

The following result captures the function approximation properties of single hidden layer sigmoidal networks.

Lemma 5.4. Suppose Assumptions 2.1, 2.2, and 3.1 hold; then there exists θ_N^{(n)} ∈ T_{n,N} such that ‖f° − f_{θ_N^{(n)}}‖ = O( C/√n + A₁ C ε_{n,N} ).

Proof. It follows from Assumptions 2.1, 2.2, and 3.1 and from Theorem 3 in Barron (1993) that there exists a θ^{(n)} ∈ S^{(n)} such that

‖f° − f_{θ^{(n)}}‖ = O( C/√n )    (5.5)

Also, it follows from Assumption 3.1 and by proceeding as in Lemma 1 of Barron (1994) that there exists a θ_N^{(n)} ∈ T_{n,N} such that

‖f_{θ^{(n)}} − f_{θ_N^{(n)}}‖ = O( A₁ C ε_{n,N} )    (5.6)

The lemma now follows from equations 5.5 and 5.6 and from the triangle inequality. □

The following result captures the density approximation properties of the exponential family of densities Γ^{(n)}.

Lemma 5.5. Suppose Assumptions 2.1, 2.2, and 3.1 hold, and define θ_N^{(n)} ∈ T_{n,N} as in Lemma 5.4; then

D_e( p* ‖ p_{θ_N^{(n)}} ) = O( C²/n + A₁² C² ε_{n,N}² )
Proof. The lemma follows from a chain of inequalities in which (a) follows since for any two densities p and q, we have from Lemma 1 in Barron and Sheu (1991)

D_e(p‖q) ≤ (1/2) e^{‖ln(p/q)‖_∞} ∫ p(x) [ ln p(x) − ln q(x) ]² dx

(b) follows from the triangle inequality and from equations 3.7 and 3.8 and Lemma 5.3; (c) follows from the triangle inequality; (d) follows from Lemma 5.2 with K = 2A₁C; and (e) follows from Lemma 5.4 and the properties A₁ ≥ 1, C ≥ 1. □
Finally, we upper bound the index of resolvability using the method developed in Modha and Masry (1996) based on the ideas in Barron (1994).
Lemma 5.6. Suppose all the conditions in Theorem 4.1 hold; then

R_N(p*) = O( (log N / N)^{1/2} )

Proof. The bound follows from a chain of inequalities in which (a) follows from equation 3.10; (b) follows from equation 3.11; (c) follows from Lemma 5.5 and Lemma 3.1; (d) follows from equations 4.3 and 3.5; (e) follows by setting K₁ = λ(1 + 2r), K₄ = 2λ, and explicit constants K₂ and K₃ involving A₁, A₂, A₃, and C; (f) follows since 1 ≤ n ≤ K_N ≤ N; (g) follows by setting K₅ = max(K₂, K₄), K₆ = 2K₃, K₇ = (3A₃ + 1 + 4A₃r)/(2A₃); (h) follows for N ≥ 2 and by setting K₈ = K₅(log K₆ + K₇); and (i) follows by selecting n = ⌈√(N/log N)⌉. Also, since N ≥ 2 and K_N ≥ √N, we have 1 ≤ ⌈√(N/log N)⌉ ≤ K_N. The proof of the theorem now follows from Lemmas 5.1 and 5.6.
Proof of Corollary 4.1. We have from p. 225 of Devroye and Györfi (1985) that

∫_B |p̂_N(x) − p*(x)| dx ≤ 2 { ∫_B [ √(p̂_N(x)) − √(p*(x)) ]² dx }^{1/2}    (5.8)

Also, we have from the Cauchy-Schwarz inequality that

E { ∫_B [ √(p̂_N(x)) − √(p*(x)) ]² dx }^{1/2} ≤ { E ∫_B [ √(p̂_N(x)) − √(p*(x)) ]² dx }^{1/2}    (5.9)

The corollary now follows from equations 5.8, 5.9, and Theorem 4.1. □

Acknowledgment
This work was supported by the Office of Naval Research under Grant N00014-90-J-1175.

References

Barron, A. R. 1991. Complexity regularization. In Proceedings NATO Advanced Study Institute on Nonparametric Functional Estimation, G. Roussas, ed., pp. 561-576. Kluwer Academic, Dordrecht, The Netherlands.
Barron, A. R. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39(3), 930-945.
Barron, A. R. 1994. Approximation and estimation bounds for artificial neural networks. Mach. Learn. 14, 115-133.
Barron, A. R., and Cover, T. M. 1991. Minimum complexity density estimation. IEEE Trans. Inform. Theory 37(4), 1034-1054.
Barron, A. R., and Sheu, C.-H. 1991. Approximation of density functions by sequences of exponential families. Ann. Statist. 19(3), 1347-1369.
Cover, T. M., and Thomas, J. A. 1991. Elements of Information Theory. John Wiley, New York.
Devroye, L., and Györfi, L. 1985. Nonparametric Density Estimation: The L1 View. John Wiley, New York.
Friedman, J. H., Stuetzle, W., and Schroeder, A. 1984. Projection pursuit density estimation. J. Am. Statist. Assoc. 79(387), 599-608.
Huber, P. J. 1985. Projection pursuit. Ann. Statist. 13(2), 435-475.
Jones, L. K. 1992. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist. 20(1), 608-613.
Modha, D. S., and Fainman, Y. 1994. A learning law for density estimation. IEEE Trans. Neural Networks 5(3), 519-523.
Modha, D. S., and Masry, E. 1996. Minimum complexity regression estimation with weakly dependent observations. IEEE Trans. Inform. Theory (to appear).
Nadaraya, E. A. 1989. Nonparametric Estimation of Probability Densities and Regression Curves. Kluwer Academic, Dordrecht, The Netherlands.
Rissanen, J. 1989. Stochastic Complexity in Statistical Inquiry. World Scientific Publishers, Teaneck, NJ.
Rissanen, J., Speed, T., and Yu, B. 1992. Density estimation by stochastic complexity. IEEE Trans. Inform. Theory 38(2), part 1, 315-323.
Scott, D. W. 1992. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley, New York.
Silverman, B. W. 1982. On the estimation of a probability density function by the maximum penalized likelihood method. Ann. Statist. 10(3), 795-810.
Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Stone, C. J. 1990. Large-sample inference for logspline models. Ann. Statist. 18(2), 717-741.
Stone, C. J. 1991. Asymptotics for doubly flexible logspline response models. Ann. Statist. 19(4), 1832-1854.
Stone, C. J. 1994. The use of polynomial splines and their tensor products in multivariate function estimation. Ann. Statist. 22(1), 118-181.
Stone, C. J., and Koo, C.-Y. 1986. Logspline density estimation. Contemp. Math. 59, 1-15.
Tapia, R. A., and Thompson, J. R. 1978. Nonparametric Probability Density Estimation. Johns Hopkins University Press, Baltimore, MD.
White, H. 1992. Parametric statistical estimation with artificial neural networks. In Mathematical Perspectives on Neural Networks, P. Smolensky, M. C. Mozer, and D. E. Rumelhart, eds. Lawrence Erlbaum, Hillsdale, NJ.
Received October 19, 1994; accepted November 21, 1995.
This article has been cited by:
M. Magdon-Ismail and A. Atiya. 2002. Density estimation and random variate generation using multilayer networks. IEEE Transactions on Neural Networks 13:3, 497-520.
Xiaohong Chen and H. White. 1999. Improved rates and asymptotic normality for nonparametric neural network estimators. IEEE Transactions on Information Theory 45:2, 682-691.
Communicated by Steven Nowlan
Modeling Conditional Probability Distributions for Periodic Variables

Christopher M. Bishop
Ian T. Nabney
Neural Computing Research Group, Department of Computer Science and Applied Mathematics, Aston University, Birmingham B4 7ET, U.K.
Most conventional techniques for estimating conditional probability densities are inappropriate for applications involving periodic variables. In this paper we introduce three related techniques for tackling such problems, and investigate their performance using synthetic data. We then apply these techniques to the problem of extracting the distribution of wind vector directions from radar scatterometer data gathered by a remote-sensing satellite. 1 Introduction
Many applications of neural networks can be formulated in terms of a multivariate nonlinear mapping from an input vector x to a target vector t. A conventional neural network approach, based on least squares, for example, leads to a network mapping that approximates the regression (i.e., the conditional average) of t given x. For mappings that are multivalued, however, this approach breaks down, since the average of two solutions is not necessarily a valid solution. This problem can be resolved by recognizing that the conditional mean is just one aspect of a more complete description of the relationship between input and target, obtained by estimating the conditional probability density of t conditioned on x, written as p(t | x). The least-squares approach then corresponds to maximum likelihood for the special case in which p(t | x) is modeled by a gaussian distribution which is spherically symmetric in t-space and which has an x-dependent mean. Although techniques exist for modeling general conditional densities when the target vectors lie in Euclidean space, they are not appropriate when the targets are periodic. Direction and (calendar) time are two quantities that are periodic and that occur frequently in applications. In this paper, we introduce three general techniques for modeling the conditional distribution of a periodic variable. We then investigate and compare their performance using synthetic data, as well as data collected from the ERS-1 remote sensing satellite. Neural Computation 8, 1123-1133 (1996) © 1996 Massachusetts Institute of Technology
2 Density Estimation for Periodic Variables
A commonly used technique for unconditional density estimation is based on mixture models of the form

p(t) = \sum_{i} \alpha_i \phi_i(t)   (2.1)

where the \alpha_i are called mixing coefficients, and the component functions, or kernels, \phi_i(t) are typically chosen to be Gaussians (McLachlan and Basford 1988; Titterington et al. 1985). To turn this into a model for conditional density estimation, we simply make the mixing coefficients, as well as any adaptive parameters in the component densities, into functions of the input vector x. To achieve this we set the mixing coefficients and parameters from the outputs of a neural network that takes x as input. This approach underlies the "mixture of experts" model (Jacobs et al. 1991) and has also been considered by a number of other authors (Bishop 1994; Liu 1994). In this section we extend this technique to provide three distinct methods for modeling the conditional density p(\theta | x) of a periodic variable \theta conditioned on an input vector x. We also compare these methods with earlier approaches for treating periodic variables.

2.1 Mixtures of Wrapped Normal Densities. The first technique that we consider involves a transformation from a Euclidean variable \chi \in (-\infty, \infty) to the periodic variable \theta \in [0, 2\pi) of the form \theta = \chi \bmod 2\pi. This can be visualized as wrapping the infinite real line around a circle of unit radius. It induces a transformation that maps density functions p with domain R into density functions \tilde{p} with domain [0, 2\pi) as follows:

\tilde{p}(\theta | x) = \sum_{n=-\infty}^{\infty} p(\theta + 2\pi n | x)   (2.2)

It is clear by construction that the function \tilde{p} satisfies the periodicity requirement \tilde{p}(\theta + 2\pi | x) = \tilde{p}(\theta | x). It is also normalized on the interval [0, 2\pi), provided p(\chi | x) is normalized on R, since

\int_0^{2\pi} \tilde{p}(\theta | x)\, d\theta = \sum_{n=-\infty}^{\infty} \int_0^{2\pi} p(\theta + 2\pi n | x)\, d\theta = \int_{-\infty}^{\infty} p(\chi | x)\, d\chi = 1   (2.3)

Various choices for the component density functions that make up the mixture \tilde{p}(\theta | x) are possible, but here we shall restrict attention to
functions that are Gaussian of the form

\phi_i(\chi | x) = \frac{1}{(2\pi)^{1/2} \sigma_i(x)} \exp\left\{ -\frac{[\chi - \mu_i(x)]^2}{2\sigma_i(x)^2} \right\}   (2.4)

where \chi \in R. The transformed density function is known as the "wrapped normal" distribution (Kotz and Johnson 1992). The density function p(\chi | x) is modeled using a combination of a neural network and a mixture model as described above. In this paper we use a standard multilayer perceptron with a single hidden layer of sigmoidal units and an output layer of linear units. To ensure that the mixture model in equation 2.1 is a density function, it is necessary that the mixing coefficients \alpha_i(x) satisfy the constraints

\sum_i \alpha_i(x) = 1, \qquad 0 \le \alpha_i(x) \le 1   (2.5)

for all x. This can be achieved by choosing the \alpha_i(x) to be related to the network outputs by a normalized exponential, or softmax, function (Jacobs et al. 1991)

\alpha_i(x) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}   (2.6)

where the z_i represent the corresponding network outputs. The centers \mu_i of the kernel functions are represented directly by the network outputs; this is motivated by the corresponding choice of an uninformative Bayesian prior, assuming that the relevant network outputs have uniform probability distributions (Berger 1985; Jacobs et al. 1991). The standard deviations \sigma_i(x) of the kernel functions represent scale parameters, and so it is convenient to represent them in terms of the exponentials of the corresponding network outputs. This ensures that \sigma_i(x) > 0 and discourages \sigma_i(x) from tending to 0. Again, it corresponds to an uninformative prior in the Bayesian framework. The adaptive parameters of the model (the weights and biases in the network) are optimized by maximum likelihood. In practice it is convenient to minimize an error function E given by the negative logarithm of the likelihood function. Derivatives of E with respect to the network weights can be computed using the rules of calculus (Bishop 1994), and these derivatives can then be used with standard optimization procedures to find a minimum of the error function. The results presented in this paper were obtained using a conjugate gradient algorithm. One limitation of the maximum likelihood approach is that it leads to biased solutions, since it underestimates the variance of a distribution in regions of low data density (Bishop 1995). An extreme example occurs if a component density function collapses onto one of the data points, giving zero variance and an infinite likelihood. For the applications reported in this paper, this effect will be small since the number of data points
is large and we are dealing with a one-dimensional target space. The use of an exponential relationship between the variance and the network output, discussed above, also helps to avoid pathological solutions. In a practical implementation, it is necessary to restrict the range of n in the summation. We have taken the summation over seven complete periods of 2\pi, spanning the range (-7\pi, 7\pi). Since the component Gaussians have exponentially decaying tails and the standard deviations are typically no larger than 2\pi, this truncation introduces negligible error, provided that care is taken in initializing the network weights so that the kernels have their means in the central interval (-\pi, \pi).

2.2 Mixtures of Circular Normal Densities. The second approach to periodic conditional density estimation is also based on a mixture of kernel functions, but in this case the kernels themselves are periodic, thereby ensuring that the overall conditional density function will be periodic. The particular form of kernel function that we use can be motivated by considering a vector v in two-dimensional Euclidean space for which the probability distribution p(v) is a symmetric Gaussian. By using the transformation v_x = \|v\| \cos\theta, v_y = \|v\| \sin\theta, we can show that the conditional distribution of the direction \theta, given the vector magnitude \|v\|, is given by

p(\theta) = \frac{1}{2\pi I_0(m)} \exp\{ m \cos(\theta - \mu) \}   (2.7)

which is known as a circular normal or von Mises distribution (Mardia 1972). The normalization coefficient is expressed in terms of the zeroth-order modified Bessel function of the first kind, I_0(m), and the parameter m (which depends on \|v\| in this derivation) is analogous to the inverse variance parameter in a conventional normal distribution. The parameter \mu corresponds to the mean of the density function. Again the parameters \alpha_i(x), \mu_i(x), and m_i(x) in the corresponding mixture model are determined by the outputs of a neural network taking x as input, and the network weights are determined by minimizing the negative log likelihood defined with respect to the training data. Because I_0(m) is asymptotically an exponential function of m, some care must be taken in the implementation of the error function and its derivatives to avoid overflow in the results of intermediate calculations.
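The overflow issue mentioned above can be handled by working with the log density and the exponentially scaled Bessel function. The sketch below assumes NumPy and SciPy (the paper itself prescribes no implementation); it uses the identity log I_0(m) = m + log i0e(m), where i0e(m) = e^{-m} I_0(m).

```python
import numpy as np
from scipy.special import i0e  # exponentially scaled Bessel: i0e(m) = exp(-m) * I0(m)

def log_circular_normal(theta, mu, m):
    """Log of the circular normal (von Mises) density of equation 2.7.

    Writing log I0(m) = m + log(i0e(m)) keeps the computation finite even for
    large m, where I0(m) itself would overflow in floating point.
    """
    return m * np.cos(theta - mu) - (np.log(2.0 * np.pi) + m + np.log(i0e(m)))
```

The same trick applies inside the negative log likelihood and its derivatives, since only log I_0(m) ever needs to be evaluated.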
2.3 Expansion in Fixed Kernels. The third and final technique introduced in this paper involves a conditional density model consisting of a fixed set of periodic kernels, again given by circular normal functions as in equation 2.7. In this case the mixing proportions alone are determined by the outputs of a neural network (through a softmax activation function, equation 2.6), and the centers \mu_i and width parameters m_i are fixed. We have selected a uniform distribution of centers, and set m_i = m for each kernel, where the value for m was chosen to give moderate overlap between the component functions. Clearly a major drawback of fixed-kernel methods is that the number of kernels must grow exponentially with the number of output-space variables. This is an example of the "curse of dimensionality" (Bellman 1961; Bishop 1995). For the single output variable considered here, however, the number of kernel functions that is required is small, and the technique can be regarded as practical.

2.4 Related Work. The problem of modeling periodic variables has been well studied. Mardia (1972) provides an introduction to conventional statistical approaches, including the modeling of simple distributions. An approach to the problem of modeling more complex distributions involving multiple variables is that of directional unit Boltzmann machines (DUBM), described in Zemel et al. (1995). That paper develops the theory of a Boltzmann machine whose units have associated densities that are circular normal distributions. However, their approach is not suitable for the applications considered here for two reasons. First, the applications we consider have real-valued inputs, but in the DUBM all units must be directional. Second, the applications have a multimodal conditional distribution of the target variable, whereas the deterministic version of the DUBM models the output density with a single circular normal, which is adequate only for unimodal distributions. The stochastic version does not suffer from this restriction, but requires extremely long training times.
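The pieces of Section 2.1 fit together as follows. The sketch below assembles the wrapped-normal mixture of equations 2.1, 2.2, and 2.6 from raw network outputs for a single input x; the function and argument names are ours, not the paper's, and NumPy is assumed.

```python
import numpy as np

def wrapped_normal_mixture(theta, z_alpha, mu, z_sigma, n_periods=7):
    """Conditional density p~(theta | x) built from raw network outputs.

    Mixing coefficients come from a softmax (equation 2.6), standard deviations
    are exponentials of outputs so that sigma > 0, and the wrapping sum of
    equation 2.2 is truncated to +/- n_periods periods, as in the text.
    """
    z = z_alpha - np.max(z_alpha)
    alpha = np.exp(z) / np.sum(np.exp(z))          # softmax: nonnegative, sums to 1
    sigma = np.exp(z_sigma)                        # guarantees sigma > 0
    theta = np.asarray(theta, dtype=float)
    p = np.zeros_like(theta)
    for n in range(-n_periods, n_periods + 1):     # truncated wrapping sum
        chi = theta + 2.0 * np.pi * n
        for a, m, s in zip(alpha, mu, sigma):
            p += a * np.exp(-(chi - m) ** 2 / (2.0 * s * s)) / (np.sqrt(2.0 * np.pi) * s)
    return p
```

Because the wrapping sum merely reindexes the real line, the result integrates to one over [0, 2\pi) up to the (negligible) truncation error, exactly as argued in Section 2.1.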
3 Application to Synthetic Data
To test and compare the methods introduced in Section 2, we first consider a simple problem involving synthetic data, for which the true underlying distribution function is known. This data set is intended to mimic the salient features of the real data to be discussed in the next section. It is generated from a mixture of two triangular distributions whose centers and widths are taken to be linear functions of a single input variable x, and whose mixing coefficients are fixed at 0.6 and 0.4. Any values of \theta that fall outside (-\pi, \pi) are mapped back into this range by shifting in multiples of 2\pi to give a distribution that is periodic. An example data set generated in this way is shown in Figure 1. Three independent data sets (for training, validation, and testing) were generated from this distribution, each containing 1000 data points. For each technique, training runs were carried out in which the number of hidden units, as well as the number of kernels in the mixture model, were varied systematically to determine good values by minimizing the error obtained on the validation set. Table 1 gives a summary of the best
Figure 1: (a) Scatter plot of the synthetic training data. (b) Contours of the conditional density p(\theta | x) obtained from a mixture of adaptive circular normal functions as described in Section 2.2. (c) The distribution p(\theta | x) for x = 0.5 (solid curve) from the adaptive circular normal model, compared with the true distribution (dashed curve) from which the data were generated. (d) The same data as in (c) shown as a polar plot.
results from each of the three methods. We see that, for this data set, the best results, as determined from the test set, were obtained using the mixture of adaptive circular normal functions. Plots of the corresponding distributions are shown in Figure 1.
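A generator in the spirit of the synthetic data described above can be sketched as follows. The mixture structure (two triangular components, mixing coefficients 0.6 and 0.4, centers and widths linear in x, wrapping by shifts of 2\pi) is taken from the text; the particular linear coefficients are illustrative guesses, since the paper does not state them.

```python
import numpy as np

def sample_synthetic(n, rng):
    """Sample (x, theta) pairs from a mixture of two triangular densities with
    mixing coefficients 0.6 and 0.4, centers and widths linear in x, and theta
    wrapped into [-pi, pi) by shifts of 2*pi. Linear coefficients are illustrative.
    """
    x = rng.random(n)
    component = rng.random(n) < 0.6
    center = np.where(component, -1.0 + 2.0 * x, 1.0 + 1.5 * x)
    width = np.where(component, 0.5 + 0.5 * x, 0.4 + 0.3 * x)
    theta = rng.triangular(center - width, center, center + width)
    theta = np.mod(theta + np.pi, 2.0 * np.pi) - np.pi   # wrap into [-pi, pi)
    return x, theta
```

Plotting theta against x for a few thousand samples produces a scatter plot qualitatively like Figure 1a.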
Table 1: Results Obtained Using Synthetic Data.^a

Method   Centers   Hidden units   Validation error   Test error
1        6         7              1177.1             1184.4
2        6         8              1109.5             1133.9
3        36        7              1184.6             1223.5

^a Method 1: Mixture of wrapped normal functions. Method 2: Mixture of adaptive circular normal functions. Method 3: Mixture of fixed kernel functions.
4 Application to Radar Scatterometer Data

One of the original motivations for developing the techniques described in this paper was to provide an effective, principled approach to the analysis of radar scatterometer data from satellites such as the European Remote Sensing satellite ERS-1 (Thiria et al. 1993; Bishop and Legleye 1995). The ERS-1 satellite is equipped with three C-band radar antennae that measure the total backscattered power (written as \sigma_0) along three directions relative to the satellite track, as shown in Figure 2. When the satellite passes over the ocean, the strengths of the backscattered signals are related to the surface ripples of the water (on a scale of a few centimeters), which in turn are determined by the low-level winds. Extraction of the wind vector from the radar signals represents an inverse problem that is typically multivalued. Although determining the wind speed is relatively straightforward, the data give rise to aliases for the wind direction. For example, a wind direction of \theta will sometimes give rise to radar signals similar to those of a wind direction of \theta + \pi, and there may be further aliases at other angles. A conventional neural network approach to this problem, based on a least-squares estimate of \theta, would predict directions given by conditional averages. Since the average of several valid wind directions is not itself a valid direction, such an approach would clearly fail. In this paper we show how such problems can be avoided by extracting a complete distribution function of wind directions, conditioned on the satellite measurements. For this application, the modeling of the conditional distribution of wind direction provides the most complete information for the next stage of processing, which is to "dealias" the wind directions by combining information from groups of wind-field cells, together with prior knowledge, to determine the most probable overall wind field.
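The failure of the conditional average can be seen in a two-line numerical illustration (our own, using NumPy): averaging a direction and its \pi-alias produces an angle that coincides with neither valid direction.

```python
import numpy as np

# Two equally plausible wind directions: a direction and its pi-alias.
theta = np.pi / 4
aliases = np.array([theta, theta + np.pi])

# A least-squares network would predict their conditional average ...
arithmetic_mean = aliases.mean()   # 3*pi/4, halfway between the two aliases

# ... which coincides with neither valid direction.
assert not np.any(np.isclose(arithmetic_mean, aliases))
```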
A large data set of ERS-1 measurements, spanning a wide range of meteorological conditions, has been assembled by the European Space Agency in collaboration with the UK Meteorological Office. Labeling of the data set was performed using wind vectors from the Meteorological
Figure 2: Schematic illustration of the ERS-1 satellite showing the footprints of the three radar scatterometers.

Office Numerical Weather Prediction model. These values were interpolated from the inherently coarse-grained model to regions coincident with the scatterometer cells. The data that were selected for the experiments reported in this paper were collected from low-pressure (cyclonic) and high-pressure (anticyclonic) circulations. These conditions, rather than cases that were homogeneous or with a simple gradient in speed or direction, were chosen to provide a more challenging task to test the modeling techniques. Ten wind fields from each of the two categories were used: each wind field contains 19 x 19 = 361 cells, each of which represents an area of approximately 26 x 26 km. After removal of incomplete data, this resulted in training, validation, and test sets each containing 1963 patterns. The inputs used for modeling the data were the three values of \sigma_0 for the aft-beam, mid-beam, and fore-beam, together with the sine of the incidence angle of the mid-beam, since this angle strongly influences the
Table 2: Results on Satellite Data.^a

Method   Centers   Hidden units   Validation error   Test error
1        4         20             2474.6             2446.2
2        6         20             2308.0             2337.9
3        36        24             2028.9             1908.9

^a Method 1: Mixture of wrapped normal functions. Method 2: Mixture of adaptive circular normal functions. Method 3: Mixture of fixed kernel functions.
reflected signal received by the scatterometer. Each \sigma_0 input was scaled to have zero mean and unit variance, while the fourth input value was passed to the network unchanged. The target value was expressed as an angle clockwise from the satellite's forward path and converted to radians. Again, a conjugate gradient algorithm was used to optimize the network weights. Table 2 gives a summary of the preliminary results obtained with each of the three methods. As expected, the fact that this is a more complex domain than the synthetic problem meant that there were more difficulties with local optima. In fact, over 75% of the training runs ended with the network trapped in a poor minimum of the error function. This problem was considerably reduced (to about 25% of the runs) by initializing the network weights so that the initial centers of the kernel functions were approximately uniformly spaced in [0, 2\pi). Of the adaptive-center models, the one with six centers has the lowest error on the validation data; however, fewer centers are actually required to model the conditional density function reasonably well. This can also be seen in Figure 3, which shows the conditional distribution of wind directions at a typical data point from the test set, and which clearly has fewer than eight peaks.
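The input and target preparation described above can be sketched as follows; the function and argument names are ours, and NumPy is assumed.

```python
import numpy as np

def preprocess(sigma0, incidence_deg, direction_deg):
    """Build network inputs and targets: standardize each of the three sigma0
    channels, pass sin(incidence angle) through unchanged, and express the wind
    direction in radians. sigma0 has shape (n_cells, 3)."""
    s = (sigma0 - sigma0.mean(axis=0)) / sigma0.std(axis=0)    # zero mean, unit variance
    x = np.column_stack([s, np.sin(np.radians(incidence_deg))])  # fourth input unchanged
    t = np.radians(direction_deg)                                # target angle in radians
    return x, t
```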
5 Discussion
In this paper we have introduced three distinct but related techniques for modeling the conditional probability distribution of a periodic variable, and we have illustrated the use of these techniques in a simple synthetic problem, and on radar scatterometer data from a remote sensing satellite. All three methods give reasonable results, with the adaptive-kernel approaches somewhat outperforming the fixed-kernel technique on synthetic data, and vice versa on the scatterometer data. A conventional network approach, involving the minimization of a sum-of-squares error function or the use of a DUBM, would perform very poorly on these problems since the required mapping is multivalued.
Figure 3: Linear and polar plots of the conditional distribution p(\theta | x) for a sample input vector from the test set. The dominant alias at \pi is evident. In both plots, the solid curve represents method 1, the dashed curve represents method 2, and the curve with diamonds represents method 3.

The two fully adaptive methods (methods 1 and 2) give largely similar results. This is not unexpected, since the kernels used are similar functions. Generalizing the approach, there is a range of possible models with different parameters fixed: the third method is an extreme case in which the only adaptive parameters are the mixing coefficients. One aspect of these algorithms that is more complex than conventional network optimization is the problem of model order selection. The incorporation of a mixture model means that there are two structural parameters to select: the number of hidden units in the network and the number of components in the mixture model. Changes to either of these parameters will change the number of adaptive weights in the network, and so the two parameters are closely coupled. In this paper we have varied both of these structural parameters in a systematic way and sought the optimum network by measuring performance on an independent validation set. It is likely that the use of a larger, fixed network structure, together with regularization to control the effective model complexity, will significantly simplify the process of model order selection.
Acknowledgments
We are grateful to the European Space Agency and the UK Meteorological Office for making available the ERS-1 data. The contributions of Claire Legleye in the early stages of this project are also gratefully acknowledged. We would also like to thank Iain Strachan and Ian Kirk
Conditional Probability Distributions
1133
of AEA Technology for a number of useful discussions relating to the interpretation of this data.

References

Bellman, R. 1961. Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton, NJ.
Berger, J. O. 1985. Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer, New York.
Bishop, C. M. 1994. Mixture density networks. Tech. Rep. NCRG/94/004, Neural Computing Research Group, Aston University, U.K.
Bishop, C. M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Bishop, C. M., and Legleye, C. 1995. Estimating conditional probability distributions for periodic variables. In Advances in Neural Information Processing Systems, D. S. Touretzky, G. Tesauro, and T. K. Leen, eds., Vol. 7, pp. 641-648. MIT Press, Cambridge, MA.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991. Adaptive mixtures of local experts. Neural Comp. 3, 79-87.
Kotz, S., and Johnson, N. L., eds. 1992. Encyclopedia of Statistical Sciences, pp. 381-386. John Wiley, New York.
Liu, Y. 1994. Robust neural network parameter estimation and model selection for regression. In Advances in Neural Information Processing Systems, Vol. 6, pp. 192-199. Morgan Kaufmann, San Mateo, CA.
Mardia, K. V. 1972. Statistics of Directional Data. Academic Press, London.
McLachlan, G. J., and Basford, K. E. 1988. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.
Thiria, S., Mejia, C., Badran, F., and Crepon, M. 1993. A neural network approach for modeling nonlinear transfer functions: Application for wind retrieval from spaceborne scatterometer data. J. Geophys. Res. 98(C12), 22827-22841.
Titterington, D. M., Smith, A. F. M., and Makov, U. E. 1985. Statistical Analysis of Finite Mixture Distributions. John Wiley, New York.
Zemel, R. S., Williams, C. K. I., and Mozer, M. C. 1995. Lending direction to neural networks. Neural Networks 8, 503-512.
Received June 9, 1995; accepted December 19, 1995
ARTICLE
Communicated by Christian Omlin
The Dynamics of Discrete-Time Computation, with Application to Recurrent Neural Networks and Finite State Machine Extraction Mike Casey^1 Department of Mathematics, University of California, San Diego, La Jolla, CA 92093 USA
Recurrent neural networks (RNNs) can learn to perform finite state computations. It is shown that an RNN performing a finite state computation must organize its state space to mimic the states in the minimal deterministic finite state machine that can perform that computation, and a precise description of the attractor structure of such systems is given. This knowledge effectively predicts activation space dynamics, which allows one to understand RNN computation dynamics in spite of complexity in activation dynamics. This theory provides a theoretical framework for understanding finite state machine (FSM) extraction techniques and can be used to improve training methods for RNNs performing FSM computations. This provides an example of a successful approach to understanding a general class of complex systems that has not been explicitly designed, e.g., systems that have evolved or learned their internal structure.

1 General Discussion
In spite of the interest and progress in the study of recurrent neural networks (RNNs) in the last several years, RNNs are still often seen largely as black boxes without a solid theoretical understanding of how they compute. Many authors have directly or indirectly approached the problem of understanding their inner workings (Cleeremans et al. 1989; Cummins 1993; Giles et al. 1992; Yamauchi and Beer 1994; Tino et al. 1995b)^2

^1 Present mailing address: Volen Center for Complex Systems, Brandeis University, Waltham, MA 02254.

^2 Yamauchi and Beer (1994) independently developed a very nice method of analysis of continuous-time RNNs based upon ideas from dynamical systems theory that parallels ours in several ways. The methods that we develop in this paper are, however, more powerful in several ways, and can be used so long as the desired behavior is discrete in time (if what is essential about the inputs and outputs does not change at arbitrarily short time intervals). In this case we can deduce the existence of attractors and other dynamics of the continuous-time systems by deducing these properties for the maps we get by allowing the system to flow for a fixed period of time, which is a considerable extension of their methods.
Neural Computation 8, 1135-1178 (1996) © 1996 Massachusetts Institute of Technology
but have provided only a limited understanding for several special cases of the problem. Kolen (1994) expressed this fundamental lack of understanding by stating that "we are still unsure as to what these networks are doing." The results in this paper answer that question for general RNNs performing finite state machine (FSM) computations, which can then give insight into other types of computation that physical systems perform. Rather than discussing the dynamics of several classes of RNNs performing specific FSM computations, we address the deeper question: "What dynamics must an RNN possess to perform a given FSM computation?" We present a unified approach to understanding RNN solutions to a wide variety of problems that can be formulated in terms of FSMs. By describing what all well-trained networks have in common, this approach allows one to effectively predict RNN dynamics in a network that performs a given FSM computation based upon examining the deterministic finite automaton (DFA) diagram associated with the FSM computation. As we shall see, this knowledge of the necessary dynamic properties of trained RNNs also makes it clear what properties the RNN must have to correctly respond to all possible inputs, thereby giving a precise understanding of generalization to previously unseen inputs. Recently proposed measures of RNNs and their dynamics aim to elucidate the workings of RNNs. For example, Cleeremans et al. (1989), Elman (1989), and Giles et al. (1990) proposed that cluster analysis of hidden unit activation dynamics yields an adequate understanding of RNN behavior. Others (Pollack 1991) have taken time series from hidden unit phase space. Still others, for example, Williams and Zipser (1989), used connectivity strengths between units. In this paper we will show that none of these measures is sufficient for understanding RNN behavior in general, and we will describe more useful measures.
This paper builds upon ideas first discussed in Casey (1993) and later published in the author's thesis, Casey (1995b). During the preparation of this work, we found that Cummins (1993) had independently discovered some similar results. Our theory includes his results as a special case and answers several questions he poses. Furthermore, by identifying which "environmental constraints" are most important for the task being performed, we are further able to understand why his experiments yielded the results that they did. More recently, we discovered that Tino et al. (1995b) also studied the same problems we study and used an approach very similar to ours, but with more emphasis placed on the use of tools from dynamical systems theory to study solutions and the use of bifurcation analysis to study RNN changes during training. This work may be seen as a discussion of representations. We show what must be represented if a system is to perform an FSM calculation. These representations are given in terms of properties of the individual dynamical systems (those dynamical systems we get by giving the system a constant input) and in terms of abstract sets in the RNN phase space.
This is a departure from discussing representations in terms of traces of phase space activities or in terms of the connectivity strengths between units, which may change drastically as the model changes. Many types of RNNs have been shown to be capable of performing FSM computation (Siegelmann and Sontag 1992; Giles et al. 1992); in particular, RNNs of the type studied in Williams and Zipser (1989) and Giles et al. (1992) can be used to model DFA. Omlin and Giles (1994) provide a constructive proof that an (N + 2)-unit second-order RNN is capable of modeling any given N-state DFA. This result shows that the theory that we generate later in this paper is not vacuous, since there always exist RNNs that perform any given FSM computation. Rather than constructing specific solutions to the problem of modeling various DFA, we solve the converse problem by showing what all solutions must have in common, and thereby give a sort of minimal complexity in terms of dynamical properties of the RNN. We have greatly benefited from the framework for solving problems with complicated temporal constraints that Giles et al. (1992) constructed. Hence, when it is useful to speak of a particular type of model, we will use their general framework, though our results are more general and apply to the first-order and the simple recurrent network (as in Elman 1989) cases, and easily extend to problems with any finite number of inputs or outputs. More precisely, we use the following equations for the examples:

x_k(t+1) = S\left( \sum_{i=1}^{N} \sum_{j=1}^{M} W_{kij}\, x_i(t)\, y_j(t) \right)

where x(t) is the state vector of activations of the network's N units, y(t) is the vector of M inputs to the network, and W_{kij} is the second-order weight between the ith unit and the jth input for the kth unit. S is a fixed discriminant function. For the examples, we will use the logistic squashing function 1/(1 + e^{-x}), but the details of the models in our examples are not important. To use our theory, the important requirement of a model is that there be as many distinct dynamical systems as there are possible inputs (as is the case for first- and second-order RNNs, Elman networks, and many variations of these basic RNN architectures).
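The update rule above can be sketched in a few lines (our own NumPy rendering, not the paper's code). With a one-hot (unary) input y, the weight tensor reduces to a single slice W[:, :, j], so each input symbol induces its own dynamical system on the state space.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def second_order_step(x, y, W):
    """One update of the second-order network:
    x_k(t+1) = S( sum_{i,j} W[k, i, j] * x_i(t) * y_j(t) ),
    with S the logistic squashing function."""
    return logistic(np.einsum('kij,i,j->k', W, x, y))
```

For a one-hot y selecting symbol j, this is exactly logistic(W[:, :, j] @ x), which is the i-map for that symbol in the terminology introduced below.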
2 Definitions
Let I = [0, 1] and i \in \{e, 0, 1\}.

Definition 2.1. Consider an N-unit RNN with a fixed input vector, i. The mapping from the unit N-cube (I^N) into itself will be called the i-map.
We will view an i-map as a dynamical system, with the weights of the RNN as parameters of the system. For a second-order network with a unary encoding,^2 changing the input is equivalent to changing all of the weights in the network when the RNN is viewed as a dynamical system. Hence, a second-order network with M binary input lines can be viewed as M + 1 independent dynamical systems.^3 We now generalize the definition of an i-map:

Definition 2.2. For the string of symbols I = i_1 i_2 \cdots i_n, define the I-map to be the composition of the i_j-maps.

We now turn to DFA. Here and following, when applicable, we shall use the notation of Lewis and Papadimitriou (1981), including the definition of a DFA.

Definition 2.3. A deterministic finite state automaton (DFA) is a quintuple M = (Q, \Sigma, \delta, s, F), where Q is a finite set of states, \Sigma is an alphabet, s \in Q is the initial state, F \subseteq Q is the set of final states, and \delta, the transition function, is a function from Q \times \Sigma to Q.

In this paper we assume that all DFA models are minimal with respect to the number of states for the language that they recognize. DFA can be viewed as language recognition devices by saying that a string is accepted or rejected if, after beginning in the initial state and then reading that string, the DFA is in a final state or a nonfinal state, respectively. Well-known results of automata theory state that the languages that FSM can recognize are precisely the set of languages called the regular languages, and to each regular language there is associated a unique minimal DFA (up to a relabeling of states) that recognizes that language (see Lewis and Papadimitriou 1981). These results motivate the following definitions:

Definition 2.4. Let C be a mapping from a collection of input strings over an alphabet \Sigma_1 to a collection of output strings over an alphabet \Sigma_2; we will call any such mapping a computation. C is said to be a finite state machine computation if there exists a DFA, M_C, which can perform an equivalent mapping. If M_C is minimal (with respect to the number of states), then M_C is called the DFA associated with C.

For the sake of clarity we will restrict our attention to FSM computations with output string alphabets containing only two symbols, corresponding to accepting or rejecting the input string. Returning to properties of DFA, we have the following definitions.

^2 A unary encoding of N symbols uses N input lines.
Each symbol is identified with an input string containing a one in one of the N positions and zeros in all others.
2The dynamical systems are independent in the sense that their weights (parameters)
Discrete-Time Computation
Definition 2.5. For n ≥ 1 and a string I of length n, define an n-cycle of the I-map to be an ordered set of n states Q_I = {q_1, ..., q_n} such that when I is read, the DFA will pass through these states, beginning and ending with q_1.

Note that in this definition we are combining the I-map of the RNN and the states of the DFA, which are not related a priori, but we will show that the two correspond in a well-defined manner.

Definition 2.6. If I is a single symbol, i, then we call a 1-cycle of the i-map, {q_1}, a fixed state of the i-map.

Definition 2.7. If I is the composition of a symbol, i, with itself n times, then we say that Q_I is a period n state orbit of the i-map.

Definition 2.8. Call a state that is not in a period n state orbit (n ≥ 1) for an i-map a transient state of the i-map.
Definition 2.9. For all I that contain at least two distinct symbols, we say that an n-cycle, Q_I, is a fake period n orbit of the I-map. Such orbits are called "fake" because they can be made of transient states of the intermediate i_j-maps.

Definition 2.10. We say that an n-cycle is reduced if the q_j are distinct.
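Definitions 2.5 through 2.8 can be made concrete on a finite state set. The sketch below (the map f and the state names are our own toy example, not from the text) classifies each state of a single-symbol map f: Q → Q as lying on a period n state orbit (with n = 1 giving a fixed state) or as a transient state of the i-map.

```python
# Classify the states of a single-symbol i-map f: Q -> Q (finite Q).
def classify(f, states):
    # A state q lies on a cycle of f iff iterating f from q returns to q.
    cycles, transients = {}, set()
    for q in states:
        seen, x = [q], f[q]
        while x != q and x not in seen:
            seen.append(x)
            x = f[x]
        if x == q:
            cycles[q] = len(seen)      # period of the state orbit through q
        else:
            transients.add(q)          # never returns to q: a transient state
    return cycles, transients

# Hypothetical i-map: 'd' is a fixed state, 'b' and 'c' form a period-2 orbit,
# and 'a' is a transient state that feeds into that orbit.
f = {"a": "b", "b": "c", "c": "b", "d": "d"}
cycles, transients = classify(f, f.keys())
print(cycles)      # {'b': 2, 'c': 2, 'd': 1}
print(transients)  # {'a'}
```

A fake periodic orbit (Definition 2.9) would instead arise only from composing two or more distinct such maps, so it cannot be detected by inspecting any single f in isolation.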
To speak of the RNN performing an FSM computation, we need some way of discretizing the output of the RNN. In particular, we need some way of determining whether the RNN has accepted or rejected the input string that it has read. To this end, we have the following definition.

Definition 2.11. Let there be two disjoint, closed subsets of I^N, A and R. We call A the accept region and R the reject region of the RNN, or simply the accept and reject regions, if we consider the network to accept the string when the network's state is in A and reject it when the state is in R.
If necessary, we can easily generalize this definition (and our theory) to FSM computations that have many distinct outputs. The reader should keep in mind that these sets are necessarily chosen before there can be any discussion of the physical system performing a computation. A particular choice for these sets determines which computations a family of dynamical systems can perform, and, as we shall see, the sets A and R will ultimately determine the organization of the internal dynamics of an RNN performing an FSM computation. As a simple illustration of how the choice of sets will determine the computation, if one chooses the entire phase space to be, say, the accept region,4 then any family of
4We don't recommend this.
Mike Casey
dynamical systems will be performing an FSM computation with respect to that choice. Namely, they will perform the trivial computation involved in accepting all strings, which puts no interesting restrictions on the system's dynamics. The results in this paper will be interesting only in the case that reasonable choices are made for A and R, such as those we choose in the Examples section. For the remainder of the paper, let C be an FSM computation, M_C be the DFA associated with C, and L_C be the regular language that is associated with C.
Definition 2.12. An RNN is said to perform the computation C, or simply perform C, if after reading any valid input string S the RNN state is in its accept region if S is in L_C, and the RNN is in its reject region if S is not in L_C.

We assume that all strings begin with an "empty" symbol, e.5 e will be in the language or not depending upon whether the empty string is in L_C or not, and eS will be in L_C if S minus its initial e is in L_C. Hence an arbitrary string S will begin with e and will be followed by any string in Σ*.
Note that we do not explicitly require the RNN to perform the computation in the same way that an FSM does. That is, we do not require the RNN to explicitly model the states and transition functions of any particular DFA that performs the FSM computation. One of the main results of our theory is to show that, to perform the computation, the RNN must organize itself in a way that corresponds to the organization of the minimal DFA that performs the FSM computation.

Some authors (notably Giles et al. 1992) have used an end symbol to "give the network more power in deciding the best final state configuration." This is equivalent to having an additional hidden layer to output function (as in an Elman network). This can be useful in as much as it can sometimes allow an RNN with fewer hidden units to model a given DFA by finding its own representation of the accept and reject regions (which would be the inverse images of the accept and reject regions under the end-map), but it will not add any computational power that a prudent choice of accept and reject regions would not also give. This is neither necessary nor useful for our examples, so we do not use an end symbol.

Before continuing we will explain why it will be desirable to consider only solutions that tolerate a finite amount of noise. First, physical systems will always contain some noise. In particular, in organisms living in and interacting with a complicated environment there will inevitably be perturbations of the organism from sources other than those that are directly related to the computation in which we are interested, and we would like to consider only systems that can still perform correctly in these circumstances. In other words, it is desirable, for example,

5The purpose of e is to allow the RNN to initialize its internal state to correctly classify the empty string.
to consider only systems that can still add even though the temperature changes. A second reason to consider solutions with a finite amount of noise is that if we are implementing the system on a digital computer then there will be noise in the system due to round-off errors. So we would like to restrict our attention to solutions that work in spite of round-off errors. A third reason to consider robust solutions is that if we are using a learning algorithm to find the solution, then we would like to consider solutions that will be well behaved with respect to small perturbations in the weights (and hence tolerant of small perturbations in the input), since these are the solutions that can be learned or evolved (except possibly while using teacher-forcing, but a solution found in this way is likely to be destroyed when teacher-forcing is turned off). As will be seen in the Examples section, if we do not assume that there is some noise in the system, then it is possible to find RNN solutions to FSM computation tasks that will generically fail to work when given even the slightest perturbation to their weights, activations, or inputs. To rule out these solutions that may require noiseless operation of the neural network or physical system, we will require that the computation be robust in the following sense:

Definition 2.13. The system is operating in a noisy environment if, for some fixed ε > 0, the state variables of the RNN, x(t), may be replaced by any point y(t) such that |y(t) − x(t)| < ε at each time step, t (where |y − x| denotes the Euclidean distance between x and y).

That is, we are concerned with ε-pseudo-orbits rather than the true orbits of our state variables under the action of the maps. For detailed discussions of ε-pseudo-orbits, consult Shub (1987) and Bowen (1975).

Definition 2.14. Fix an ε > 0, let f be a map of I^N into itself, and let x ∈ I^N. An ε-pseudo-orbit of x under f is a sequence of points x̄ = {x_0, x_1, ...} such that |f(x_{i−1}) − x_i| < ε for all i ≥ 1, where x_0 = x.

Unless noted otherwise, whenever we speak of the iteration of a point we will assume that the system is operating in a noisy environment.

Definition 2.15. If an RNN performs an FSM computation C, and continues to produce the correct output when operating in a noisy environment for some fixed ε > 0, then it is said to robustly perform C, or, if L_C is the regular language associated with C, the RNN is said to recognize L_C.

Again, since we are concerning ourselves with solutions that have been acquired through an optimization technique, our assumption that the solution be robust should almost always be satisfied when the RNN is well-trained. The reader may wish to jump to the Examples section at this point to see Example 2 for an illustration of the difference between robustly performing an FSM computation and simply performing an FSM computation.
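Definition 2.14 can be sketched numerically: an ε-pseudo-orbit perturbs each iterate of f by at most ε. The map, noise model, and tolerance below are our own illustration; for a contraction, every ε-pseudo-orbit remains trapped near the attracting fixed point, which is the intuition behind the pseudoattractors defined in the next section.

```python
# A sketch of an eps-pseudo-orbit: each step applies f, then a bounded perturbation.
import random

def pseudo_orbit(f, x0, eps, n, rng):
    orbit = [x0]
    for _ in range(n):
        # |f(x_{i-1}) - x_i| < eps, as in Definition 2.14
        orbit.append(f(orbit[-1]) + rng.uniform(-eps, eps))
    return orbit

f = lambda x: 0.5 * x            # a contraction with attracting fixed point 0
rng = random.Random(0)
orbit = pseudo_orbit(f, 1.0, 0.01, 50, rng)

# Despite the noise, every pseudo-orbit of this contraction stays near 0:
# a steady-state bound is |x| <= 0.5|x| + eps, i.e. |x| <= 2*eps eventually.
print(abs(orbit[-1]) < 0.03)
```

Replacing the contraction with a map that has an unstable fixed point would instead let the noise push pseudo-orbits away, which is exactly the distinction robustness is meant to capture.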
See Devaney (1987) or Wiggins (1990) for definitions and background from dynamical systems theory. Since we are concerned with systems operating in the presence of noise, we will need to define an attractor for such systems, and to do that we need a notion of recurrence. For ε-pseudo-orbits, the recurrent points are those that are ε-pseudoperiodic:

Definition 2.16. A point, x, is said to be ε-pseudoperiodic for a map f if there exists an ε-pseudo-orbit, x̄, such that x = x_n for some n > 0. For the least such n, we say that the point is ε-pseudoperiodic of period n.

One set that is often useful for talking about ε-pseudo-orbits is the chain recurrent set, defined to be the set of all points that are ε-pseudoperiodic for all ε > 0. Since we will always have a finite amount of noise in our systems, the following will be more useful:

Definition 2.17. For a fixed ε > 0 and a mapping f, the ε-chain recurrent set, R_ε(f), is the set of all points that are ε-pseudoperiodic for f.

We will need some general point-set topology for the following definitions and proofs of theorems. For the necessary definitions, one can consult Munkres (1975). In particular, one should be familiar with open and closed sets, accumulation points, and dense sets. Intuitively, the attractors of a dynamical system are the sets of points to which most points evolve under iteration, but there is no one universally accepted definition of an attractor. Milnor (1985) reviews some of the existing definitions and proposes another, but none of these is well-suited to noisy systems. We create the following definition, which seems to correspond to the intuitive notion of an attractor for systems with noise:

Definition 2.18. A subset A of R_ε(f) is an ε-pseudoattractor or behavior if

1. it is a subset of R_ε(f) having a dense ε-pseudo-orbit that is maximal in the sense that there is no x̄ that is dense in a set B of which A is a proper subset, and

2. A is a proper subset of the ε-pseudobasin of attraction, which is the set of all points whose ε-pseudo-orbits are eventually contained in A, and is denoted β(A).
For the sake of understanding the results in this paper, one can think of attractors rather than ε-pseudoattractors, since it follows from the definition that the existence of an ε-pseudoattractor will imply the existence of an attractor for the noiseless system, and if the noiseless system's attractor is sufficiently robust (as attracting fixed points, in particular, generally are) then there will be a corresponding ε-pseudoattractor for sufficiently low levels of noise. But one should be a little careful, since it is possible to have an attractor in the noiseless system that will not correspond to an ε-pseudoattractor for even a small amount of noise in the system.
For example, the much-studied logistic map, x_{t+1} = γ x_t (1 − x_t), has an attractor in [0,1] for γ in (1,4), but will not have an ε-pseudoattractor for any ε > 0.5 (for example), since the ε-pseudo-orbits of any point contain points that tend to minus infinity (we may perturb a point to kick it out of [0,1] and then leave it unperturbed, for example). The amount of noise necessary to destroy the attractor goes to zero as γ approaches 4. It should be noted that it can be very difficult to concisely describe the set of points that forms an ε-pseudoattractor, but this is not a problem for us, since the details of their orbit structure are of no consequence for our theory. Intuitively, an ε-pseudoattractor is just a very robust attractor. It corresponds to an oscillation or a steady state of the system that is robust in the sense that a slightly perturbed system will settle to one of these asymptotic behaviors, and stable in the sense that once the system has been perturbed into one of these behaviors the ambient noise in the system will not be sufficient to kick the system out of that behavior [though changing the input to the system will often do so (and will necessarily do so in RNNs performing certain types of computation, which we will describe)].

Definition 2.19. An I-map is said to be hysteretic or multistable if it has more than one ε-pseudoattractor, and an RNN is hysteretic or multistable if for some string of inputs I, its I-map is hysteretic.
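The logistic-map observation can be checked numerically: a single perturbation of sufficient size kicks a point just below 0, after which the ordinary (unperturbed) iterates tend to minus infinity, so no set in [0,1] can serve as an ε-pseudoattractor. The parameter value γ = 3.9 and perturbation size are our own choices for illustration.

```python
# One eps-perturbation destroys the logistic map's attractor in [0,1]:
# for x < 0 we have gamma*x*(1-x) < x, and the iterates run away to -infinity.
def logistic(x, gamma=3.9):
    return gamma * x * (1.0 - x)

x = -0.01          # a single perturbation of size 0.01 past the left endpoint of [0,1]
for _ in range(10):
    x = logistic(x)  # subsequent iterates are left unperturbed

print(x < -1e6)    # True: the pseudo-orbit has escaped toward minus infinity
```

This is a pseudo-orbit with exactly one nontrivial perturbation, so it qualifies as an ε-pseudo-orbit for any ε > 0.01, matching the claim that the attractor's robustness to noise degrades as orbits approach the endpoints of [0,1].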
Hence a hysteretic map is a dynamical system that embodies several different robust, nonlinear oscillations that are selected based upon initial conditions. Multistable systems have been used as associative memories, as in Hopfield (1982), and will be necessary in our systems to allow them to have very stable, long-lasting memories.

Definition 2.20. A transient of an I-map is a point that is not in R_ε(I-map).
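A minimal one-dimensional illustration of multistability (the map below is our own toy example, not an RNN i-map): a map with two attracting fixed points settles to one or the other depending on its initial condition, which is exactly the kind of initial-condition-selected "memory" described above.

```python
# A bistable one-dimensional map: x -> tanh(3x) has an unstable fixed point at 0
# and two attracting fixed points near +/-0.995, giving two coexisting behaviors.
import math

def f(x):
    return math.tanh(3.0 * x)

for x0 in (-0.2, 0.2):
    x = x0
    for _ in range(50):
        x = f(x)           # iterate toward whichever attractor x0 selects
    print(x0, "->", round(x, 3))   # -0.2 -> -0.995, 0.2 -> 0.995
```

Because the slope at each attracting fixed point is small (about 0.03), both fixed points persist under a finite amount of noise, so this map is hysteretic in the sense of Definition 2.19.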
Using the above definitions, we divide FSM computation dynamics into two categories: hysteretic and transient computation.

Definition 2.21. Hysteretic computation is computation that can be performed using periodic orbits (of periods ≥ 1, where a periodic orbit of period one is a fixed point) of the I-map of an RNN.

Definition 2.22. Transient computation is computation that must be performed using transients of the I-map.

3 Results
Our first theorem shows why clustering and FSM extraction succeed and provides a theoretical foundation for representation issues related to clustering and extraction. Intuitively, it says that if an RNN performs an FSM computation, then it must organize itself so that it models the minimal DFA that performs the same FSM computation.
We remind the reader that C is an FSM computation, that M_C is the DFA associated with C, and L_C is the regular language associated with C. Assume further that the set of states Q of M_C is given a fixed ordering so that we can speak of the states q_j of M_C.

Theorem 3.1. The state space of an RNN robustly performing C must have mutually disjoint, closed sets Q_j with nonempty interiors, corresponding to the states q_j of M_C.
This is not to say that the only thing that RNNs can do is to mimic DFA, since there are many RNNs that will not have the properties necessary to enable them to have FSM behavior. One property of what we are calling computation that will necessarily be violated is the condition that our computations are algorithmic. This is equivalent to saying that the RNN with a given choice of A and R will not act as a mapping from the collection of input strings to the collection of output strings, since the RNN can have different outputs for the same input strings.6 Therefore, behaviors that are unpredictable at any time are excluded from this definition of computation.7 Whether or not RNNs that violate this predictability condition are performing something that can be characterized as useful computation is left to be demonstrated, but we conjecture that such examples can be easily found, and probably already exist in many places in the RNN literature and elsewhere.

An anonymous reviewer has pointed out to the author that in the noiseless case, these properties (excepting nonempty interior) would follow from a more algebraic approach, where points in the phase space are defined to be equivalent if they produce the same output (in {0,1}) for all possible input strings. This type of approach is an old one and is

6Note that this is not necessarily the result of chaotic dynamics or the low level of noise that we are assuming is in the system, but rather is a reflection of the interaction of the system's continuous dynamics with the discrete output. This will be discussed further in the FSM extraction section.
7A definition that goes back to Turing (1936).
described in Arbib and Zeiger (1969), Kalman et al. (1969), and Nerode (1958), and more recently in Tino et al. (1995b). As discussed earlier, however, robustness is a very important property, since it allows one to focus on understanding error-correcting RNN solutions that can be found using learning algorithms. Furthermore, robustness, as we shall see, is essential for proving our second theorem, and, as mentioned earlier, when speaking of learned solutions, the robustness assumption should almost always be satisfied, so robustness is useful for putting our study in the proper context. It is for these reasons that we state and prove the theorem in this context rather than referring the reader to the literature for the same results in the noiseless case.
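The minimality fact used in the proof below — in a minimal DFA, any two distinct states q_j, q_k are separated by some string F that leads one to a final state and the other to a nonfinal state — is constructive. A breadth-first search over pairs of states finds such an F; the three-state machine here (counting 1s mod 3) is our own example.

```python
# Find a string F separating two states of a (minimal) DFA, as used in the
# disjointness step of the proof of Theorem 3.1.
from collections import deque

def distinguishing_string(delta, alphabet, finals, qj, qk):
    # BFS over pairs of states reachable from (qj, qk) for a separating suffix F.
    seen, queue = {(qj, qk)}, deque([(qj, qk, "")])
    while queue:
        a, b, F = queue.popleft()
        if (a in finals) != (b in finals):
            return F               # F is final from one state, nonfinal from the other
        for sym in alphabet:
            pair = (delta[(a, sym)], delta[(b, sym)])
            if pair not in seen:
                seen.add(pair)
                queue.append((*pair, F + sym))
    return None                    # only possible if the DFA is not minimal

# Example (ours): DFA accepting strings whose number of 1s is divisible by 3.
delta = {("s0", "0"): "s0", ("s0", "1"): "s1",
         ("s1", "0"): "s1", ("s1", "1"): "s2",
         ("s2", "0"): "s2", ("s2", "1"): "s0"}
print(distinguishing_string(delta, "01", {"s0"}, "s1", "s2"))  # -> "1"
```

Minimality guarantees the search always terminates with some F, which is exactly what the proof uses to show that the sets Q_j must be disjoint.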
Proof. Since, by the aforementioned results from automata theory, performing the computation is equivalent to performing the same function from input strings to outputs as M_C, it is true that if we start M_C in its initial state and give the RNN its empty string symbol input (to initialize it), and then both systems read an arbitrary string of inputs from {0,1}*, then the outputs of M_C and the RNN must be the same. Assume that for some state q_j there are no points in I^N that correspond to q_j (that is, assume Q_j does not exist for some j). This is equivalent to saying that there is a string S1 that, if read, will leave M_C in q_j (since M_C is minimal, it is connected) and a string S2,8 such that if S2 is then read, there is no point in the RNN phase space (I^N) that will produce the correct output for all substrings of S1S2 and will also produce the correct output for S1S2. But this says that the RNN will fail to correctly classify S1S2 or one of its substrings, which contradicts our assumption that the RNN recognizes L_C. Hence the Q_j exist. The Q_j have nonempty interiors by the fact that the RNN is assumed to robustly perform the computation.

To see that the Q_j must be mutually disjoint, assume that there is some point, x, in the intersection of two of these sets, Q_j and Q_k for j not equal to k, but that the RNN still robustly performs the computation. Since M_C is minimal, there exists a string, which we call F, such that after reading F, M_C will be in a final state if M_C was in q_j, and in a nonfinal state if M_C was in q_k. Now assume that the RNN is in state x. After reading F, the RNN will be in either the accept region or the reject region or neither.9 Therefore, the RNN fails to robustly perform the computation, which is a contradiction.

Since the i-maps are continuous, the ε-pseudo-orbits of the accumulation points of any set are the accumulation points of the ε-pseudo-orbits of the points in that set. Hence it follows from the definition of the accept and reject regions as closed sets that the RNN will

8Which may be empty.
9Since there is noise in the system, it is true that on any given run it will be in exactly one of these three regions, but we cannot immediately rule out that the region in which it is may change from run to run. This turns out to be the only essential difference in this proof for noisy and noiseless systems. Fortunately it does not change the results.
not fail for any of the accumulation points of Q_j, and therefore the Q_j are closed. □

The fact that the Q_j will be closed follows from the definition of the accept and reject regions as closed sets and the continuity of the i-maps. If we were to define the accept and reject regions as open sets, then the Q_j could be open. If we did not require the i-maps to be continuous, then we would not be able to conclude that the Q_j must be closed. We have chosen to define the accept and reject regions as closed sets so that, when combined with our assumption that the phase space is bounded, we may conclude that the Q_j are compact. Compactness of the Q_j will be used in the applications to FSM extraction, but is not essential for obtaining the extraction results, since we could work with the compactifications of the Q_j (which will still be disjoint by the robustness assumption and the continuity of the i-maps) and then see that the conclusion holds for the interiors of the compactified Q_j. Robustness is not necessary for the proof of Theorem 3.1 as stated, except in showing that the Q_j have nonempty interiors.

Since for the rest of the paper it will be convenient to speak of a computation directly in terms of DFA, we make the following definition.
Definition 3.1. If an RNN robustly performs a computation C, then it is said to robustly model M_C.

Theorem 3.1 shows that this is a reasonable definition. The following lemma will be the key to proving our next theorem. Its proof is in the appendix.
Lemma 3.1 (The Wandering Lemma). Suppose some closed set B with nonempty interior contains no behaviors. Then for any ε > 0 and for any point x ∈ B there is some finite number N such that the Nth point in some ε-pseudo-orbit of x will leave B.

There may be redundancy in an RNN solution in the sense that there may be several regions in the RNN phase space that correspond to a single information processing state.10 In particular, there may be several behaviors in the network that correspond to a single fixed state of the i-map. For this reason, the following definition will be useful:
Definition 3.2. When there is more than one behavior in a single Q_j, we say that these behaviors are in the same class. If two behaviors are not in the same class, then we say that they are distinct.

The following definition will allow us to focus only on points that potentially enter into the computation.
Definition 3.3. A point in the RNN phase space is said to be reachable if it is in an ε-pseudo-orbit of some point in the image of I^N under the e-map, e-map(I^N), or in an ε-pseudo-orbit of any of these points for some

10Since the RNN may more explicitly model an unminimized DFA.
input I ∈ {0,1}*. A behavior is called reachable if any of its points are reachable.

Our next theorem goes deeper into the representation issue by giving a coordinate-independent property of the i-maps of the RNN modeling a DFA. This property can be used as a measure of complexity of the DFA in terms of the complexity of the RNN that models it.
Theorem 3.2. Suppose that an RNN robustly models a given minimal DFA with a fixed state of its i-map, q_j; then the i-map must have at least one distinct, reachable behavior contained in Q_j.

Note that this theorem is not true if the RNN does not robustly model the DFA. We will give a counterexample to that conjecture with the examples.

Proof. The proof of this theorem relies on the observation that we may test the RNN on a string of an arbitrary number of i's, and reading such a string corresponds to iterating the i-map an arbitrary number of times. If we read a string S that puts the DFA in state q_j, then after receiving the equivalent input, the RNN will be in some state x in Q_j. Now read a string containing an arbitrary number of i's, and give the corresponding input to the RNN. Suppose that Q_j contains only part of a reachable behavior, A. By the fact that A must have a dense ε-pseudo-orbit, some pseudo-orbit of x under the i-map would then leave Q_j, which is a contradiction. By the fact that the i-map maps I^N into itself, the i-map will have at least one behavior. Then by Lemma 7.1, the complement of the union of behaviors of the i-map is a nonempty closed set, and hence its intersection with Q_j is closed. Call this set C. Suppose that Q_j contains no reachable behaviors. It follows that after receiving S as input, the RNN will be in a state x ∈ C. By the wandering lemma, if we then read a large enough number of i's, some ε-pseudo-orbit of x will leave C and enter one of the behaviors of the i-map. This behavior is reachable, so by our assumption it cannot be contained in Q_j, which implies that the RNN's state leaves Q_j, which contradicts our assumption that q_j was a fixed state of the i-map. Hence Q_j must contain a reachable behavior.11 □

We have the following straightforward generalization of Theorem 3.2:
Theorem 3.3. Suppose that an RNN robustly models a given minimal DFA, M_C; then an I-map must have at least one distinct, reachable behavior for each of M_C's n-cycles.

It follows from Theorem 3.3 that if an i-map has a period n state orbit and a period m state orbit, disjoint from one another (n may or may not

11Note that the closedness of the Q_j is not essential to the argument. We could have given a shorter argument that did not use Lemma 7.1 if we wanted to make our assumption that the accept and reject regions be closed more essential.
equal m), then the RNN is hysteretic. (Note that the existence of periodic states depends only on the DFA, but the conclusion that the RNN is multistable is a statement about the RNN that models the DFA.)

Corollary 3.2. Assume that the initial …
Hysteresis implies the existence of unstable invariant sets. Hence, if hysteresis is required for a computation, then one can deduce that the weights of the trained RNN must be outside of some neighborhood of the origin. Hence, if we use the minimum distance from the origin in weight space as a measure of difficulty of modeling a DFA, or as a measure of geometric complexity of the RNN, then computations requiring hysteresis are more difficult to learn than those that do not require it. The benefit of using the minimum distance from the origin in weight space to measure geometric complexity when talking about RNNs is that it gives us an idea of how far the weights need to travel to achieve a solution. It is possible then that it will predict that the number of learning trials required to learn to model a DFA with high geometric complexity will, on the average, be higher than the number needed for a DFA with low geometric complexity. There are, however, many problems with this measure. First, a means of calculating the minimum distance is not presently known, and it is unlikely to be an easy calculation even if our knowledge of RNN dynamics were much greater than it is. It is much more reasonable to hope to calculate lower bounds on the geometric complexity than to hope to calculate the actual minimal distance. Second, it is not clear that this quantity is well-defined, since changing the dimension of the solution (for example) may change the minimal distance. Another problem is that there is no guarantee that the distance that the weights must move is a good measure, since it is very unlikely that they would follow a straight path in weight space during training.
Even if a genetic algorithm with small random weight changes is used, the path may be far from straight, since the overall fitness of the RNN will change discontinuously as we cross bifurcation boundaries for the I-maps. These are questions that require much further investigation, but preliminary experiments suggest that even this limited understanding of this measure may help in understanding the speed at which an RNN can learn to model a given DFA relative to similar DFA, and in understanding why some RNNs will converge during training and others will not. Kuan et al. (1994) have used a contraction mapping assumption to prove convergence results for RNN learning algorithms. Their work (or extensions of their work) may be used in conjunction with ours to naturally explore learnability of given tasks.

A related measure of geometric complexity of the RNN that may be useful is the minimal number of behaviors for the I-maps. Using this measure we determine that RNNs that model DFA that are hysteretic with respect to many distinct strings, or have many distinct cycles for a given string, will have high geometric complexity. It can be shown that, at least in one-dimensional dynamical systems, one needs many degrees of freedom to produce families of dynamical systems having many behaviors, and, therefore, requires many parameters in the generating functions. It is reasonable to assume that the same is true in higher dimensions, but very little is known about the relationship between the number of behaviors of a dynamical system and other properties of the mappings in dimensions greater than one. There is some experimental evidence that the minimal-number-of-behaviors measure of geometric complexity may be useful for speaking about the difficulty of learning to robustly model a DFA. An example will be given in the Examples section.
At this point, it is useful to note that a fake periodic orbit can exist when its individual i-maps are contraction mappings, so the existence of a single fake periodic orbit for an I-map should not make it more difficult to learn to model a DFA (unlike a single periodic orbit), which is another reason to call them fake orbits.
Theorem 3.4. If a minimal DFA has a state that is a transient state of the i-map, then the i-map must move a point from the ε-pseudobasin of attraction of one behavior to the ε-pseudobasin of attraction of a behavior not in the same class as the first, for some j-map, a composition of j-maps, or the projection map, where the basins of attraction of the projection map are the accept and reject regions.

This is the weakest useful statement we can make about transient states. It is useful because it states the important feature that the i-map must act productively on transients. It is the weakest statement because it does not characterize the action of the I-maps on points in phase space corresponding to an arbitrary transient state, which we can characterize in terms of intersections of iterated preimages of ε-pseudobasins of attraction under the i-maps.
The proof of Theorem 3.4 is to notice that if we are in a transient state for an i-map that is a recurrent point from the perspective of i-map dynamics, and if we then read a string of i's to iterate the i-map and get back to the initial state, then the state of the DFA will have changed, but the state of the network will remain the same, contradicting Theorem 3.1. This idea will be further investigated in the section on FSM extraction.

The distinction between transient and periodic states is important because periodic states are relatively well understood, and for RNNs with fixed discriminant functions, the dimension of the network needed to have a given combination of fixed and periodic states should be an answerable if difficult question. Transients, however, are not well understood, and the size of the RNN required to have certain kinds of transients is sensitive to the shape of the discriminant function. Empirically, large numbers of hidden units are not necessary for some DFA with large numbers of transient states, while the number of hidden units roughly scales with the number of fixed states and distinct periodic orbit states.

The theorems in this section show that RNNs that robustly perform FSM computations have many constraints on their dynamics, so if it is possible to gain a familiarity with common attractor structures for autonomous RNNs,12 one may be able to combine this knowledge with our RNN phase space constraints to outline a heuristic algorithm for studying RNNs modeling known DFA. At least for low-dimensional autonomous RNNs, it is possible to become familiar with typical dynamics,13 and for one- and two-dimensional networks one can even gain a very precise understanding of the types of dynamics that autonomous RNNs can have. Such an understanding has been outlined in continuous-time systems in Beer (1996), and for discrete-time systems in Tino et al. (1995a) and Casey (1995a), among others.
12Or analogously, if one is able to become familiar with the common dynamics of the physical system that is to be performing the computation when the system is receiving fixed inputs.

13One can get a sense of what autonomous RNNs "like" to do.

Given these understandings, first, one examines the minimal state diagram of the DFA to find all cycles. By Theorem 3.3, these cycles correspond to behaviors required of the trained RNN that correctly recognizes the language. Given an understanding of the types of attractors that autonomous RNNs typically have, one can then roughly estimate the size of the RNN needed to model the DFA, basing this estimate on the complexity of the necessary mappings rather than upon the number of states or some other measure such as the one studied in Siegelmann et al. (1992). Using our theory, one can sometimes even determine a strict lower bound on the dimension of an RNN that can model a particular DFA (as will be done for our examples), which complements the upper bound that can be obtained by the methods in Siegelmann et al. (1992). The size of the RNN needed to solve a given problem can depend critically on the discriminant function one chooses. If we allow the discriminant functions to be arbitrary by letting each input vector correspond to an arbitrary dynamical system, then it is always possible to model a DFA using only one-dimensional maps. It is easy, given a minimal DFA diagram, to construct by hand a family of dynamical systems that will robustly model that DFA. This fact suggests a more judicious choice of a family of functions to use to model DFA while inferring a DFA from examples; specifically, a family of functions whose complexity depends on the degrees of freedom in choosing the one-dimensional maps rather than on the dimension of the maps. The use of one-dimensional maps would greatly simplify the analysis we describe for determining the RNN's performance. It may also reduce the scaling of the complexity of the RNN with the complexity of the DFA, since increasing the dimension of the RNN increases the number of parameters cubically for second-order RNNs, but the number of parameters should increase only linearly with the degrees of freedom for a well-chosen family of one-dimensional maps. This question requires further investigation. The next step in our algorithm is to determine whether or not to use teacher-forcing during training by considering the number of behaviors the RNN must have. One should make sure that there are a sufficient number of training strings that contain iterated substrings corresponding to the cycles of the minimal DFA diagram. After training, one examines the stability of invariant sets of the i-maps, including their ε-pseudobasins of attraction.14 Finally, one checks that transient states map between appropriate ε-pseudobasins of attraction of the behaviors. If the trained RNN performs correctly during these tests, then it will generalize to all unseen strings. If the DFA is not known, and we have access to internal variables, then one can use this theory, possibly in conjunction with the extraction techniques of Crutchfield and Young (1989b) or Giles et al.
(1992), to determine whether or not the RNN is performing any FSM computation, rather than just approximately performing it and potentially failing at some time. It should be noted that there may be practical problems with checking stability and with other measures of RNN performance, especially in high-dimensional RNNs. Hence, applying our theory will not always be straightforward, but this is not necessarily a shortcoming of the theory. One reason is that future advances in dynamical systems theory may lead to tools that overcome these practical problems, so the shortcomings more properly belong to dynamical systems theory than to ours. A second reason is that it seems that any analysis that attempts to yield an understanding of RNN performance is likely to run into very similar complexity-based problems. Another problem with our theory that seems to require advances in dynamical systems theory is the problem of understanding complexity in networks that contain few or no cycles. Some languages that contain no cycles require that the RNN that models them be a collection of very complicated dynamical systems, but the precise sense in which they are complicated seems very difficult to describe. Again, it seems that the minimal number of DFA states is a poor measure of the difficulty of learning to model such DFA, but in this case, it is not clear that the measures that we have proposed will distinguish one such DFA from another (and in some cases, our measures clearly will not), and therefore cannot be used to rank them according to their difficulty to learn. Hence, while our theory does work well in certain cases, it is known to be incomplete. As a first small step in overcoming this problem, or possibly just to demonstrate that the problem exists, an example will be given in the following section that shows that it is sometimes constraints other than necessary attractors that necessitate moving to higher dimensions in order to find an RNN that can perform a given computation.

14This involves finding the "shapes" of unstable invariant sets, which is very difficult for high-dimensional mappings.

4 Examples
For the following problems, similar to the format in Giles et al. (1992), the RNN receives a unary encoding, with one input line for each character in {e, 0, 1}, where e marks the beginning of a string and corresponds to the empty string [rather than denoting the end of a string as it does in Giles et al. (1992)]. The task is to recognize a regular language by robustly modeling a given DFA. The accept and reject regions are the regions of the phase space that project within δ of 1.0 or 0.0 on the target unit axis, for some 0 < δ < 0.5. In the captions, wij denotes the weight from unit j to unit i, where b denotes the bias (constant input of 1) unit.

4.1 Example 1. A problem that uses hysteretic computation is recognizing the 0*1* language.15 The minimal DFA that recognizes the 0*1* language is shown in Figure 1. We see q1 and q3 are fixed states of the 0-map and q2 and q3 are fixed states of the 1-map. The arrows in Figure 2 show the direction that the point at the base of the arrow moves under iteration of the map. The distance that the point is moved is not shown, except that the relative distances are represented by the lengths of the arrows. In the case of the e-map, the point moves very close to the fixed point of the map (the point at which all arrows point). Note that the right side of the phase space (as you are looking at it) in the figure is the accept region (i.e., the region that projects to the interval [1 − δ, 1]) and the left side is the reject region, so the e symbol sets the network to accept, as it should. The 0-map has two attracting fixed points (one in the accept, and one in the reject region, corresponding to q1 and q3, respectively), and one

150*1 could also be examined, but its simplicity makes it less instructive.
Figure 1: The diagram for the minimal DFA that recognizes 0*1*.
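For concreteness, the DFA of Figure 1 can be written out as a transition table and run directly. The encoding below is an illustrative reconstruction (following the text, q1 and q3 are fixed under 0, q2 and q3 are fixed under 1, and q3 is the reject sink):

```python
# Minimal DFA for 0*1*: q1 (start, accepting), q2 (accepting), q3 (reject sink).
DELTA = {
    ("q1", "0"): "q1", ("q1", "1"): "q2",
    ("q2", "0"): "q3", ("q2", "1"): "q2",
    ("q3", "0"): "q3", ("q3", "1"): "q3",
}
ACCEPT = {"q1", "q2"}

def recognizes(string):
    """Run the DFA from the start state (the role the e symbol plays in the text)."""
    state = "q1"
    for symbol in string:
        state = DELTA[(state, symbol)]
    return state in ACCEPT

print([w for w in ["0011", "111", "0101", "1100"] if recognizes(w)])  # -> ['0011', '111']
```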
saddle-type fixed point. The curve that separates the basins of attraction for the two fixed points is the stable manifold of the saddle point, and is an example of an unstable invariant set (so one never "sees" it under iteration of the 0-map). This unstable invariant set nevertheless can be found, and plays an important role in computation, since changing its shape and position can make the network fail to solve the problem, because it forms the boundaries of the basins of attraction for the attracting fixed points. If we bent the right part of the curve down until it intersected the middle of the bottom side of the phase space, for example, then 0000001111111100000 would not be rejected, as it should be, since the basin
Figure 2: (a) A representation of the phase space dynamics of the e-map of a two-unit network that recognizes the 0*1* language. Weights: w1b = 3.212, w11 = 6.415, w12 = 5.615, w2b = 2.101, w21 = 2.161, w22 = 2.107. (b) Phase space dynamics of the 0-map of the network that recognizes the 0*1* language. Weights: w1b = -7.066, w11 = 2.789, w12 = 9.168, w2b = -2.217, w21 = 0.756, w22 = 3.743. (c) Phase space dynamics of the 1-map. Weights: w1b = -4.874, w11 = 9.386, w12 = 2.458, w2b = -1.612, w21 = -2.257, w22 = 0.115. (d) Approximations to the Q_i have been drawn directly in the phase space with transition arrows showing that the organization of the RNN is the same as the minimal DFA recognizing the same language. The scattered points in the Q_i are approximations to the reachable points of the RNN.
of attraction of the fixed point of the 0-map which is in the accept region would contain the fixed point of the 1-map which is in the accept region. The 1-map has two attracting fixed points (one in the accept, and one in the reject region, corresponding to q2 and q3, respectively), and one saddle-type fixed point. Again, the curve that separates the basins of attraction for the two fixed points is the stable manifold of the saddle point. Note that the "accept" fixed point for the 1-map is in the basin of attraction of the "reject" fixed point of the 0-map, as it must be, since, for example, 111110 is not in the language. Figure 2d shows the dynamics of the three maps superimposed upon one another, and shows very approximately what the Q_i look like. Precise descriptions of these sets are given later, in the FSM extraction section. Labeled arrows have been drawn between these sets to show that in this case, the RNN solution is essentially an embedding of the minimal DFA that recognizes the language. In general, we should not expect an embedding of the minimal DFA, but rather of some unminimized version of the DFA.

4.2 Example 2. A problem that uses only hysteretic computation is recognizing the language, L2, containing the words with 3k + 2 1s for some k ≥ 0. Figure 3 shows the minimal DFA for this language. The 0-map for this RNN is hysteretic. The DFA's states are a period three-state orbit of the 1-map. Hence, to model this DFA, an RNN might have a 0-map that has three attracting fixed points, and a 1-map with an attracting period three orbit, where two points of this orbit are in the reject region and the other is in the accept region, and the attracting fixed points of the 0-map coincide, or nearly coincide, with this period three orbit.
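The required structure can be read off a direct encoding of this DFA. In the sketch below (state encoding assumed: states count 1s modulo 3, with 2 the only accepting state), the 0-map is the identity on states and the 1-map is a three-cycle:

```python
def step(state, symbol):
    """One DFA transition: the 1-map cycles the three states, the 0-map fixes them."""
    return (state + 1) % 3 if symbol == "1" else state

def in_L2(string):
    state = 0
    for symbol in string:
        state = step(state, symbol)
    return state == 2  # accept iff the number of 1s is 3k + 2

# Three applications of the 1-map return every state to itself (a period-three orbit).
assert all(step(step(step(s, "1"), "1"), "1") == s for s in range(3))
print([w for w in ["11", "0110", "111", "10101"] if in_L2(w)])  # -> ['11', '0110']
```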
This prediction does, in fact, turn out to be correct for the 10 training trials that we have run, and seems likely to be the solution that will be found using a two-unit second-order RNN with any reasonable learning parameters. One of these solutions is seen in Figure 4. With a different set of neural network equations, or possibly when using more units, more complicated (complicated in the sense of attractor topology) solutions may be the norm. For these experiments, we generated random input strings of varying lengths, where the next symbol was 60% likely to be a zero, and 40% likely to be a one. The reason for this choice of string distribution was that in order to learn to robustly model the DFA it is useful to test the network on longer strings that correspond to hysteretic i-maps. However, periodic state orbits may also be difficult to learn, so in this case we want longer strings of zeros, but we should not emphasize them too much since we also need to learn a period three-state orbit for the 1-map. All of the examples that we studied used two-unit RNNs, and we required the RNN output to follow the DFA output at every time step, rather than simply requiring it at the end of the string. Teacher forcing was used and the learning rate was set to 1.0.
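The string-generation scheme just described can be sketched as follows (the function name and the explicit stopping probability are illustrative assumptions; the 60/40 symbol bias is from the text):

```python
import random

def training_string(p_zero=0.6, p_end=0.1, rng=random):
    """Emit symbols until a stop is drawn: each symbol is 0 with probability
    p_zero, and after each symbol the string ends with probability p_end
    (the stopping rule here is an assumption, giving varying lengths)."""
    out = []
    while True:
        out.append("0" if rng.random() < p_zero else "1")
        if rng.random() < p_end:
            return "".join(out)

random.seed(0)
print([training_string() for _ in range(5)])
```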
Figure 3: The minimal DFA that recognizes the language containing all words in {0, 1}* that contain 3k + 2 1's for some k ≥ 0.
The e-map puts the network in the part of the phase space that will go to the accept region after two iterations of the 1-map. The 0-map has three fixed states, corresponding to the three fixed points in the picture, and two saddle points. It will be seen that the fixed points nearly "sit above" the periodic orbit of the 1-map. The 1-map has a period three orbit. It is more instructive to look at the dynamics of the third iterate of the 1-map, [(1-map)^3], rather than looking directly at the 1-map and trying to see that the arrows will all match up to give this orbit. The attracting periodic orbit shows up as fixed points for (1-map)^3, and the unstable period three orbit contains the saddle points. The existence of a periodic orbit implies (by Brouwer's plane translation theorem) the
Figure 4: (a) Phase space dynamics of the e-map of the network that recognizes the language accepted by the automaton in Figure 3. Weights: w1b = -3.964, w11 = -0.978, w12 = -6.896, w2b = 1.799, w21 = 2.927, w22 = 3.359. (b) Phase space dynamics of the 0-map. Weights: w1b = -3.967, w11 = 8.861, w12 = -7.478, w2b = -3.087, w21 = 0.113, w22 = 6.133. (c) Phase space dynamics of [(1-map)^3]. Weights: w1b = 4.495, w11 = -7.922, w12 = -9.902, w2b = -2.923, w21 = 5.278, w22 = 0.151.
existence of a fixed point. In this case, the fixed point is the point at the intersection of the three basin boundaries (stable manifolds). If two of the fixed points of the 0-map were in one basin of (1-map)^3, then the network would not be able to model the DFA since a word containing
many zeros between each one would put the 1-map's period three orbit "out of phase" with the DFA. To see the utility of using ε-pseudo-orbits rather than true orbits for our theory, we will now consider a solution to modeling this DFA that uses unstable invariant sets and discuss its properties. Suppose that the e-map is a constant map, mapping all points to the fixed point of the 0-map that is in the upper left-hand corner of the phase space. Further suppose that the fixed points of the 0-map are unstable (repelling rather than attracting) and that the periodic orbit of the 1-map is unstable and maps the three fixed points of the 0-map to one another. So rather than approximately sitting above the fixed points of the 0-map, the periodic orbit of the 1-map lies exactly above them. This RNN clearly models the DFA, but it does not robustly model the DFA, since if there is any perturbation to the orbit, then even a string of several zeros with moderate perturbations at each step is likely to fail to produce the correct output, so the RNN will generally very quickly fail to perform the desired computation. Furthermore, if we perturb the maps at all and move the fixed points or the periodic orbit, then the RNN will fail to model the DFA. From this it follows that it is exceedingly unlikely that such a solution will be found while evolving the RNN. It is also clear from this example that Theorem 3.3 does not hold for RNNs that simply model rather than robustly model a DFA. One can use the intermediate value theorem to prove that a one-unit RNN with a sigmoidal discriminant function (as a one-dimensional monotone dynamical system) cannot have these dynamics, so we know that the two-unit RNN solution shown in the figures is minimal with respect to network size. We emphasize here that the solution that the RNN found could have had complexities that the simple prediction did not consider.
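The intermediate-value-theorem argument can be sketched as follows (a standard monotone-maps argument, reconstructed here rather than quoted from the appendix): a one-unit map f(x) = σ(w_b + wx) is monotone, and monotone one-dimensional maps admit no period-three orbits.

```latex
\textbf{Case } w > 0:\; f \text{ is increasing. If } f^3(x) = x \text{ with } f(x) \neq x,
\text{ say } x < f(x), \text{ then applying the increasing map } f \text{ twice gives}
\[
  x < f(x) < f^2(x) < f^3(x) = x,
\]
\text{a contradiction (and symmetrically if } x > f(x)\text{).}

\textbf{Case } w < 0:\; f \text{ is decreasing, so } g = f^3 \text{ is decreasing and }
g(x) - x \text{ is strictly decreasing; by the intermediate value theorem } g \text{ has
exactly one fixed point. The unique fixed point of } f \text{ is already fixed by } g,
\text{ so no period-three orbit exists.}
```

Since the 1-map of any RNN modeling the Figure 3 DFA must carry a period-three orbit, one unit cannot suffice.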
Besides the possibility of having many behaviors in the same class (that is, the RNN may have modeled an unminimized DFA), it is also possible that rather than using fixed points, the RNN could have had "small" chaotic attractors (small in the sense that they stay in appropriate future basins of attraction) that might lead one to believe that those dynamics are somehow necessary for the computation. So not only might there be many internal states corresponding to the same computational state, but any of those individual behaviors might also be complicated. What might be true when we observe more complicated dynamics is that, while they were not necessary from a computational perspective, they were necessary for topological reasons from the perspective of the map dynamics, or, when observing nervous system behavior, that they were necessary for biological reasons. Another question that may be investigated using our work is whether or not teacher-forcing should be used to learn a known DFA. Teacher-forcing seems to work well for learning to perform hysteretic computation, but can lead to difficulties when used to learn transient computation. One potential problem with teacher-forcing is that by restricting
activation dynamics to a subset of the phase space, it may make the desired computation more difficult or even impossible. For example, it is possible to recognize L2 with a one-dimensional RNN with discriminant functions more complicated than sigmoids. If, however, one attempted to use teacher-forcing during training, the problem would not be solvable in one dimension. Teacher-forcing would restrict the RNN to mapping between two points (one representing the fact that the DFA is in a nonfinal state, and the other representing a final state), and therefore it would be necessary for the 1-map to map the nonfinal state point to itself and also to the final state point, which is impossible. If we move to a two-dimensional network then this problem with teacher-forcing disappears, but related problems may exist.

4.3 Example 3. Our final example language, L3, is the set of all strings that begin with 0 and have no 00 or 11 substrings. Figure 5 shows the minimal DFA for L3, where states q1 and q2 are transient states and q3 is a fixed state of both the 1- and 0-map. We also see that {q1, q2} and {q3, q3} are fake periodic orbits for the 01-map. Again, we study a solution that uses two units. The e-map puts the network in the part of the phase space that will go to the reject region after two iterations of the 0-map or one iteration of the 1-map. The 0-map has only one fixed state, and its corresponding fixed point is in the reject region of phase space. To recognize the language, then, it must use transient computation. Points near the bottom right side of the phase space iterate to points in the upper right side of the phase space (still in the accept region), which are then mapped to the reject region (hence we correctly reject words containing two consecutive zeros). This is shown in the figure by showing the image of a circle initially in the region of phase space to which the e-map sends all states.
The 1-map complements the dynamics of the 0-map so that the composition of the two (the 01-map) will have a fake period two orbit in the accept region, since points on the upper right side of the phase space are mapped to points in the lower right side. Note that if one considers the 01-map to be a dynamical system and iterates it like any other (i.e., without watching the state variables after applying only the 0-map), then a fake periodic orbit will be a single attractor (such as a fixed point, as it is in this case). This is seen in Figure 6d. If we looked at the 10-map, then the fixed point would be in the lower right-hand corner. It is shown in the appendix that if we use sigmoidal discriminant functions and choose our accept and reject regions as we have in our examples, then a one-unit RNN cannot recognize L3. This is especially significant since it shows that transient computation can increase the difficulty of recognizing a regular language, since there is nothing about the hysteretic computation (namely, the fact that the 01-map must be multistable) that rules out a one-unit RNN solution. It is true, however, that if one chooses the reject region to have two connected components and for
Figure 5: The minimal DFA that recognizes the language containing all words with no 00 or 11 substrings that begin with 0.

the accept region to lie between those components, then it is possible to have a one-unit solution using monotone discriminant functions, but it is not clear a priori that one should do so, so I claim that the theorem reasonably demonstrates that transient computation increases the difficulty of learning to recognize L3.

4.4 Geometric Complexity Measure Example. Next we use our theory to study the difficulty of training an RNN to robustly model a given DFA. We study DFA M1 and M2, shown in Figures 7 and 8, respectively.
Figure 6: (a) Phase space dynamics of the e-map of a two-unit RNN that recognizes the language L3. Weights: w1b = 4.431, w11 = 1.085, w12 = -1.742, w2b = 3.243, w21 = 0.984, w22 = 1.639. (b) Phase space dynamics of the 0-map. Weights: w1b = -7.924, w11 = 3.751, w12 = 9.286, w2b = 0.236, w21 = 0.236, w22 = -3.564. (c) Phase space dynamics of the 1-map. Weights: w1b = -2.368, w11 = 7.012, w12 = -9.145, w2b = 0.347, w21 = 2.873, w22 = -4.019. (d) Phase space dynamics of the 01-map.
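Captions like these pin down the i-maps completely, so their dynamics can be checked by direct iteration. The sketch below assumes the two-unit update x_i' = σ(w_ib + w_i1 x_1 + w_i2 x_2) with logistic σ (consistent with the per-symbol weight matrices listed above, though the exact update form is an assumption) and iterates the e-map of panel (a):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# e-map weights from Figure 6a, one row (w_ib, w_i1, w_i2) per unit i = 1, 2.
W = [(4.431, 1.085, -1.742),
     (3.243, 0.984, 1.639)]

def e_map(x):
    return tuple(sigmoid(wb + w1 * x[0] + w2 * x[1]) for wb, w1, w2 in W)

x = (0.5, 0.5)
for _ in range(200):
    x = e_map(x)
print(x)  # the orbit settles onto an attracting fixed point
```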
Both are minimal, connected, and have five states. The only difference between the two is the effect of reading a zero when in the third state. For M1, the effect will be to send the DFA to state five, and for M2 the effect will be to return the DFA to its first state. We should emphasize here that these results are very preliminary, and suggest only that the
Figure 7: DFA diagram for M1.
minimal number of behaviors of an RNN modeling a DFA is a useful measure of geometric complexity. Learnability or "difficulty" of learning is a very difficult problem to study theoretically. To determine the minimal number of behaviors of an RNN modeling a DFA, we study the DFA minimal state diagram to find all reduced cycles.16 If an i-map has a reduced cycle and at least one other cycle which is not necessarily reduced, then the RNN's i-map will be hysteretic and have an ε-pseudoattractor for each of the cycles. Since we need only focus on reduced cycles for determining the minimal number of behaviors, we need only study the strings whose lengths are less than or equal to the number of states in the minimal DFA. By looking for reduced cycles in M1's diagram, we see that only its 01-map is hysteretic, since {q4, q3} and {q5, q5} are both periodic state orbits of the 01-map. M2, however, is hysteretic for its 0-map, its 01-map, and its 001-map (the 001-map requiring three attractors). For each of these DFA, we ran 10 training sessions and used our theory to determine when the RNN had learned to robustly model the DFA. For these experiments we used the same training methods as in

16Remember, a reduced cycle is one for which the states in the state orbit are distinct.
Figure 8: DFA diagram for M2.

the other examples and used a learning rate of 1.0 with teacher-forcing. All of the networks had three units and the error tolerance was 0.3. The training strings were generated randomly for each trial from a uniform distribution of bit strings. After each input symbol was generated, there was a one in ten chance that a new string would begin. M1 was learned in an average of approximately 17,000 iterations of the learning algorithm (that is, after reading 17,000 symbols). M2 was learned in an average of approximately 24,500 iterations of the learning algorithm for the eight trials where the RNN converged. An RNN was deemed to not converge if it did not robustly model the DFA after 300,000 iterations. RNNs modeling M1 converged on all ten trials. An interesting aside is that RNNs that robustly modeled M2 often used attractors other than fixed points to represent q5 as a fixed state of the 0-map. Periodic maps (period greater than one; specifically, periods two and four) and quasiperiodic attracting limit cycles were observed. A possible reason for this is that the 0-map must have an attracting periodic orbit in the reject region, and autonomous RNNs with symmetric or near-symmetric attractors seem more common than those with asymmetric attractors. Hence the more complicated dynamics are a result of regularity
in the autonomous RNN dynamics, and not because they were required for the computation.

4.5 General Discussion of Analysis. As noted in Giles et al. (1992), Watrous and Kuhn (1992), and Tino et al. (1995b), to check an RNN's ability to generalize to unseen strings, it is important to check the dynamic stability of the RNN, since the lack of stability of fixed states is likely to cause errors in network performance when the RNN is checked on long, unseen strings. Our theory allows for a more precise and general determination of this stability. While Giles et al. (1992) equate generalization to long unseen strings with fixed point behavior for fixed and periodic states, our theory shows that more general attractors may be used for the same purpose and that stability must be checked for all closed loops in the minimal DFA diagram. More importantly, since Giles et al. (1992) and Watrous and Kuhn (1992) check for fixed point behavior by reading a long previously unseen string [Watrous and Kuhn (1992) use strings of length 100], they falsely assume that if the RNN produces the correct output for a string of length N for N sufficiently large, it will necessarily produce the correct output for strings of arbitrary length.17 It is possible, and seems to be fairly common, for an RNN that has been trained to recognize a language L that requires hysteretic computation to use transients of its i-maps rather than true behaviors to correctly classify some strings in L, which leaves open a chance that the network will fail on any symbol after reading some large number of symbols. If, for example, an RNN's i-map must be multistable to model a given DFA, and is becoming multistable via a saddle node bifurcation, then just before bifurcating the i-map will map points near the site of the new fixed point very close to themselves.
This slight imprecision may be "overlooked" by future maps so long as the RNN's state stays near the site of the new fixed point, but if we read longer and longer strings, the RNN's state will eventually move to an actual behavior of the i-map. Since other "almost" invariant sets will behave similarly, we should generically expect that during training, RNNs will perform correctly on short strings just before robustly modeling a hysteretic DFA, and that one will be able to extract the correct DFA before one robustly models that DFA. This is the mechanism of error accumulation discussed in Pollack (1991). How RNNs use transients of i-maps rather than the necessary stable representations to perform correctly on short strings is studied in more depth in Tino et al. (1995b). As an interesting aside, one way to help avoid having near invariant sets work to correctly classify strings in a finite training set without robustly modeling the desired DFA may be to add noise to the phase space

17We agree that it is true that if an RNN performs correctly on very long strings, then it will likely have the necessary behaviors, but in unpublished experiments we have seen unstable representations that (to our surprise) did eventually fail even though the state variables were not perceptibly changing for more than 70 iterations.
variables during training, since that would generally cause the near invariant sets to more quickly fail to effectively serve where true behaviors are needed. This would also make the network more likely to find error-correcting solutions that are insensitive to noise in the input, since it would force the ε-pseudoattractors to be more strongly attracting, and would make the i-maps more contractive in regions of phase space corresponding to transient states, or at least make those regions larger. Another important result of our theory that we feel is worthy of more careful consideration is the fact that one must check the stability of fake periodic orbits. Fake periodic orbits have not been discussed elsewhere (to the knowledge of the author), but they are important since a lack of dynamic stability of periodic orbits and fake periodic orbits can be an even more likely source of RNN failure. One reason is that this lack of dynamic stability is less likely to cause errors while checking network performance by reading randomly generated long previously unseen strings, since iterated substrings of length greater than one are less likely to appear in randomly generated training data sets. For example, if the 011001-map is hysteretic, then the chance that one would randomly generate a string containing this substring concatenated with itself N times is (1/2)^{6N}. Hence if one is not careful when initially checking performance of a trained RNN, one may miss the fact that it does not robustly model a DFA. The problem may be further exacerbated by the fact that if the RNN has a near invariant set that allows the RNN to perform correctly on short strings, then there will be little information in the gradient to push the RNN through the bifurcation that it needs to form a dynamically stable representation of the loop.
In other words, there is little reward for forming a stable, rather than nearly stable, representation, so unless one trains the RNN with strings containing iterated substrings corresponding to loops in the minimal DFA diagram (which, as we have seen, will be very unlikely if the strings are randomly generated with a uniform distribution), one can generally expect the RNN to fail to produce a robust solution if training is stopped too quickly after the RNN appears to have learned to model a DFA. Thus we see that observing global phase space dynamics can give a clear understanding of how the RNN operates, and in particular allows for a complete understanding of the RNN's ability to generalize to strings of arbitrary length by determining the properties that the individual i-maps and compositions of these maps must have to ensure that the network will recognize the entire language rather than simply recognizing some finite subset of the language. We can also, in principle, use our theory to see if an RNN with unknown task structure is performing an FSM computation. By performing a finite number of tests, we can determine whether or not the RNN will perform correctly for an infinite number of computations. For additional examples of dynamical systems analysis of RNNs modeling DFA, including larger DFA, see Tino et al. (1995b).
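The rarity claim above is simple arithmetic: a fixed word of length 6 repeated N times occupies 6N symbol positions, and a uniformly random window of that length matches it with probability (1/2)^{6N}. A quick check (illustrative code, not from the paper):

```python
def match_probability(word_len, repeats):
    """Probability that word_len * repeats uniform random bits equal a fixed target."""
    return 0.5 ** (word_len * repeats)

for n in (1, 2, 3):
    print(n, match_probability(6, n))
```

Even N = 3 gives a chance below four in a million, so uniformly generated training data almost never exercises such loops.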
We claim that this approach is superior to observing a time series of the activations or clusters of activations, since these give an incomplete understanding of the performance of the network. For example, they do not demonstrate the robustness of the solution or say how it will perform on strings of arbitrary length. Furthermore, traces from phase space say little about the "difficulty" of evolving the ability to perform the task, since some DFA with n states require more complicated dynamical systems to model them than do other n-state DFA. Another common approach to understanding RNN solutions is to study connectivity strength patterns. This approach is analogous to a common neurobiological approach used to understand networks of neurons, and has many problems.18 Some of those problems stem from the fact that there exist many nontrivial symmetries that can act upon the system while preserving the dynamics, some of which are discussed in Casey (1995a). Other problems come from the fact that this approach is analogous to attempting to look at the parameters of a high-dimensional dynamical system and infer from them the dynamics. If the system were linear, this would be equivalent to looking at a matrix and directly, i.e., without explicit calculation, determining its eigenvalues (at least up to multiplication by a positive real number). Even for 2 × 2 matrices this is an unrealistic hope. For nonlinear systems it is even more rare that one can gain insight into the dynamics from observing the parameters, even after considerable effort to determine dependencies between parameters and dynamics, as in Casey (1995b) or Wang (1991).
5 FSM Extraction
Several authors have proposed means for producing discrete machines from systems with continuous state variables [see Crutchfield and Young (1989b) and Giles et al. (1992)]. Our theory provides a theoretical foundation for studying finite state machine (FSM) extraction from RNNs as defined in Giles et al. (1992), and, in particular, allows us to study two of their conjectures.19 As noted in Kolen (1994), FSM extraction was developed based upon the assumption that RNNs simulate FSMs in their state dynamics (which is not obviously the case). Theorem 3.1 shows that their assumption is valid if the RNN has robust FSM behavior. That is, if the RNN can be used to recognize a regular language, then it must simulate an FSM in its state dynamics. The first of the conjectures in Giles et al. (1992) is that during training the network begins to partition its state space into fairly well-separated, distinct regions or clusters that represent corresponding states in some

18For biological networks of neurons, these problems often stem from unconsidered variables such as neuromodulation and complex internal dynamics, but in the RNN case, we avoid this kind of uncertainty.

19Note that they study FSM extraction, meaning that the FSM that they extract can be either a DFA or a nondeterministic finite automaton (NDFA).
Discrete-Time Computation
finite state automaton. Theorem 3.1 shows that this is true if the RNN is to perform an FSM computation. Another of their conjectures is that for what they call "well-trained" RNNs, the extracted minimal DFA are independent of the choice of partition parameter. Note that one must have a theory such as ours, which allows for an understanding of all of the uncountably many possible RNN solutions to a given language recognition task, in order to productively study these conjectures. Giles et al. (1992) define FSM extraction as a four-part procedure: (1) cluster FSM states, (2) construct a transition diagram by connecting these states with alphabet-labeled arcs, (3) combine transitions to form the full digraph loops, and (4) reduce the digraph to a minimal representation. They partition the state space by dividing each unit's range into q partitions of equal width. Before proving the second of the aforementioned conjectures, it is instructive to examine a technical aspect of FSM extraction. As mentioned in Kolen (1994), after Crutchfield and Young (1989a), it is possible for a completely deterministic RNN to yield a nondeterministic FSA (NDFA) during extraction. Kolen uses the case where there is sensitivity to initial conditions to demonstrate the point, but this is not necessary, since a nondeterministic interpretation will be appropriate whenever one of the i-maps takes points in one partition state to points in more than one other partition state (which can be caused by a simple translation, and does not depend upon local map dynamics, though the problem should be more prevalent in locally expansive systems20 since they will yield NDFA interpretations for any sufficiently fine partition). We can reduce the problem of FSM extraction to extracting only deterministic finite state automata if we make two partition states equivalent whenever they contain reachable points that are in the image of a single partition state under the action of some i-map.
This is a reasonable thing to do, since this partition state equivalence must correspond to information processing (IP) state equivalence for RNNs that perform some FSM computation if q is large enough, as will be shown below. If q is too small, then we will generally extract an NDFA that will accept more strings than the RNN and more strings than NDFA extracted using a larger q. In general, if an RNN has FSM behavior, then it has a well-defined DFA description. If it does not have FSM behavior, then it will not have a well-defined DFA description, but it may yield a useful FSM if the states are discretized. Extracting NDFA from poorly trained RNNs is a very imprecise process, but the general utility of this process seems to be an empirical question.

Theorem 5.1. For RNNs that robustly model a given DFA, for sufficiently large q, it is sufficient to consider only partitions created by dividing each unit's range into q equal partitions to always succeed in extracting the DFA that is being modeled.

20Expansivity is necessary for the existence of sensitivity to initial conditions.
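As a concrete illustration, steps (1) and (2) of the four-part extraction procedure can be sketched in a few lines; the one-unit i-maps and the choice q = 4 below are hypothetical stand-ins (not taken from the paper), and steps (3) and (4) are left to standard digraph and automaton minimization routines.

```python
# Sketch of FSM extraction steps (1)-(2): partition each unit's range into
# q equal bins (partition states), then connect states with alphabet-labeled
# arcs by applying each i-map to sample points in each bin.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical i-maps for inputs 0 and 1 on a one-unit state space [0, 1].
i_maps = {0: lambda x: sigmoid(6.0 * x - 1.0),
          1: lambda x: sigmoid(-6.0 * x + 5.0)}

def extract_transitions(i_maps, q, samples_per_bin=10):
    """Return {(state, symbol): set of successor partition states}."""
    transitions = {}
    for state in range(q):
        lo = state / q
        for sym, f in i_maps.items():
            targets = set()
            for k in range(samples_per_bin):
                x = lo + (k + 0.5) / (samples_per_bin * q)  # sample inside the bin
                targets.add(min(int(f(x) * q), q - 1))       # bin index of the image
            transitions[(state, sym)] = targets
    return transitions

trans = extract_transitions(i_maps, q=4)
# If some (state, symbol) pair has more than one target, the extracted machine
# is an NDFA; the target partition states must then be merged, or q refined,
# as described in the text.
nondeterministic = any(len(t) > 1 for t in trans.values())
```

For this coarse partition the sampled images of a single bin straddle bin boundaries, so the raw extraction is nondeterministic; this is exactly the situation the partition-state-equivalence rule in the text is meant to repair.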
This may be a nontrivial statement, given that basins of attraction can have very complicated fractal structure [see Grebogi et al. (1987)].
Proof. Theorem 3.1 says that there are a finite number of mutually disjoint, closed (hence compact) subsets of I^N, the Q_j, that correspond to the states of the DFA. By Lemma 7.4 of the Appendix, we can separate any of the Q_j from the others (since their union is a compact set) if the partition parameter is a power of two greater than or equal to some 2^{m_j}. Hence, if we choose the largest of these m_j, then no partition state containing points in one of the Q_j will contain points in any of the others, and therefore they will be separated from one another. From the correspondence between the Q_j and the q_j it then follows that extraction will work. Theorem 3.1 then implies that for a sufficiently large partition parameter the partition states containing points in any Q_j will not contain points in any Q_k for j not equal to k.

From Theorem 5.1 and the uniqueness of minimal DFA (up to isomorphism) for a given regular language, it immediately follows that the minimal DFA will be independent of the choice of the partition parameter for q sufficiently large if the RNN robustly models a DFA. We have now proved the second conjecture from Giles et al. (1992). It follows that it is sufficient that an RNN exhibit FSM behavior in order to extract a well-defined DFA. It is not, however, necessary that an RNN exhibit FSM behavior, since, as observed in Giles et al. (1992), it is sometimes the case that an extracted FSM will outperform the RNN from which it was extracted. In these cases, however, the extraction process would not be stable, in the sense that finer partitioning would yield different FSM. For sufficiently large q, extraction should fail to produce the desired FSM, and would produce FSM that more closely reflected the RNN dynamics. If we are interested in a more precise description of the states to be extracted, then we can gain this insight by returning to a comment at the end of the Results section.
Specifically, the structure of the Q_j can be expressed in terms of the inverse images of the accept and reject regions under the i-maps. We will now give one example of how this is done. Suppose that we have an RNN that is modeling the finite state machine in Figure 1 (accepting 0*1*). Consider first the noiseless case. As stated in Theorem 3.1, the points of Q_3 that correspond to q_3 are those points x in I^N such that if the RNN is initialized with x, then for any given input string the RNN will produce the same output as M initialized to state q_3. That is, they are the points that will stay in the reject region for all input strings. We can describe this set symbolically as follows.21
21Remember that f^{-1}(S) for a function f and set S is the set of all points x such that f(x) is in S.
Let T_0 = R ∩ (0-map)^{-1}(R) ∩ (1-map)^{-1}(R), and let T_n = R ∩ (0-map)^{-1}(T_{n-1}) ∩ (1-map)^{-1}(T_{n-1}) for n ≥ 1. That is, T_0 is the set of all points in R that stay in R when either 0 or 1 is read, and T_n is the set of points in R that stay in R when any string of length n + 1 is read. Therefore, since Q_3 is the set of points in R that stay in R for all input strings, we have

Q_3 = ⋂_{n=0}^{∞} T_n.
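The nested intersection can be approximated numerically: a point belongs to T_0 ∩ … ∩ T_{k-1} exactly when it is in R and all of its images under input strings of length at most k stay in R. A minimal sketch, with illustrative (assumed) i-maps and reject region, rather than any maps from the paper:

```python
# Approximate membership in the finite intersection T_0 ∩ … ∩ T_{k-1}:
# x qualifies iff x is in R and every image of x under every input
# string of length <= depth remains in R.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical i-maps and reject region R = [0, 0.3] (illustrative only).
zero_map = lambda x: sigmoid(8.0 * x - 4.0)  # maps R into [0.018, 0.168] ⊂ R
one_map = lambda x: 0.5 * x                  # contracts toward 0, keeps R in R
in_R = lambda x: 0.0 <= x <= 0.3

def stays_in_R(x, depth):
    """True iff x is in R and stays in R under all strings of length <= depth."""
    if not in_R(x):
        return False
    if depth == 0:
        return True
    return (stays_in_R(zero_map(x), depth - 1) and
            stays_in_R(one_map(x), depth - 1))

# Grid approximation of Q_3 on [0, 1]: here R happens to be invariant under
# both maps, so every grid point of R survives.
q3_approx = [x / 100 for x in range(101) if stays_in_R(x / 100, depth=5)]
```

Increasing `depth` tightens the approximation toward the infinite intersection; for a map that eventually pushes part of R into A, those points would drop out at the corresponding depth.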
Now we can express Q_2 in terms of the sets Q_3, A, and R. Since points in Q_2 must be in A and are mapped into points in Q_3 under the 0-map, we have Q_2 ⊂ A ∩ (0-map)^{-1}(Q_3). The other condition on Q_2 is that all of its points must stay in A for all time under iteration of the 1-map. If we define B_0 = A ∩ (1-map)^{-1}(A), and B_n = A ∩ (1-map)^{-1}(B_{n-1}), then we have

Q_2 = A ∩ (0-map)^{-1}(Q_3) ∩ ⋂_{n=0}^{∞} B_n.

Similarly, if we define C_0 = A ∩ (0-map)^{-1}(A), and C_n = A ∩ (0-map)^{-1}(C_{n-1}), then we have

Q_1 = A ∩ (1-map)^{-1}(Q_2) ∩ ⋂_{n=0}^{∞} C_n.
The statement that the RNN models the DFA is equivalent to saying that these sets are not empty. This process would be identical for the case with noise, except that rather than using the accept and reject regions, we would be using the subsets of those sets satisfying the condition that each point is at least ε from the boundary of the appropriate set. For example, for a one-dimensional RNN, if the accept region is (0.9, 1.0] and ε = 0.02, then our modified set would be [0.92, 0.98]. Kolen (1994) suggests that the problem created by sensitive dependence on initial conditions is that the NDFA interpretation of the RNN dynamics leads to a system description as either an NDFA with few states or a DFA with many states. This is a general problem with measuring and providing a computational description of dynamics at the onset of unpredictability, as discussed in Crutchfield (1994a,b). We resolve this problem for RNNs performing FSM computation by requiring that the extracted FSM be deterministic, since it will be this DFA interpretation that will reflect the computation being performed by the RNN. In the DFA extraction case, any behavior that is a part of the computation22 will correspond to a cycle in the minimal DFA diagram, so

22Since the Q_j will not fill the state space when there is more than one of them, it is possible for the maps to have behaviors in the complement of the union of the Q_j that have nothing to do with the computation being performed.
there is no problem with sensitivity to initial conditions causing states to split. If, based upon nondeterminism generated by a single chaotic attractor, one gives an NDFA interpretation of an RNN performing FSM computation, then the extracted NDFA will not reflect the computation that the RNN is performing.
Theorem 5.2. Suppose that an RNN robustly models a DFA, M, and one of its l-maps has a behavior that is reachable; then the behavior corresponds to a cycle in M's minimal state diagram.
For the single behavior of an l-map to represent a cycle with more than one state, its l-map would have to satisfy the condition that (l-map)^n has n distinct attractors for some n > 1. Examples of such attractors23 are an attracting periodic orbit (as in our example RNN solution for L_3) or an attractor where each point in an attracting periodic orbit has bifurcated into an invariant circle with a dense orbit (which one would find after a Hopf bifurcation of the nth composition of the map with itself).

Proof. If the behavior is reachable, then there is some point, x, in the behavior that is in an ε-pseudo-orbit of some point in s-map(I^N) for some input string s. It follows that x is in Q_j for some j, else the RNN will not model M. Suppose further that there is no number of l's that M can read in order to return to q_j (i.e., q_j is not in an n-cycle for the l-map). But, since x is in a behavior of the l-map, it is ε-pseudoperiodic for the l-map, and hence there is an m and a pseudo-orbit of x such that the mth iterate of x is x, and hence in Q_j. It follows that if the RNN models M and M starts in state q_j, then after M reads m copies of l, M will have returned to q_j. It follows that q_j is in an m-cycle of M's minimal diagram, which is a contradiction.

The robustness assumption is not essential to the argument. It is sufficient that the attractor have a dense orbit. If we are not requiring robustness, then we also might want to consider more general invariant sets than attractors, and, again, a dense orbit is sufficient for this result. [x_{n+1} = 4x_n(1 - x_n), for example, does not have a chaotic attracting set, but rather has only a chaotic invariant set (with a dense orbit).] The examples that Kolen gives in Kolen (1994) are all maps with dense orbits in their invariant sets, and hence if they were a part of an RNN modeling a DFA, the invariant sets would correspond to cycles in the DFA diagram.
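The bracketed example can be checked numerically: the map x_{n+1} = 4x_n(1 - x_n) keeps the unit interval invariant, yet a typical orbit keeps spreading over [0, 1] instead of contracting onto an attracting set. A quick illustration (the starting point is arbitrary):

```python
# The logistic map at r = 4 maps [0, 1] into itself but has no attracting
# set; a typical orbit wanders through the chaotic invariant set.
def logistic(x):
    return 4.0 * x * (1.0 - x)

x = 0.1234  # arbitrary starting point
orbit = []
for _ in range(2000):
    x = logistic(x)
    orbit.append(x)

# Invariance: every iterate remains in the unit interval.
in_unit_interval = all(0.0 <= p <= 1.0 for p in orbit)
# Non-contraction: even the tail of the orbit still spreads across [0, 1].
tail = orbit[-500:]
spread = max(tail) - min(tail)
```

Since the maximum of 4x(1 - x) is 1 (at x = 0.5), invariance of [0, 1] holds exactly; the large spread of the late iterates is what distinguishes a chaotic invariant set with a dense orbit from an attracting periodic orbit, whose tail would collapse to a few points.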
Sensitivity to initial conditions itself does not pose any problem for DFA extraction. If the reachable points in a partition state are mapped to more than one other partition state and the RNN models some DFA, then the target partition states must be equivalent.

23Since we are technically concerned with ε-pseudoattractors rather than attractors, these examples are not technically valid, but if ε is small enough, then the behaviors that will correspond to these attractors in that case will be conceptually equivalent to the original attractors for the purpose of this discussion.
6 Conclusions
One approach to understanding complex systems (both natural and artificial) involves carefully observing the behavior of the system and then attempting to infer system structure. This top-down tactic is fundamental (in particular) to cognitive science, neuroethology, and psychophysically constrained neuroscience. Our work is consistent in spirit with these approaches, since we start out assuming that we have a complete understanding of the task or behavior24 (the task in this case is to perform an FSM computation), and then we infer structure in the mediating system. If we further have access to the relevant internal variables, we can then use our theory to check whether or not the system actually has the hypothesized FSM behavior, which could not be conclusively demonstrated through observing the input/output behavior of the system. So not only can we use our theory to understand the inner workings of the system (on some level) from carefully observing the behavior, but we can also use it, in conjunction with an understanding of the inner workings of the system, to understand and predict input/output behaviors of which the system is capable that we cannot hope to observe (since there are uncountably many distinct inputs). The result of Theorem 3.1 is to show that if an RNN has FSM behavior, then the RNN must necessarily partition its state space into disjoint regions that correspond to the states of the minimal DFA diagram. The way that the maps associated with giving the RNN constant inputs act on these regions in state space will similarly follow from the behavior and, as shown in Theorem 3.3, all cycles in the minimal DFA diagram will induce attractors for the corresponding maps. By Theorem 5.2, these will be the only attractors that enter into the computation.
Regarding the imprecision following from the fact that many DFA will accept the same language: it is true that an RNN may organize its state space to explicitly model an unminimized DFA (this is what made the comments on behaviors in the same class necessary), but the multiple regions of phase space will reduce (as will the multiple IP states of the unminimized DFA) to unique IP states of the unique (up to a state relabeling isomorphism) minimal DFA recognizing the language. Hence we now have an outline for using the top-down approach to understand the workings of RNNs that are producing behaviors equivalent to FSM computations. First, we gain a clear understanding of the problem to be solved by casting it in the form of a language recognition problem and by finding the minimal DFA that recognizes the language. (Of course, this is a very difficult problem in general.) Second, one uses the theory developed in this paper to analyze the DFA in a

24This assumption, of course, means that we first hypothesize the full task structure underlying an organism's behavior (which may be impossible to determine empirically without having access to all of the relevant internal variables), and then explore the complexity that each of the hypotheses would imply in the system.
manner that is suited to making connections between the problem and the dynamics of a family of dynamical systems that can model the DFA. Next, one studies the mapping between hidden unit activations and the production of the desired output (trivial for the RNNs studied here, but more complex when one moves to Elman networks, for example). Finally, one measures the RNN to make sure that it has the dynamics necessary for the computation. We feel that restricting our problem domain to modeling DFA does not severely limit the applicability of our theory, because many of the problems that RNNs have solved can be recast as DFA problems. Examples such as modeling Reber grammars as in Cleeremans et al. (1989), producing sequential behaviors as in Yamauchi and Beer (1994), robot navigation as in Tani (1995), and modeling a Turing machine as in Williams and Zipser (1989) are essentially DFA problems. Furthermore, as shown above, and also following the work of Siegelmann and Sontag (1992), DFA are the only class of computational models that can be robustly modeled by unaugmented RNNs. Another class of problems that RNNs can solve is pattern production [see Hertz et al. (1991), pp. 63-67], which has been studied by Tsung, Cottrell, and others (Williams and Zipser 1989; Tsung et al. 1990; Tsung and Cottrell 1993; Narendra and Parthasarathy 1990). It will be interesting to determine the existence of other categories of RNN computational behavior, and methods such as ours, possibly in conjunction with those described in Crutchfield and Young (1989b), may be useful for classifying other computational tasks in terms of the dynamics that are necessary for their production. The difficulty of learning a DFA from examples of desired behavior [see Giles et al. (1992)] serves as evidence that careful studies of behavior and useful classification of problems solved by organisms are nontrivial aspects of the analogous approach to understanding computation in nervous systems.
Once we find a task representation of the tasks performed by "intelligent" organisms (the analogue of the DFA diagram) that enables one to infer internal dynamics from studying the representation, we should be able to predict the organization of nervous system behaviors (on some level) by understanding the mapping from nervous system behavior to organism behavior, and by understanding the logical structure of the problem to be solved (nontrivial even in the case of DFA). One is then in a better position to measure nervous systems in order to find those aspects of nervous system behavior that are essential for computation as opposed to those essential for biological maintenance or some other function. The fact that more than one pattern of nervous system behavior may produce the same pattern of organism behavior [Kristan 1980 and Selverston and Moulins 1985 are examples] suggests that even with the added insight of knowing (on some level) what we seek, it is likely to still be nontrivial to map that onto actual nervous system functioning that is amenable to measurement. But there is always hope that
the nervous system has a small number of standard "tricks" or methods for solving the problems that it faces.

7 Appendix
Lemma 7.1. A behavior, A, of a continuous map, f, of I^N into itself is an open set.

Proof. Denote the closure of A by C. First we show that A is equal to B, the union of the ε-pseudoperiodic points of period n for all n ≥ 1 if the map is restricted to C. A is contained in B since A is a subset of R(f) and B is equal to R(f) when f is restricted to C. To see that B is contained in A, suppose that there is some point x in B that is not contained in A. Since B is contained in C and C is contained in the ε-pseudobasin of attraction of A, and since x is not in A, all of x's ε-pseudo-orbits are eventually contained in A. But if x is in B, then x is ε-pseudoperiodic for some n, and hence not all of its ε-pseudo-orbits are contained in A, which is a contradiction. It follows from these conclusions that A is equal to B, so it is left to show that B is open. B is the union of the ε-pseudoperiodic points of period n for all n ≥ 1, so it will be open if every ε-pseudoperiodic point has an open neighborhood containing only pseudoperiodic points. To see that this is true, let x be a point that is ε-pseudoperiodic of period n, and let {x, x_1, ..., x_{n-1}, x_n = x} be its ε-pseudoperiodic orbit. By the continuity of f, we may choose an open neighborhood, M, of x that is mapped within the ε-ball about x_1. So any point y in M has an ε-pseudo-orbit of the form {y, x_1, ..., x_{n-1}, x_n = x}, so if y is within ε of f(x_{n-1}), then y is ε-pseudoperiodic. Let L denote the ε-ball about f(x_{n-1}); then we have that x is in L, so x is in (L ∩ M). Hence, if we choose a y in (L ∩ M), it will be ε-pseudoperiodic. (L ∩ M) is the open neighborhood of ε-pseudoperiodic points that we seek, hence B is open.
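The ε-pseudo-orbit notion used throughout this Appendix is easy to operationalize: a finite sequence is an ε-pseudo-orbit of f when each entry lies within ε of the image of its predecessor. A small sketch (the contraction map is illustrative, not from the paper):

```python
# A sequence x_0, x_1, ..., x_n is an ε-pseudo-orbit of f when
# |x_{k+1} - f(x_k)| <= ε for every k; ε-pseudoperiodicity of period n
# additionally requires that the sequence return to its starting point.
def is_pseudo_orbit(f, seq, eps):
    return all(abs(seq[k + 1] - f(seq[k])) <= eps for k in range(len(seq) - 1))

def is_pseudo_periodic(f, seq, eps):
    return is_pseudo_orbit(f, seq, eps) and seq[-1] == seq[0]

f = lambda x: 0.5 * x  # illustrative contraction with fixed point 0

# A true orbit is an ε-pseudo-orbit for any ε >= 0 ...
assert is_pseudo_orbit(f, [0.8, 0.4, 0.2, 0.1], eps=0.0)
# ... and small perturbations at each step are allowed when ε > 0.
assert is_pseudo_orbit(f, [0.8, 0.41, 0.21, 0.1], eps=0.02)
# The fixed point is ε-pseudoperiodic of every period.
assert is_pseudo_periodic(f, [0.0, 0.0, 0.0], eps=0.0)
```

This matches the role ε plays in the proofs above: noise of size at most ε may be injected at every step, so every statement must hold over all such perturbed orbits, not just the noiseless one.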
Lemma 7.2. Let f be a continuous function of I^N into itself, let B be a closed subset of I^N with nonempty interior, and define T to be the set of all x in B such that no ε-pseudo-orbit of x ever leaves B. Then the closure of T is a trapping region.

Proof. To see that T is open, let x be in T and let N be an ε-neighborhood of f(x). Note that N is contained in B, since otherwise x could map out of B in one time step, which would contradict our assumption that x is in T, and N is in T since all points in N are in a possible ε-pseudo-orbit of x. By continuity of f, f^{-1}(N) ∩ B contains a neighborhood of x [since f^{-1}(N) is open] and f^{-1}(N) ∩ B is in T since f(f^{-1}(N) ∩ B) is contained in N, which is contained in T. Let C be the closure of T. To see that f(C) is contained in T, and hence that C is a trapping region, first observe that if x is in T, then f(x) is in T
by the definition of T. Now suppose that x is in the boundary of C and let N be an ε-neighborhood of f(x). Then f^{-1}(N) ∩ T is nonempty by the continuity of f and the openness of T, so there exists a y in f^{-1}(N) ∩ T. f(y) is in N, therefore f(x) lies in an ε-pseudo-orbit of y. Hence f(x) is in T since y is in T.
Lemma 7.3 (The Wandering Lemma). Suppose some closed set B with nonempty interior contains no behaviors. Then for any ε > 0 and for any point x in B there is some finite number N such that the Nth point in some ε-pseudo-orbit of x will leave B.

Proof. Assume there exists an x in B such that no ε-pseudo-orbit of x ever leaves B; then T (from the previous lemma) is nonempty, and therefore its closure is a trapping region, which is a contradiction since the existence of a trapping region implies the existence of at least one behavior.
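The Wandering Lemma can be illustrated numerically: under a map whose only recurrent behavior lies outside B, every orbit starting in B leaves it after finitely many steps. A sketch with an assumed drift map and B = [0, 0.5] (both chosen for illustration; here we follow only the noiseless pseudo-orbit):

```python
# Every point of a behavior-free closed set B escapes in finite time;
# here f drifts all points toward the fixed point 1, which lies outside
# B = [0, 0.5], so no orbit can remain in B forever.
def f(x):
    return x + 0.1 * (1.0 - x)  # equivalently 0.9*x + 0.1, fixed point at 1

def escape_time(x, in_B, max_steps=10_000):
    """Number of steps before the (noiseless) orbit of x leaves B."""
    for n in range(max_steps):
        if not in_B(x):
            return n
        x = f(x)
    return None  # never escaped: would contradict the lemma's hypothesis

in_B = lambda x: 0.0 <= x <= 0.5
times = [escape_time(k / 10, in_B) for k in range(6)]  # starts 0.0 ... 0.5
```

Since the orbit from start s is x_k = 1 - (1 - s) * 0.9^k, the escape time shrinks as the start moves closer to the boundary of B, and every start escapes.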
Theorem 7.1. Assume that the accept and reject regions are connected; then a one-unit RNN with sigmoidal discriminant function cannot recognize L_3.

Proof. It is sufficient to consider the noiseless case, since the RNN must recognize L_3 independent of which of the RNN's possible ε-pseudo-orbits is followed, and the noiseless case is a possible pseudo-orbit. For subsets U and V of the interval we say that U < V if and only if for all u in U and v in V, u < v. If A and R are connected, then it follows that A < R or R < A, since they are disjoint by definition. We can assume without loss of generality that R < A (as in our examples). By Theorem 3.1, if an RNN recognizes L_3 then it must have sets Q_j corresponding to the states q_j of the minimal DFA which recognizes L_3, shown in Figure 5. From this and the fact that R < A it follows that Q_3 < Q_1 and Q_3 < Q_2. To see that either Q_1 < Q_2 or Q_2 < Q_1, assume that there are points q_1^1 and q_1^2 in Q_1 and q_2^1 in Q_2 such that q_1^1 < q_2^1 < q_1^2. For strictly increasing maps f, if x < y then f(x) < f(y), and for strictly decreasing maps g, if x < y then g(x) > g(y). Since we are assuming that our RNN has sigmoidal discriminant functions, each i-map is either strictly increasing or strictly decreasing [they cannot be constant (other than the e-map) because they must move points in the accept region to other points in the accept region and also move points in the reject region into the reject region]. It follows that if the 0-map is strictly increasing then 0-map(q_1^1) < 0-map(q_2^1) < 0-map(q_1^2). But 0-map(q_1^2) is in Q_3 and 0-map(q_2^1) is in Q_2, and hence 0-map(q_2^1) > 0-map(q_1^2), which is a contradiction. If the 0-map is strictly decreasing, then 0-map(q_1^1) > 0-map(q_2^1), but 0-map(q_1^1) is in Q_3 and 0-map(q_2^1) is in Q_2, and hence 0-map(q_1^1) < 0-map(q_2^1), which is a contradiction. Without loss of generality we can now assume that Q_1 > Q_2, since a symmetric argument can be made if we assume Q_2 > Q_1. It follows from the assumption that Q_1 > Q_2 that the 1-map is strictly decreasing, since if we take q_1^1 in Q_1 and q_2^1 in Q_2 then q_1^1 > q_2^1, but 1-map(q_1^1) < 1-map(q_2^1) since 1-map(q_1^1) is in Q_3 and 1-map(q_2^1) is in Q_1 and Q_3 < Q_1. Now if we take some point q_3^1 in Q_3, then since the 1-map is strictly decreasing we will have that 1-map(q_2^1) < 1-map(q_3^1), but 1-map(q_2^1) is in Q_1, which is in A, and 1-map(q_3^1) is in Q_3, which is in R, so A > R implies that 1-map(q_2^1) > 1-map(q_3^1), which contradicts our assumption that Q_1 > Q_2. Hence, the theorem is proved, since our sets Q_1 and Q_2 cannot exist if A and R are connected and our RNN has a sigmoidal discriminant function.
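The monotonicity fact the proof turns on can be checked numerically for a hypothetical one-unit i-map x ↦ σ(wx + b): such a map is strictly increasing when w > 0 and strictly decreasing when w < 0, since the sigmoid σ is itself strictly increasing (the weights below are arbitrary):

```python
# Each i-map of a one-unit sigmoidal RNN has the form x -> sigmoid(w*x + b),
# so it is strictly monotone; its direction is the sign of the weight w.
import math

def i_map(w, b):
    return lambda x: 1.0 / (1.0 + math.exp(-(w * x + b)))

def is_increasing(f, n=1000):
    """Check strict increase of f on a grid over [0, 1]."""
    xs = [k / n for k in range(n + 1)]
    return all(f(xs[k]) < f(xs[k + 1]) for k in range(n))

inc = i_map(w=6.0, b=-3.0)   # hypothetical i-map with positive weight
dec = i_map(w=-6.0, b=3.0)   # hypothetical i-map with negative weight
```

A grid check is of course not a proof, but it makes the case analysis in the proof (increasing versus decreasing 0-map and 1-map) concrete: no choice of sign can order the images of Q_1, Q_2, and Q_3 consistently when A and R are connected intervals.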
Lemma 7.4. Let U and V be disjoint compact subsets of I^N; then if we divide I^N into smaller N-dimensional cubes, called partition states, having sides of length 2^{-m}, there exists a finite m such that none of the partition states contains points in both U and V.

Proof. Consider a sequence of points generated by increasing the partition parameter by powers of two and then choosing one point in U from any partition state that contains points in both U and V. If we can define an infinite sequence this way, then since U is compact our sequence will have an accumulation point. But, since the distance from points in the sequence to points in V goes to zero by the fact that they are chosen from partition states of vanishing size, that accumulation point is also an accumulation point of V, contradicting the assumption that U and V are disjoint. So we have that there are only a finite number of partition parameters that yield partition states containing points in both U and V.
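For finite point sets, the m guaranteed by Lemma 7.4 can be found constructively by doubling the partition parameter until no cell contains points of both sets; the one-dimensional sets below are illustrative:

```python
# Find the smallest m such that the dyadic partition of [0, 1] into cells of
# side 2**-m separates two disjoint finite point sets (a finite-set
# illustration of Lemma 7.4 in dimension N = 1).
def cell(x, m):
    """Index of the partition state of side 2**-m containing x."""
    return min(int(x * (1 << m)), (1 << m) - 1)

def separating_m(U, V, max_m=40):
    for m in range(1, max_m + 1):
        if not ({cell(x, m) for x in U} & {cell(x, m) for x in V}):
            return m
    raise ValueError("sets not separated up to max_m (are they disjoint?)")

U = [0.20, 0.21]
V = [0.30, 0.80]
m = separating_m(U, V)
```

The smaller the gap between the two sets, the larger the required m, which mirrors the proof of Theorem 5.1: the partition parameter needed for extraction is governed by the separation of the Q_j.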
Acknowledgments

I would like to thank Gary Cottrell, Kimberly Claffy, Steven Swerling, Orna Gil, Jim Crutchfield, and several anonymous reviewers for making helpful comments during the preparation of this work. I would like to thank Michael Freedman for his inspiration and guidance, and I would like to thank Arnold J. Mandell for his guidance and nurturing and for many useful discussions over the last several years. This work was supported by an NDSEG Graduate Fellowship, an INCOR grant under Professor Michael Freedman, and the San Diego Supercomputer Center. The author has also been supported in part by an Alfred P. Sloan Postdoctoral Fellowship in Computational Neuroscience.
References

Arbib, M. A., and Zeiger, H. P. 1969. On the relevance of abstract algebra to control theory. Automatica 5, 589-606.
Beer, R. D. 1996. On the dynamics of small continuous-time recurrent neural networks. Adaptive Behavior 3(4), 471-511.
Bowen, R. 1975. Equilibrium States and the Ergodic Theory of Anosov Diffeomorphisms. Lecture Notes in Mathematics, Vol. 470. Springer-Verlag, Berlin.
Casey, M. 1993. Computation dynamics in discrete-time recurrent neural networks. In UCSD's Institute for Neural Computation's Proceedings of the Annual Research Symposium, June, 78-95.
Casey, M. 1995a. Relaxing the symmetric weight condition for convergent dynamics in discrete-time recurrent networks. Institute for Neural Computation Tech. Rep. Series No. INC-9503, April.
Casey, M. 1995b. Computation dynamics in discrete-time dynamical systems. Ph.D. Thesis, Department of Mathematics, UC San Diego, March.
Cleeremans, A., McClelland, J. L., and Servan-Schreiber, D. 1989. Finite state automata and simple recurrent networks. Neural Comp. 1, 372-381.
Crutchfield, J. P. 1994a. Observing complexity and the complexity of observation. In Inside versus Outside, H. Atmanspacher, ed., pp. 235-272. Springer-Verlag, Berlin.
Crutchfield, J. P. 1994b. The calculi of emergence: Computation, dynamics, and induction. Physica D 75, 11-54.
Crutchfield, J. P., and Young, K. 1989a. Computation at the onset of chaos. In Complexity, Entropy, and the Physics of Information, W. Zurek, ed. Addison-Wesley, Reading, MA.
Crutchfield, J. P., and Young, K. 1989b. Inferring statistical complexity. Phys. Rev. Lett. 63, 105-108.
Cummins, F. 1993. Representation of temporal patterns in recurrent neural networks. Preprint, submitted to The 15th Annual Conference of the Cognitive Science Society.
Devaney, R. L. 1987. An Introduction to Chaotic Dynamical Systems. Addison-Wesley, New York.
Elman, J. L. 1989. Representation and structure in connectionist models. Tech. Rep. 8903, Center for Research in Language, University of California, San Diego, La Jolla, CA, August.
Giles, C. L., Sun, G. Z., Chen, H. H., Lee, Y. C., and Chen, D. 1990. Higher order recurrent networks and grammatical inference. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., p. 380. Morgan Kaufmann, San Mateo, CA.
Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., and Lee, Y. C. 1992. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Comp. 4, 393-405.
Grebogi, C., Ott, E., and Yorke, J. A. 1987. Chaos, strange attractors, and fractal basin boundaries in nonlinear dynamics. Science 238, 632-638.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Menlo Park, CA.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Kalman, R. E., Falb, P. L., and Arbib, M. A. 1969. Topics in Mathematical Systems Theory. McGraw-Hill, New York.
Kolen, J. F. 1994. Fool's gold: Extracting finite state machines from recurrent network dynamics. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 501-508. Morgan Kaufmann, San Mateo, CA.
Kristan, W. B., Jr. 1980. Generation of rhythmic motor patterns. In Information Processing in the Nervous System, H. M. Pinsker and W. D. Willis, Jr., eds., pp. 241-261. Raven Press, New York.
Kuan, C. M., Hornik, K., and White, H. 1994. A convergence result for learning in recurrent neural networks. Neural Comp. 6, 420.
Lewis, H. R., and Papadimitriou, C. H. 1981. Elements of the Theory of Computation. Prentice-Hall, Englewood Cliffs, NJ.
Milnor, J. 1985. On the concept of attractor. Commun. Math. Phys. 99, 177-195.
Munkres, J. R. 1975. Topology: A First Course. Prentice-Hall, Englewood Cliffs, NJ.
Narendra, K. S., and Parthasarathy, K. 1990. Identification and control of dynamical systems using neural networks. IEEE Transact. Neural Networks 1, 4-27.
Nerode, A. 1958. Linear automata transformations. Proc. Am. Math. Soc. 9, 541-544.
Omlin, C., and Giles, C. 1994. Constructing deterministic finite-state automata in sparse recurrent neural networks. IEEE Int. Conf. Neural Networks (ICNN'94), 1732-1737.
Pollack, J. B. 1991. The induction of dynamical recognizers. Machine Learn. 7, 227-252.
Selverston, A. I., and Moulins, M. 1985. Oscillatory neural networks. Annu. Rev. Physiol. 47, 29-48.
Shub, M. 1987. Global Stability of Dynamical Systems. Springer-Verlag, Berlin.
Siegelmann, H. T., and Sontag, E. D. 1992. On the computational power of neural networks. Proc. Fifth ACM Workshop on Computational Learning Theory, Pittsburgh, PA.
Siegelmann, H. T., Sontag, E. D., and Giles, C. L. 1992. The complexity of language recognition by neural networks. In Algorithms, Software, Architecture: Information Processing 92, J. van Leeuwen, ed., Vol. 1, pp. 329-335. Elsevier Science, Amsterdam.
Tani, J. 1995. Essential dynamical structure in learnable autonomous robots. In Advances in Artificial Life, F. Moran, A. Moreno, J. J. Merelo, and P. Chacon, eds., Lecture Notes in Artificial Intelligence 929, pp. 721-732. Springer-Verlag, New York.
Tino, P., Horne, B. G., and Giles, C. L. 1995a. Fixed points in two-neuron discrete time recurrent networks: Stability and bifurcation considerations. Tech. Rep. UMIACS-TR-95-51 and CS-TR-3461, Institute for Advanced Computer Studies, University of Maryland, College Park, MD.
Tino, P., Horne, B. G., Giles, C. L., and Collingwood, P. C. 1995b. Finite state machines and recurrent neural networks: Automata and dynamical systems approaches. Tech. Rep. UMIACS-TR-95-1 and CS-TR-3396, Institute for Advanced Computer Studies, University of Maryland, College Park, MD.
Tsung, F.-S., and Cottrell, G. W. 1993. Learning temporal signals in phase space. Int. Symp. on Nonlinear Theory and Its Applications, Hawaii.
Tsung, F.-S., Cottrell, G. W., and Selverston, A. I. 1990. Some experiments on learning stable network oscillations. International Joint Conference on Neural Networks, San Diego. IEEE, New York.
1178
Mike Casey
Turing, A. M. 1936. On computable numbers, with an application to the Entscheidungs problem. Pros. Loridori Mntli. Sac.. Ser. 2-42, 230-265. Wang, X. 1991. Period doublings to chaos in a simple neural network: An analytical proof. Cavi/ilrs S!/st. 5(4), -125441. Watrous, R. L., and Kuhn, G. M. Induction of finite-state languages using second-order recurrent networks. Nrirrnl Couip. 4, 406314. Wiggins, s. 1990. 11i t rod I I c.ti~iito Appl;d KofiliiiiY71’Dy1117ifi i i n l sI{s tcv f is nr i d c/i17os. Springer-Verlag, New Ywk. Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually run1, 270-280. ning fully recurrent networks. N~irlt7lCCJ~H\J. Yamauchi, B., and Beer, R. D. 1YY4. Sequential beha\ior and learning in evolved d\.namical neural networks. A(fnpt.B t ~ l i i i i ~2(3), . 219-246.
This article has been cited by: 1. Hongjie Li, Dong Yue. 2010. Synchronization of Markovian jumping stochastic complex networks with distributed time delays and probabilistic interval discrete time-varying delays. Journal of Physics A: Mathematical and Theoretical 43:10, 105101. [CrossRef] 2. Yurong Liu, Zidong Wang, Xiaohui Liu. 2009. On Global Stability of Delayed BAM Stochastic Neural Networks with Markovian Switching. Neural Processing Letters 30:1, 19-35. [CrossRef] 3. Yurong Liu, Zidong Wang, Jinling Liang, Xiaohui Liu. 2009. Stability and Synchronization of Discrete-Time Markovian Jumping Neural Networks With Mixed Mode-Dependent Time Delays. IEEE Transactions on Neural Networks 20:7, 1102-1116. [CrossRef] 4. Yurong Liu, Zidong Wang, Xiaohui Liu. 2008. On delay-dependent robust exponential stability of stochastic neural networks with mixed time delays and Markovian switching. Nonlinear Dynamics 54:3, 199-212. [CrossRef] 5. Yurong Liu, Zidong Wang, Jinling Liang, Xiaohui Liu. 2008. Synchronization and State Estimation for Discrete-Time Complex Networks With Distributed Delays. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 38:5, 1314-1325. [CrossRef] 6. Henrik Jacobsson. 2006. The Crystallizing Substochastic Sequential Machine Extractor: CrySSMExThe Crystallizing Substochastic Sequential Machine Extractor: CrySSMEx. Neural Computation 18:9, 2211-2255. [Abstract] [PDF] [PDF Plus] 7. Peter Tiňo , Ashely J. S. Mills . 2006. Learning Beyond Finite Memory in Recurrent Networks of Spiking NeuronsLearning Beyond Finite Memory in Recurrent Networks of Spiking Neurons. Neural Computation 18:3, 591-613. [Abstract] [PDF] [PDF Plus] 8. Henrik Jacobsson . 2005. Rule Extraction from Recurrent Neural Networks: ATaxonomy and ReviewRule Extraction from Recurrent Neural Networks: ATaxonomy and Review. Neural Computation 17:6, 1223-1263. [Abstract] [PDF] [PDF Plus] 9. A. Vahed, C. W. Omlin. 2004. 
A Machine Learning Method for Extracting Symbolic Knowledge from Recurrent Neural NetworksA Machine Learning Method for Extracting Symbolic Knowledge from Recurrent Neural Networks. Neural Computation 16:1, 59-71. [Abstract] [PDF] [PDF Plus] 10. P. Tino, M. Cernansky, L. Benuskova. 2004. Markovian Architectural Bias of Recurrent Neural Networks. IEEE Transactions on Neural Networks 15:1, 6-15. [CrossRef] 11. Jiří Šíma , Pekka Orponen . 2003. General-Purpose Computation with Neural Networks: A Survey of Complexity Theoretic ResultsGeneral-Purpose
Computation with Neural Networks: A Survey of Complexity Theoretic Results. Neural Computation 15:12, 2727-2778. [Abstract] [PDF] [PDF Plus] 12. Peter Tiňo , Barbara Hammer . 2003. Architectural Bias in Recurrent Neural Networks: Fractal AnalysisArchitectural Bias in Recurrent Neural Networks: Fractal Analysis. Neural Computation 15:8, 1931-1957. [Abstract] [PDF] [PDF Plus] 13. Jiří Šíma , Pekka Orponen . 2003. Continuous-Time Symmetric Hopfield Nets Are Computationally UniversalContinuous-Time Symmetric Hopfield Nets Are Computationally Universal. Neural Computation 15:3, 693-733. [Abstract] [PDF] [PDF Plus] 14. Stephen José Hanson , Michiro Negishi . 2002. On the Emergence of Rules in Neural NetworksOn the Emergence of Rules in Neural Networks. Neural Computation 14:9, 2245-2268. [Abstract] [PDF] [PDF Plus] 15. Paul Rodriguez . 2001. Simple Recurrent Networks Learn Context-Free and Context-Sensitive Languages by CountingSimple Recurrent Networks Learn Context-Free and Context-Sensitive Languages by Counting. Neural Computation 13:9, 2093-2118. [Abstract] [PDF] [PDF Plus] 16. Peter Tiňo , Bill G. Horne , C. Lee Giles . 2001. Attractive Periodic Sets in Discrete-Time Recurrent Networks (with Emphasis on Fixed-Point Stability and Bifurcations in Two-Neuron Networks)Attractive Periodic Sets in Discrete-Time Recurrent Networks (with Emphasis on Fixed-Point Stability and Bifurcations in Two-Neuron Networks). Neural Computation 13:6, 1379-1414. [Abstract] [PDF] [PDF Plus] 17. F.A. Gers, E. Schmidhuber. 2001. LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Transactions on Neural Networks 12:6, 1333-1340. [CrossRef] 18. Jiří Šíma , Pekka Orponen , Teemu Antti-Poika . 2000. On the Computational Complexity of Binary and Analog Symmetric Hopfield NetsOn the Computational Complexity of Binary and Analog Symmetric Hopfield Nets. Neural Computation 12:12, 2965-2989. [Abstract] [PDF] [PDF Plus] 19. Rafael C. Carrasco , Mikel L. 
Forcada , M. Ángeles Valdés-Muñoz , Ramón P. Ñeco . 2000. Stable Encoding of Finite-State Machines in Discrete-Time Recurrent Neural Nets with Sigmoid UnitsStable Encoding of Finite-State Machines in Discrete-Time Recurrent Neural Nets with Sigmoid Units. Neural Computation 12:9, 2129-2174. [Abstract] [PDF] [PDF Plus] 20. S. Lawrence, C.L. Giles, S. Fong. 2000. Natural language grammatical inference with recurrent neural networks. IEEE Transactions on Knowledge and Data Engineering 12:1, 126-140. [CrossRef] 21. Mark Steedman. 1999. Connectionist Sentence Processing in Perspective. Cognitive Science 23:4, 615-634. [CrossRef]
22. Wolfgang Maass , Eduardo D. Sontag . 1999. Analog Neural Nets with Gaussian or Other Common Noise Distributions Cannot Recognize Arbitrary Regular LanguagesAnalog Neural Nets with Gaussian or Other Common Noise Distributions Cannot Recognize Arbitrary Regular Languages. Neural Computation 11:3, 771-782. [Abstract] [PDF] [PDF Plus] 23. P. Tino, M. Koteles. 1999. Extracting finite-state representations from recurrent neural networks trained on chaotic symbolic sequences. IEEE Transactions on Neural Networks 10:2, 284-302. [CrossRef] 24. B. Apolloni, I. Zoppis. 1999. Sub-symbolically managing pieces of symbolical functions for sorting. IEEE Transactions on Neural Networks 10:5, 1099-1122. [CrossRef] 25. C.L. Giles, C.W. Omlin, K.K. Thornber. 1999. Equivalence in knowledge representation: automata, recurrent neural networks, and dynamical fuzzy systems. Proceedings of the IEEE 87:9, 1623-1640. [CrossRef] 26. Mike Casey . 1998. Correction to Proof That Recurrent Neural Networks Can Robustly Recognize Only Regular LanguagesCorrection to Proof That Recurrent Neural Networks Can Robustly Recognize Only Regular Languages. Neural Computation 10:5, 1067-1069. [Abstract] [PDF] [PDF Plus] 27. Wolfgang Maass , Pekka Orponen . 1998. On the Effect of Analog Noise in Discrete-Time Analog ComputationsOn the Effect of Analog Noise in Discrete-Time Analog Computations. Neural Computation 10:5, 1071-1095. [Abstract] [PDF] [PDF Plus] 28. M. Gori, M. Maggini, E. Martinelli, G. Soda. 1998. Inductive inference from noisy examples using the hybrid finite state filter. IEEE Transactions on Neural Networks 9:3, 571-575. [CrossRef]
Communicated by Garrison Cottrell
Using Bottlenecks in Feedforward Networks as a Dimension Reduction Technique: An Application to Optimization Tasks

Janet Wiles,* Paul Bakker,* Adam Lynton,† Michael Norris,† Sean Parkinson,† Mark Staples,† Alan Whiteside†

*Departments of Computer Science and Psychology and †Department of Computer Science, The University of Queensland, Queensland 4072, Australia
1 Introduction
For over 10 years now, connectionist networks have been applied to optimization tasks; most famously, to the Traveling Salesperson Problem (TSP; Hopfield and Tank 1985). The most popular approach is to express the task to be optimized in terms of an energy function, and then use network training algorithms to find weight configurations that minimize that function (Aarts and Stehouwer 1993). In this note, we investigate a novel application of connectionist networks to the TSP that exploits, instead, their ability to find compact, feature-preserving representations on hidden layers that have a restricted number of hidden units (Cottrell et al. 1987). In this formulation, the TSP can be conceptualized as a dimension reduction task: to find a path on an M-dimensional map of cities, one must reduce that map to a one-dimensional ordered list in which neighboring cities are listed close together. This list constitutes a path that visits every city exactly once. Here, a network with a one-unit bottleneck was trained on maps of cities to determine if suitable paths could be extracted from the feature-preserving representation formed on the bottleneck layer. Encouraging results are reported on 10- and 30-city maps.

Neural Computation 8, 1179-1183 (1996) © 1996 Massachusetts Institute of Technology
2 Architecture, Simulations and Results
An N-1-k-N network was applied to the task of reducing a map of N cities to a one-dimensional path. Input patterns were locally encoded city designators; each city i is represented by a vector of 0s with a 1 in the ith position. The target vectors were constructed from a matrix of intercity (Euclidean) distances. The target vector for the ith city contains the distances from city i to all other cities. With this construction, the position of cities in a map is given to the network implicitly through the matrix of distance values provided by the target vectors. Our assumption was that the network would construct an internal representation in which neighboring cities in the original map would be clustered together on the strength of the similarity of their distance vectors to all other cities. The clustering together of neighboring cities is one obvious prerequisite for finding a short path. The network¹ was trained on randomly generated (2-D) maps of 10 and 30 cities. Twenty simulations were run on both sets of cities, for 20,000 epochs each. Paths were extracted from the network by ordering the cities in terms of the level of activation they elicited on the single hidden unit bottleneck. The mean path length found for 10 cities was 2.59, while for 30 cities it was 5.55² (see Fig. 1).

3 Discussion
The results demonstrate that single-unit bottleneck architectures find reasonable paths on small numbers of cities [cf. the Held-Karp lower bound (Held and Karp 1970) for 10 and 30 cities is 2.42 and 4.19, respectively]. However, the results do not establish the viability of this approach for optimization problems per se. What the results do demonstrate is the effect of output similarity on internal structure: since the input representations were orthogonal, the structure in the internal layers received no guidance from the input vectors. The target structure alone was sufficient to induce significantly reduced internal representations two layers away from the outputs, and these reduced representations reflected the target similarity structure. It is often stated that "similar inputs lead to similar outputs," but these results demonstrate that "similar outputs lead to similar internal representations."³

¹The number of units in the second hidden layer, k, was set to 4 for the 10-city network, and 10 for the 30-city network. These values were chosen from successful pilot runs.
²This technique actually finds the Hamiltonian path on all the cities, but this path is easily extendable to the Traveling Salesperson's route by adding the arc between the last and first city. The path lengths reported include the length of this final arc.
³We thank Gary Cottrell for suggesting this interpretation of the results.
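The extraction step described above (ordering cities by the activation they elicit on the single bottleneck unit, then closing the Hamiltonian path with the arc from the last city back to the first) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the activation values are hypothetical stand-ins for a trained N-1-k-N network.

```python
import math

def extract_path(bottleneck_activations):
    """Order city indices by the activation each city elicits on the
    single bottleneck unit; the ordering is the candidate path."""
    return sorted(range(len(bottleneck_activations)),
                  key=lambda i: bottleneck_activations[i])

def tour_length(order, coords):
    """Length of the Hamiltonian path given by `order`, plus the final
    arc from the last city back to the first, as in the footnote."""
    d = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    n = len(order)
    return sum(d(coords[order[i]], coords[order[(i + 1) % n]])
               for i in range(n))

# Hypothetical activations for a 5-city map.
acts = [0.91, 0.12, 0.55, 0.08, 0.73]
order = extract_path(acts)   # [3, 1, 2, 4, 0]
```

Because every city elicits exactly one activation value, any such ordering is a legal path, which is the property the Discussion emphasizes.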
Figure 1: Typical paths found on randomly generated maps of 10 cities (left) and 30 cities (right).

As a technique for finding reduced representations, advantages of the current approach include the following:

- The bottleneck network is guaranteed to return a legal path, no matter when computation is halted. This property follows from the fact that the hidden layer is naturally constrained to represent every city exactly once. A legal (but suboptimal) path is calculated in the first epoch of training, and this path is gradually refined by moving neighboring cities closer together in activation space.
- In contrast, the elastic net method (Durbin and Willshaw 1987) is not intended to return a legal path until the "relaxation" process is complete. The path initially traverses only a subset of clustered cities, and is gradually "stretched" out until all cities are included on the path.
- The original connectionist approach advocated by Hopfield and Tank (1985), on the other hand, considers illegal paths as part of its search, and is not constrained to return a legal path even in the limit.
Although the simulations reported here were only on two-dimensional city maps, this technique can be used to find paths for cities situated in a map of arbitrarily large dimensionality, without adjustment of the network architecture or training technique. The learning signal simply requires a matrix of intercity distances, and this can be extracted from any number of dimensions in an equivalent manner. The success of the network in reducing the city-map representations to one dimension demonstrates that the assumption by DeMers and Cottrell (1993) that an encoding layer must precede a bottleneck layer
is incorrect: here, dimension reduction was successfully achieved without requiring encoding between the input and bottleneck layers. Some facets of this approach that suggest further work are as follows:

- In the current configuration, the method does not scale well. The size of the input and output layers increases linearly with the number of cities. However, experiments with alternative output-layer configurations have shown promise. Similar results on the 10- and 30-city tasks were found using just 2 output units (which encoded each city's Cartesian coordinates on the 2-D city map).
- While good results were found on the 10- and 30-city maps, the paths found on larger city maps have a tendency to contain "zigzags." This causes the network to return path lengths significantly longer than those found by the elastic net method (Durbin and Willshaw 1987) on maps of 50 and 100 cities. The appearance of zigzags is possible because no explicit constraint has been built in to minimize path length. This suggests a clear avenue for future refinement of this approach.

Acknowledgments
We thank Raymond Lister for extensive discussions on the TSP and for access to his studies on N-2-N encoders, Yoshiro Miyata for use of his PlaNet simulator, and Garrison Cottrell for his thoughtful comments on earlier drafts. This research was financially supported by an APRA scholarship to Paul Bakker and an ARC grant to Janet Wiles.

References

Aarts, E. H. L., and Stehouwer, H. P. 1993. Neural networks and the travelling salesman problem. In Proceedings of the International Conference on Artificial Neural Networks (ICANN'93), Amsterdam, S. Gielen and B. Kappen, eds., pp. 950-955. Springer-Verlag, Berlin.
Cottrell, G. W., Munro, P., and Zipser, D. 1987. Learning internal representations from gray-scale images: An example of extensional programming. In The Ninth Annual Conference of the Cognitive Science Society, pp. 462-473. Lawrence Erlbaum, Hillsdale, NJ.
DeMers, D., and Cottrell, G. 1993. Non-linear dimensionality reduction. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 580-587. Morgan Kaufmann, San Mateo, CA.
Durbin, R., and Willshaw, D. 1987. An analogue approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689-691.
Held, M., and Karp, R. M. 1970. The traveling-salesman problem and minimum spanning trees: Part 1. Operations Res. 18, 1138-1162.
Hopfield, J. J., and Tank, D. W. 1985. "Neural" computation of decisions in optimization problems. Biol. Cybern. 52, 141-152.
Received December 13, 1993; accepted January 22, 1996.
Communicated by William Bialek
Temporal Precision of Spike Trains in Extrastriate Cortex of the Behaving Macaque Monkey

Wyeth Bair* Christof Koch

Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 91125 USA
How reliably do action potentials in cortical neurons encode information about a visual stimulus? Most physiological studies do not weigh the occurrences of particular action potentials as significant but treat them only as reflections of average neuronal excitation. We report that single neurons recorded in a previous study by Newsome et al. (1989; see also Britten et al. 1992) from cortical area MT in the behaving monkey respond to dynamic and unpredictable motion stimuli with a markedly reproducible temporal modulation that is precise to a few milliseconds. This temporal modulation is stimulus dependent, being present for highly dynamic random motion but absent when the stimulus translates rigidly.

1 Introduction
Because the mean firing frequency of a neuron, typically averaged over a fraction of a second or more, in response to a visual stimulus is reproducible under identical stimulus conditions and varies predictably with stimulus parameters, it is widely held to be the primary variable relating neuronal response to visual experience (Adrian 1928; Lettvin et al. 1959; Werner and Mountcastle 1963; Barlow 1972; Henry et al. 1973). Accordingly, many studies hold a stimulus parameter constant during an experiment, measure large variations in firing frequency across different trials and high within-trial variation in interspike intervals, and conclude that the microstructure of spike trains is essentially random (Schiller et al. 1976; Heggelund and Albus 1978; Tolhurst et al. 1981, 1983; Vogels et al. 1989; Softky and Koch 1993; Shadlen and Newsome 1994). A few studies have emphasized that cells in mammalian visual cortex responding to moving patterns show stimulus-locked temporal modulation, sometimes referred to as "grain" response (Tomko and Crapper 1974; Hammond and MacKay 1977; Gulyas et al. 1987; Snowden et al. 1992). However,

*To whom all correspondence should be addressed, at: Howard Hughes Medical Institute and Center for Neural Science, New York University, 4 Washington Place, Rm. 809, New York, NY 10003.

Neural Computation 8, 1185-1202 (1996) © 1996 Massachusetts Institute of Technology
the time scale and stimulus dependency of this type of modulation have not been characterized at the trial-to-trial level. Stimulus-locked modulation has also been shown to exist in visual cortex for static patterns (Richmond et al. 1987, 1990). Studies that have explicitly addressed the temporal frequency profiles of LGN and visual cortical neurons (Derrington and Lennie 1984; Foster et al. 1985; Lee et al. 1989; Levitt et al. 1994) have typically used drifting sinusoidal gratings, stimuli that rarely induce temporal modulation in the output of MT cells (Sclar et al. 1990; J. Anthony Movshon, personal communication). We analyzed data from an earlier series of experiments by Newsome et al. (1989; see also Britten et al. 1992) that linked the response of well-isolated single units in extrastriate area MT to the psychophysical performance of macaque monkeys. We show that the dynamic random dot stimuli employed in these studies, largely composed of spatially and temporally broad-band noise (Britten et al. 1993), can produce highly modulated responses in MT neurons, and we present a brief characterization of the neuronal response in terms of temporal precision, reliability, and power spectra. The present study makes use of data that were collected many years ago, and the analysis is limited in some ways because the randomization seeds for the stimuli (see Methods) were not stored. Random dot stimuli have been used for the study of motion perception since the early 1970s, and yet even recent electrophysiological studies of the response of MT neurons to random dot stimuli (Snowden et al. 1992; Britten et al. 1993) have not found or not emphasized the temporal aspects of the neuronal response.
We therefore think it is useful to point out the temporal properties of the responses to a random dot stimulus as an aid for the development of further experiments that could make use of reverse correlation or stimulus reconstruction techniques (McLean and Palmer 1989; Bialek et al. 1991).

2 Methods
Details of the stimulus generation and experimental paradigm are summarized here. For a full account of the electrophysiological methods, see Britten et al. (1992).

2.1 Experimental Procedures. The dynamic dot stimulus consisted of 0.1° diameter dots plotted asynchronously for 2 sec at a density of 16.7 dots/degree²/sec on a large-screen CRT monitor (Hewlett-Packard 1321B or XYtron A21-63; P4 phosphor, 0.2 cd/m² mean luminance). The dots were illuminated for 150 μsec. The diameter of the circular aperture (between 5° and 15°) in which the dots appeared was optimized for the receptive field of the neuron, so neurons having smaller receptive fields were stimulated with fewer total dots. A coherent motion signal was
introduced in the display by altering the probability c that a given dot would be displayed with a particular spatial and temporal offset relative to a previous dot. The temporal offset was 45 msec, and the spatial offset was adjusted to match the velocity preference of the neuron (or to oppose it by 180° for null direction trials). The probability c is referred to as the "motion coherence level." The probability that a sequence of n dots would be generated that carried the motion signal was c^(n-1). The coherence is varied from 0 to 1. At c = 0, all dots are plotted randomly, while at c = 1 the display appears to be a rigid sparse dot pattern that translates at the neuron's preferred speed and direction. The exact pattern of dots in a particular stimulus was determined by the seed value used to initialize the random number generator. The seed was occasionally set to a predetermined constant so that many responses to the same dot pattern could be recorded, but identical stimuli were always interleaved with stimuli having other seeds, at other coherence levels, or moving in the other direction. All trials having a constant seed were grouped together and analyzed, but the values of the constant seeds were not stored, precluding stimulus reconstruction or the use of reverse correlation methods. During the 2 sec stimulus presentation, occurrence times of action potentials were recorded with a 1 msec resolution. In the worst case, this discretization would add 0.5 msec to our standard deviation measure of temporal jitter (see Section 3.1 below). Other sources of error in the recorded time of action potentials relative to stimulus onset are small compared to the quantization error. The monkey maintained fixation during the stimulus (eye movements were < 0.5°) and indicated the direction of motion with an eye movement after the stimulus was extinguished.

2.2 Data Analysis. We analyzed 54 cells in three monkeys: E (n = 26), J (n = 9), and W (n = 19).
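As a brief aside on the stimulus statistics of Section 2.1: the relation P(a sequence of n dots carries the signal) = c^(n-1) means that coherent streak lengths follow a geometric distribution, so the mean streak length is 1/(1-c) for c < 1. A minimal sketch of this arithmetic; the function names are illustrative and not part of the original experimental code.

```python
def signal_sequence_prob(c, n):
    """Probability that a dot is replotted coherently n times in a row,
    i.e., that a sequence of n dots carries the motion signal: c**(n-1)."""
    return c ** (n - 1)

def expected_streak_length(c):
    """Mean coherent-sequence length implied by the geometric
    distribution P(length >= n) = c**(n-1); valid for c < 1."""
    return 1.0 / (1.0 - c)

# At c = 0.5, a three-dot coherent streak occurs with probability 0.25,
# and the mean streak length is 2 dots.
p = signal_sequence_prob(0.5, 3)   # 0.25
```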
Not all cells were recorded under all experimental conditions, so the number of cells involved in each analysis will be stated in the text. Cells with mean responses that changed by more than 100% during recording sessions were not included. In all computations, the post-stimulus time histogram (PSTH) for a set of spike trains is the average number of action potentials as a function of time relative to stimulus onset and is computed at the same millisecond resolution as the original recordings; the data are not smoothed. The PSTHs shown in the figures, however, have been smoothed using an adaptive square window that is widened to include a criterion number of spikes. A one-sided estimate of the power spectral density of the PSTH is computed using the standard fast Fourier transform (FFT) algorithm and overlapping data segments with windowing (Press et al. 1988). To avoid biasing our statistics with the initial transient response, we restrict our analysis, except where noted, to the 1600 msec "sustained" portion of the spike trains that follows the 400 msec "transient" period beginning at stimulus
onset, t = 0. Bursts of action potentials (consecutive spikes occurring with interspike intervals of 3 msec or less) can create excessive power at low frequencies for some neurons and therefore are replaced by single action potentials using the technique described by Bair et al. (1994). This analysis focuses on properties of sets of spike trains that were recorded from single neurons using the same dynamic dot sequence. For illustration and as a control, sets of responses for randomly seeded stimuli, but with all other stimulus parameters held constant, are analyzed and referred to as "control" data. The control data should not show signs of stimulus-locked temporal modulation in the PSTH because responses to different stimuli are averaged together. However, the control data provide a null hypothesis that is better than a Poisson assumption because known deviations from Poisson temporal structure, e.g., refractory periods, bursts, and drifts in excitability, will be present. A simple test for a violation of Poisson statistics was used to determine the presence of stimulus-locked modulation in a set of spike trains. The average firing rate was determined for the 1600 msec sustained period and taken to be the mean rate of a homogeneous Poisson point process. If the observed firing rate in any segment of the sustained period was improbably high at the chosen significance level, the response was considered to have stimulus-locked modulation. The stringent significance level reflects the inadequacy of the Poisson process to account for the refractory period, burst firing, and nonstationarities that are frequently found in spike trains. Of the 54 cells, 49 were found to respond to a dynamic dot sequence with significant temporal modulation based on 30 trials at c = 0. The test for significant modulation, when run on 54 sets of control data taken from the same number of cells in the same animals, yielded two false positives.
Visual inspection of these false positives revealed that the spike rate changed slowly over the course of the trial and that this trend was independent of the particular dot pattern.
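The burst-replacement step and the modulation test above can be sketched as follows. This is a hedged reconstruction, not the authors' analysis code: the paper does not state the segment width or the exact significance level, so the 50 msec window and 10^-6 threshold are placeholders, and pooling spikes across trials is an assumption.

```python
import math

def collapse_bursts(spike_times_ms, max_isi=3.0):
    """Replace each burst (consecutive spikes with interspike intervals
    of max_isi msec or less) by its first spike."""
    kept, prev = [], None
    for t in spike_times_ms:
        if prev is None or t - prev > max_isi:
            kept.append(t)
        prev = t
    return kept

def poisson_tail(k, lam):
    """P(X >= k) for a Poisson random variable X with mean lam."""
    p = 1.0
    for i in range(k):
        p -= math.exp(-lam) * lam ** i / math.factorial(i)
    return max(p, 0.0)

def has_locked_modulation(trials, t0=400, t1=2000, window=50, alpha=1e-6):
    """Flag stimulus-locked modulation if the spike count pooled across
    trials in any segment of the sustained period [t0, t1) is improbably
    high under a homogeneous Poisson process at the overall mean rate."""
    spikes = sorted(t for trial in trials for t in trial if t0 <= t < t1)
    if not spikes:
        return False
    rate = len(spikes) / (t1 - t0)       # pooled spikes per msec
    lam = rate * window                   # expected pooled count per segment
    for start in range(t0, t1 - window + 1, window):
        count = sum(1 for t in spikes if start <= t < start + window)
        if poisson_tail(count, lam) < alpha:
            return True
    return False
```

Twenty trials that all spike at the same millisecond trip the test, while a single evenly spread train does not.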
3 Results
A neuron presented with 90 different random dot stimuli at c = 0 produced an ensemble of responses (Fig. 1, left) that, except for the initial transient, can be approximated by a point process with a time-invariant mean rate, such as a homogeneous Poisson process modified by a refractory period (Bair et al. 1994). The right side of Figure 1, showing 90 responses of the same cell to one particular c = 0 random-dot stimulus, reveals that the neuron's firing pattern was often very tightly locked to the stimulus. Thus, much of the apparent randomness of the ensemble on the left was caused by the fact that a different random dot pattern was presented on each trial, although some periods of the response remain unstructured. The firing pattern on the right can be modeled to first order by a random process with a time varying mean rate, such as
an inhomogeneous Poisson process. The time-varying modulation, estimated by the PSTH (bottom right), is characterized by narrow peaks, often produced by single action potentials occurring at precise instants across trials. A second neuron responded to a dynamic dot sequence at c = 0 with a much higher firing rate but was still highly modulated (Fig. 2). After characterizing the temporal modulation in the time and frequency domains, we will contrast the patterned responses in Figure 1 and Figure 2 with the case of c = 1 stimulation, i.e., coherent motion, in which modulation is not present when an identical stimulus is repeated.

3.1 Precision and Reliability. For the 49 cells that had statistically significant stimulus-locked modulation (see Section 2.2), we quantified the temporal precision of the spike trains using the standard deviation (SD) in time of the onset of periods of elevated firing, such as those indicated by peaks in the PSTH at the bottom right of Figure 1. This technique is similar to that used by Sestokas and Lehmkuhle (1986). The standard deviation measure will be referred to as temporal "jitter" and has a small value for a precise response. A peak in the PSTH corresponding to a period of significantly elevated firing probability (see Section 2.2) was accepted as well enough isolated for analysis if a point in time preceding the peak existed such that the mean time to the first spike in the response was greater than twice the standard deviation of the distribution of first spike times. For example, one statistically significant peak is marked by a thick line near 1740 msec in the PSTH at the bottom right of Figure 1. Considering a period of 70 msec surrounding that peak, we measure the time from the beginning of the period to the first spike on each trial. The distribution of first spike times is shown in Figure 3A. The SD, or jitter, is 3.3 msec, and yet no action potential occurs for at least 25 msec prior to the response.
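The jitter and reliability measures just described can be sketched as follows, assuming the response window has already been placed to satisfy the isolation criterion above; the spike trains here are hypothetical.

```python
import statistics

def first_spike_stats(trials, window_start, window_len=70.0):
    """Temporal jitter (SD of first-spike latencies across trials) and
    reliability (fraction of trials with at least one spike) for a
    response window of `window_len` msec starting at `window_start`."""
    latencies = []
    for spikes in trials:
        in_window = [t - window_start for t in spikes
                     if window_start <= t < window_start + window_len]
        if in_window:
            latencies.append(min(in_window))   # first spike only
    reliability = len(latencies) / len(trials)
    jitter = statistics.stdev(latencies) if len(latencies) > 1 else 0.0
    return jitter, reliability

# Hypothetical spike times (msec) around a peak near 1740 msec; the
# fourth trial fails to respond, so reliability is 0.75.
trials = [[1740.0], [1743.0], [1746.0], []]
jitter, reliability = first_spike_stats(trials, window_start=1710.0)
# jitter = 3.0 msec, reliability = 0.75
```

Using only the first spike per trial, as in the text, keeps the measure from being biased by refractory periods or the cell's interspike interval statistics.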
The distribution achieved by this method is different from the shape of the peak in the PSTH, which includes all the spikes, not just the first one in a response. In addition, by considering only the first spike in an isolated response period on each trial, our measurement is less likely to be biased by a refractory period or the interspike interval statistics for the neuron. For a few neurons, such as the one in Figure 2, the significant peaks in the PSTH were not well enough isolated to perform the jitter analysis for individual action potentials. In those cases, we searched for the first occurrence of a pair of spikes with less than a specified interspike interval (6 msec in Fig. 2). The distribution of the occurrence of these pairs is shown in Figure 3B. Reliability was measured as the probability that a response occurred during the periods described above. For the response period analyzed in Figure 3A, the cell responded (fired at least one action potential) on 24% of trials (see Fig. 1, right, for spike trains). The same cell had other, more reliable responses-the reliability for the peak near 1000 msec was
Wyeth Bair and Christof Koch
Figure 1: The neuronal response of one cell in area MT in a behaving macaque monkey to randomly seeded dynamic dot stimuli at c = 0 presented for 2 sec (left) appears to be well described by a point process with a mean rate of 3.4 Hz (excluding the initial transient). However, when a dynamic dot stimulus formed with a particular seed was repeated (but interleaved with different stimuli) many times, the reliability of the response became apparent (right). Viewing this figure from an acute angle reveals the precision of the pattern; for example, nearly all spikes in the final 400 msec of the response cluster in six vertical streaks. Below each set of spike trains is a poststimulus time histogram (PSTH) computed from each set of 90 trials using an adaptive square window centered at each point and widened to capture 10 spikes. The set of data on the left is referred to as control data.
0.84. Within the response period marked on the PSTH for the neuron in Figure 2, two spikes with interspike interval ≤ 6 msec occurred on 95% of the trials. The distributions shown in Figure 3A and B correspond to the
Spiking Precision in Visual Cortex
Figure 2: The neuronal response of another MT cell to c = 0 dynamic dot stimulation on 206 trials using a particular stimulus pattern. This neuron (e093) produces a highly modulated response, like the neuron in Figure 1, but has a much higher firing rate (113 Hz, SD 15 Hz, 206 trials). While the responses of most cells contained occasional clearly isolated epochs of elevated firing rate, this cell never dips down to its background firing rate (2 Hz). The temporal precision of the response peak indicated by the thick bar below the PSTH is shown in Figure 3, and the power spectra for both PSTHs are shown in Figure 4. The lower, flat PSTH corresponds to 210 trials of control data. The adaptive window used to smooth the PSTHs is widened to include 40 spikes.
most precise responses for the spike trains shown in Figure 1 (right) and Figure 2. For all 49 cells, the scatter plot in Figure 3C shows reliability versus jitter for the most precise response during the sustained period (filled circles) and for the initial transient, present in only 32 of 49 cells (crosses). For 80% of cells, the most precise response during the sustained period had jitter less than 10 msec and in some cases the jitter was as
Figure 3: MT neurons are temporally precise on the order of milliseconds. We measured precision as the standard deviation (SD), or jitter, of the beginning or ending time of a period of elevated firing, e.g., the periods indicated by the thick lines below the PSTHs at 1750 msec in Figure 1 and near 1600 msec in Figure 2. (A) The distribution across trials of the occurrence time for the first action potential in the response during the period 1710-1780 msec in Figure 1. (B) The distribution of the occurrence time of the first pair of action potentials fired within 6 msec in the time period 1570-1640 msec in Figure 2. (C) The jitter for the most precise response periods is plotted against the response reliability (probability) for 49 cells (solid circles). In 80% of cells, the minimum temporal jitter during the sustained portion of the response was less than 10 msec, with the smallest values near 2-3 msec. For comparison, crosses indicate the jitter of initial transients (present in only 32 of 49 cells). At least 30 repeated trials were used for each cell. Four points exceeded the horizontal scale, with jitter values of 16, 19, 27, and 69 msec.
small as 2-4 msec. The initial transients, when present, typically had less jitter than the most precise sustained period response.

3.2 Frequency Profile.
The previous analysis focused only on the most precise period of the responses in the time domain, but we now examine the entire sustained period response in the frequency domain. Temporal frequency profiles of the responses of the MT cells were computed as the power spectra of the PSTHs for c = 0 stimuli. Spectra are shown in Figure 4 for the PSTHs in Figure 1 (right) and Figure 2. The power spectra are consistent with the notion that the cells act as low-pass filters for the white noise dot stimulus; however, it is important that these profiles are not mistaken for temporal frequency tuning curves (see
Figure 4: Temporal frequency cutoffs for area MT cells in response to dynamic random dots. The upper panels show the power spectra of the PSTHs for the neuronal responses shown in Figure 1 (right) and Figure 2. We defined the cutoff frequency for each cell as the lowest frequency at which the control spectrum intersects the response spectrum. The lower panel shows the distribution of frequency cutoffs for 22 cells from monkey E. (Monkey E was the only animal for which control data were available, see text.)
Discussion). There were no systematic peaks in the spectra at particular frequencies; for example, there was no stimulus refresh artifact as the dots were plotted asynchronously. To compute an upper cutoff frequency for individual cells, we compared the power spectrum for a particular stimulus pattern to that for a PSTH computed from control data in which all trials resulted from different c = 0 stimuli. The cutoff frequency was taken to be the lowest frequency at which the control power spectrum intersected the response power spectrum. The histogram of cutoff frequencies for 22 cells from monkey E reveals a range of values from 0 to 150 Hz (mean ± standard
deviation = 58 ± 38 Hz, Fig. 4, bottom). Response data and control data from the same cell were available only for monkey E, but data recorded for individual coherent (c = 1) motion stimuli for J and W served as a control because, as reported in the next section, modulation was virtually absent for coherent motion. The distribution of cutoff frequencies for nine cells from monkey J had a mean (46 ± 10 Hz) that was not significantly different from that for E (t test, p = 0.19), while the mean (23 ± 12 Hz) for six cells from W was significantly lower than that for both E and J (t test, p < 0.005). The distribution of cutoff frequencies shown in Figure 4 is consistent with our analysis of data from other experiments, using a different display, in which 73% of cells (22 of 30) in monkey J showed a peak in their power spectrum at 40 Hz when a coherent dot stimulus (c = 1) was presented in frames at 40 Hz. In an experiment using moving bars, 24% of cells (12 of 49) in a fourth monkey (R) showed peaks in their spectra at 60 Hz when the bar moved on a 60 Hz frame-refresh monitor. Because the autocorrelation of a function is the Fourier transform of its power spectrum, these results may be interpreted in the time domain from the autocorrelation of the PSTH. The autocorrelations (computed after subtracting the means from the PSTHs) displayed a single peak before falling to zero with a width at half-height of 36 ± 20 msec (means for individual animals were: E 25 ± 9.2, J 33 ± 16, W 50 ± 24). Note that both the autocorrelation function and the power spectrum are computed from the PSTH, and this is different from computing these functions for individual trials and averaging afterward (as done by Bair et al. 1994).
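The cutoff-frequency definition used above (the lowest frequency at which the control spectrum meets the response spectrum) can be sketched as follows. The PSTH spectral scaling, the DC-skipping floor, and the function names are our assumptions for illustration, not the authors' code:

```python
import numpy as np

def psth_power_spectrum(psth, dt_ms):
    """One-sided power spectrum of a mean-subtracted PSTH.

    psth: firing rate (or probability) per bin; dt_ms: bin width in msec.
    The mean is removed so the DC component does not dominate.
    """
    x = np.asarray(psth, dtype=float)
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, d=dt_ms / 1000.0)  # in Hz
    return freqs, power

def cutoff_frequency(freqs, response_power, control_power, f_min=2.0):
    """Lowest frequency (above f_min, to skip the DC region) at which the
    control spectrum first meets or exceeds the response spectrum."""
    for f, pr, pc in zip(freqs, response_power, control_power):
        if f >= f_min and pc >= pr:
            return float(f)
    return float(freqs[-1])
```

For a response spectrum that falls off while the control spectrum stays flat, the returned value approximates the frequency at which the stimulus-locked structure sinks into the noise floor.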
3.3 Response to Coherent Motion.
The presence of temporal modulation depended on the motion coherence of the stimulus. While it was apparent for low coherence stimuli as shown in Figures 1 and 2, it was absent for highly coherent motion, i.e., c = 1 (Fig. 5). We defined a measure M of the overall modulation strength based on the power spectrum of the PSTH. M was the integral of the power spectrum in the 4-30 Hz band divided by the mean spike rate across the PSTH. As shown in Figure 6 and by the range of cutoff frequencies in Figure 4, temporal modulation showed up as excessive contributions to this frequency band. In monkeys J and W, repeated stimulation using a particular dot pattern was performed at higher coherence levels, c = 0.5 and 1.0, in addition to c = 0. In both animals, the modulation strength M was not significantly different at c = 0 versus c = 0.5 (p > 0.20, paired t test). Yet, for both animals, M was significantly less at c = 1 compared to c = 0.5 (statistical significance: p < 0.005 for monkey J, p < 0.05 for monkey W).¹
¹Because c = 1.0 stimulation was not used for all cells, these computations were based on nine cells in monkey J and six cells in monkey W. In all computations, the average M computed for control data from other cells within the same animal was subtracted.
Figure 5: Temporal modulation disappears for highly coherent stimuli. The spike trains and PSTHs demonstrate that the stimulus-locked temporal modulation present for incoherent motion (c = 0) and for partially coherent motion (c = 0.5) was virtually absent during the sustained period of the response to coherent motion (c = 1). This suggests that temporal dynamics of a higher order than those found in rigid translation are necessary to induce a specific and unique time course in the spike discharge pattern.
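The PSTHs in Figures 1, 2, and 5 were smoothed with an adaptive square window widened to capture a fixed number of spikes (10 or 40, per the captions). A rough sketch of that style of estimator, under our own assumptions (sorted spike times in msec; the geometric widening schedule and names are ours, not the authors' implementation):

```python
import numpy as np

def adaptive_psth(trials, eval_times_ms, n_spikes=10, max_half_ms=200.0):
    """Adaptive square-window PSTH: at each evaluation time the window is
    widened until it contains at least `n_spikes` spikes pooled across
    trials (or hits max_half_ms), then the local rate in Hz is returned.
    `trials` is a list of sorted spike-time arrays in msec.
    """
    pooled = np.sort(np.concatenate(trials))
    n_trials = len(trials)
    rates = np.empty(len(eval_times_ms))
    for i, t in enumerate(eval_times_ms):
        half = 1.0
        count = np.searchsorted(pooled, t + half) - np.searchsorted(pooled, t - half)
        while count < n_spikes and half < max_half_ms:
            half *= 1.5
            count = np.searchsorted(pooled, t + half) - np.searchsorted(pooled, t - half)
        # spikes per trial per second inside the final window
        rates[i] = count / (n_trials * 2.0 * half / 1000.0)
    return rates
```

The adaptive window keeps the variance of the rate estimate roughly constant: it is narrow where firing is dense (preserving sharp peaks) and wide where firing is sparse.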
We note that M was not significantly correlated with spike rate (r = 0.09, p = 0.50, M vs. log of spike rate) nor with the diameter of the stimulus aperture, which was optimized for the receptive field of each cell (r = 0.15, p = 0.31).
Figure 6: Power spectra of the PSTHs in Figure 5. The temporal modulation, reflected by excess power at frequencies below 60 Hz, was similar for c = 0 and c = 0.5 but was absent for coherent motion. Curves for values between c = 0 and c = 0.5 (not shown) are similar to those shown for c = 0 and c = 0.5. Rarely were data recorded at values between c = 0.5 and c = 1.0, so intermediate curves cannot be shown. The PSTHs were expressed as instantaneous firing probabilities and the resulting power spectra were normalized by the mean firing rate. Our measure M of temporal modulation was the integral of the scaled spectrum in the 4-30 Hz band. Fifty-nine spike trains were included at each c. The first 400 msec of the PSTHs (shown in Fig. 5) was excluded to eliminate the initial transient response, which would otherwise contribute power in the 4-30 Hz band for all c.

4 Discussion
We have observed that a dynamic dot stimulus can produce periods of precise modulation in neurons in area MT. While it is common practice to seek the stimulus that causes the "largest activity" (Lettvin et al. 1959), we have sought those periods in which the stimulus caused the most precise activity in order to estimate the temporal precision with which the cortical network can trigger an action potential within a neuron. Using the standard deviation measure of temporal jitter, we found that 80% of
cells were capable of responding with jitter less than 10 msec, and the most precise responses had jitter less than 2 msec. The reliability of the cell, its probability of contributing to a peak in the PSTH on a particular trial, varied widely from 0 to 1 (Fig. 3). The output frequency profiles of the averaged responses of MT cells (Fig. 4) are all low-pass with a broad range of cutoffs, some above 100 Hz, while the visual input provided by the incoherent dynamic dot stimulus has a flat temporal frequency spectrum.² We emphasize that the frequency profiles of the responses reported here should not be interpreted as temporal frequency tuning curves. We do not know, for example, whether power near 50 Hz in the response is caused by 50 Hz components in the stimulus or arises from a computation such as squaring or rectifying a 25 Hz input component.³ However, it is not unreasonable to assume that area MT processes information that has frequency components in the 100s of Hz since apparent motion can be perceived for temporal separations of less than 5 msec (Baker and Braddick 1985). Even under the assumption that all inputs to area MT have relatively low cutoff frequencies, high-frequency signals may still be reconstructed from the spatial distributions of the inputs, in analogy to hyperacuity. But there is no lack of evidence that precise and high-frequency responses are carried by single neurons earlier on. Fast initial transients (having 50-100 Hz oscillations) are observed in magnocellularly derived responses in V1 (Maunsell and Gibson 1992) and, in the cat, retinal ganglion cells can lock their output to 100 Hz flicker (Eysel and Burandt 1984).
The stimulus-locked temporal modulation that accompanies incoherent and partially coherent motion is not present for coherent motion, even though the same dots are flashed at the same time and location from one presentation of the coherent stimulus to the next and in a manner such that only a single moving dot is likely to traverse a V1 subunit of an MT receptive field at any time (within 50 msec or longer). In terms of the output, spatial inhomogeneities in the MT receptive field (so called "hot spots"), if they exist, are not apparent for this type of rigid pattern translation. The lack of hot spots is consistent with our models of MT receptive fields (see below). We do not believe that the lack of modulation for c = 1 is the result of saturation of the neuron under study because narrow peaks exist in cross-correlograms between pairs of simultaneously recorded neurons at c = 1 (Ehud Zohary, unpublished observations). Snowden et al. (1992), using a smaller, much denser stimulus (3° aperture), estimated that 90% of MT cells did not modulate to a moving random dot stimulus; however, because the motion coherence of their stimulus is approximately equivalent to that of a c = 0.96 stimulus here, their results are consistent with the lack of modulation
²In practice, the approximately 150 μsec lifetime of the dots is so short that the departure from a flat temporal spectrum is not significant to the monkey visual system.
³Squaring a sine wave causes frequency doubling, and rectification introduces even higher harmonics.
that we find at c = 1.0. For coherent motion in the neuron's preferred direction, the neuronal response seems to lack precise temporal structure, and only the mean spike count is reproducible. In this case, the timing of individual action potentials may be governed by noise, but if more careful experimentation reveals that the spikes are precisely locked to internal fluctuations in the cortical network, we may realize, as Barlow (1972) observed for individual nerve cells, that "their apparently erratic behavior was caused by our ignorance, not the neuron's incompetence." The data reported here have some analogy to that recorded in other studies that have been successful in estimating how much signal is carried in seemingly variable neuronal discharge. In the work of Bialek et al. (1991), a random motion stimulus also produced stimulus-locked temporal modulation with precision on the order of milliseconds (although their analysis was not limited to this time scale). In subsequent experiments in primary afferents of the cricket cercus and of the bullfrog sacculus, it was found that at least half of the entropy in the spike trains carried information regarding the stimulus (Rieke et al. 1993). Their analysis relied on estimates of the limit of timing precision of an action potential, which they found to be 0.4 msec in the cricket and 2 msec in the frog. It will be interesting to determine whether the precision of the responses reported here will yield similar results for visual cortical neurons and how coding efficiency will vary across different visual stimuli and visual areas. Of course, these computations require knowledge of the time-varying input signal, which is not available for the current data set. In other work (Bair 1995) we examined how the temporal modulation reported here compares to the output of a motion energy model operating on the dynamic dot stimuli.
Using a model that spatially integrates the output of opponent motion energy units (Adelson and Bergen 1985) across an area the size of an MT receptive field (5-10° diameter), it is possible to achieve modulation with precision and power spectra similar to that shown here in Figures 3 and 4 for incoherent, i.e., c < 1, stimuli (Bair 1995). The model also accounts for the lack of modulation in response to c = 1 stimuli. The lack of modulation in the model output is not the result of saturation but is a consequence of integration over an even and sufficiently dense (as specified by the sampling theorem) grid of motion energy units. Although the statistics of a motion energy based model can be made to match those of the data reported here, it is important to consider that the model was constructed of noiseless linear filters and perfect multiplications and additions, with the only stochastic element being the generation of Poisson impulses from the analog output at the final stage of the model. Therefore, we believe it remains a challenge to theories of cortical processing (Stevens 1994) to explain how the observed low probabilities of synaptic transmission in cortical brain slices (Smetters and Nelson 1993; Thomson et al. 1993; Allen and Stevens 1994), combined with single synaptic contacts among pyramidal cells in mammalian cortex (Freund et al. 1985; Andersen 1990; Gulyas et al. 1993), can give rise to these highly reproducible spike patterns in cells roughly 7 to 8 synapses remote from the sensory periphery over a 2-hr-long experiment in a behaving animal (see also Lestienne and Strehler 1987; Abeles et al. 1993). Mainen and Sejnowski (1995) report that sustained current injection into cells in neocortical slice leads to a variable spike response, while repeated stimulation with a particular white noise current evokes a reliable spike pattern. Their findings in cortical cells suggest that the spike triggering mechanism itself is capable of accurately encoding temporally modulated input into spike trains, possibly providing the biophysical substrate of our results. It remains to be determined to what extent the rapid temporal modulation reported here carries any detailed information of behavioral significance.
Acknowledgments

We thank William T. Newsome for kindly providing data collected in his laboratory and for extensive critical discussion that has shaped the course of this analysis. Kenneth H. Britten, Michael N. Shadlen, Ehud Zohary, and J. Anthony Movshon have contributed greatly through critical discussion and by sharing data that they have collected. This work was funded by the Office of Naval Research and the Air Force Office of Scientific Research. W. B. was supported by the L. A. Hanson Foundation.
References

Abeles, M., Bergman, H., Margalit, E., and Vaadia, E. 1993. Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol. 70, 1629-1638.
Adelson, E. H., and Bergen, J. R. 1985. Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A 2, 284-299.
Adrian, E. 1928. The Basis of Sensation: The Action of Sense Organs. Christophers, London.
Allen, C., and Stevens, C. F. 1994. An evaluation of causes for unreliability of synaptic transmission. Proc. Natl. Acad. Sci. U.S.A. 91, 10380-10383.
Andersen, P. 1990. Synaptic integration in hippocampal CA1 pyramids. Prog. Brain Res. 83, 215-222.
Bair, W. 1995. Analysis of temporal structure in spike trains of visual cortical area MT. Ph.D. Dissertation, Department of Computation and Neural Systems, California Institute of Technology, Pasadena, CA.
Bair, W., Koch, C., Newsome, W. T., and Britten, K. H. 1994. Power spectrum analysis of bursting cells in area MT in the behaving monkey. J. Neurosci. 14, 2870-2892.
Baker, C. L., and Braddick, O. J. 1985. Eccentricity-dependent scaling of the limits for short-range apparent motion perception. Vision Res. 25, 803-812.
Barlow, H. B. 1972. Single units and perception: A neuron doctrine for perceptual psychology. Perception 1, 371-394.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., and Warland, D. 1991. Reading a neural code. Science 252, 1854-1857.
Britten, K. H., Shadlen, M. N., Newsome, W. T., and Movshon, J. A. 1992. The analysis of visual motion: A comparison of neuronal and psychophysical performance. J. Neurosci. 12, 4745-4765.
Britten, K. H., Shadlen, M. N., Newsome, W. T., and Movshon, J. A. 1993. Response of neurons in macaque MT to stochastic motion signals. Visual Neurosci. 10, 1157-1169.
Derrington, A. M., and Lennie, P. 1984. Spatial and temporal contrast sensitivities of neurones in lateral geniculate nucleus of macaque. J. Physiol. 357, 219-240.
Eysel, U. T., and Burandt, U. 1984. Fluorescent tube light evokes flicker responses in visual neurons. Vision Res. 24, 943-948.
Foster, K. H., Gaska, J. P., Nagler, M., and Pollen, D. A. 1985. Spatial and temporal frequency selectivity of neurones in cortical areas V1 and V2 of the macaque monkey. J. Physiol. 365, 331-363.
Freund, T. F., Martin, K. A. C., and Whitteridge, D. 1985. Innervation of cat visual area 17 and area 18 by physiologically identified X-type and Y-type thalamic afferents. I. Arborization patterns and quantitative distribution of postsynaptic elements. J. Comp. Neurol. 242, 263-274.
Gulyas, B., Orban, G. A., Duysens, J., and Maes, H. 1987. The suppressive influence of moving textured backgrounds on responses of cat striate neurons to moving bars. J. Neurophysiol. 57, 1767-1791.
Gulyas, A. I., Miles, R., Sik, A., Toth, K., Tamamaki, N., and Freund, T. F. 1993. Hippocampal pyramidal cells excite inhibitory neurons through a single release site. Nature (London) 366, 683-687.
Hammond, P., and MacKay, D. M. 1977. Differential responsiveness of simple and complex cells in cat striate cortex to visual texture. Exp. Brain Res. 30, 275-296.
Heggelund, P., and Albus, K. 1978. Response variability and orientation discrimination of single cells in striate cortex of cat. Exp. Brain Res. 32, 197-211.
Henry, G. H., Bishop, P. O., Tupper, R. M., and Dreher, B. 1973. Orientation specificity and response variability of cells in the striate cortex. Vision Res. 13, 1771-1779.
Lee, B. B., Martin, P. R., and Valberg, A. 1989. Sensitivity of macaque retinal ganglion cells to chromatic and luminance flicker. J. Physiol. 414, 223-243.
Lestienne, R., and Strehler, B. L. 1987. Time structure and stimulus dependence of precisely replicating patterns present in monkey cortical neuronal spike trains. Brain Res. 437, 214-238.
Lettvin, J. Y., Maturana, H. R., McCulloch, W. S., and Pitts, W. H. 1959. What the frog's eye tells the frog's brain. Proc. Inst. Radio Eng. 47, 1940-1951.
Levitt, J. B., Kiper, D. C., and Movshon, J. A. 1994. Receptive fields and functional architecture of macaque V2. J. Neurophysiol. 71, 2517-2542.
Mainen, Z. F., and Sejnowski, T. J. 1995. Reliability of spike timing in neocortical neurons. Science 268, 1503-1506.
Maunsell, J. H. R., and Gibson, J. R. 1992. Visual response latencies in striate cortex of the macaque monkey. J. Neurophysiol. 68, 1332-1344.
McLean, J., and Palmer, L. A. 1989. Contribution of linear spatiotemporal receptive field structure to velocity selectivity of simple cells in area 17 of cat. Vision Res. 29, 675-679.
Newsome, W. T., Britten, K. H., and Movshon, J. A. 1989. Neuronal correlates of a perceptual decision. Nature (London) 341, 52-54.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1988. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge.
Richmond, B. J., Optican, L. M., Podell, M., and Spitzer, H. 1987. Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. I. Response characteristics. J. Neurophysiol. 57, 132-146.
Richmond, B. J., Optican, L. M., and Spitzer, H. 1990. Temporal encoding of two-dimensional patterns by single units in primate primary visual cortex. I. Stimulus-response relations. J. Neurophysiol. 64, 351-369.
Rieke, F., Warland, D., and Bialek, W. 1993. Coding efficiency and information rates in sensory neurons. Europhys. Lett. 22, 151-156.
Schiller, P. H., Finlay, B. L., and Volman, S. F. 1976. Short-term response variability of monkey striate neurons. Brain Res. 105, 347-349.
Sclar, G., Maunsell, J. H. R., and Lennie, P. 1990. Coding of image contrast in central visual pathways of the macaque monkey. Vision Res. 30, 1-10.
Sestokas, A. K., and Lehmkuhle, S. 1986. Visual response latency of X- and Y-cells in the dorsal lateral geniculate nucleus of the cat. Vision Res. 26, 1041-1054.
Shadlen, M. N., and Newsome, W. T. 1994. Noise, neural codes and cortical organization. Curr. Opin. Neurobiol. 4, 569-579.
Smetters, D. K., and Nelson, S. B. 1993. Estimates of functional synaptic convergence in rat and cat visual cortical neurons. Soc. Neurosci. Abstr. 19, 628.
Snowden, R. J., Treue, S., and Anderson, R. A. 1992. The response of neurons in areas V1 and MT of the alert rhesus monkey to moving random dot patterns. Exp. Brain Res. 88, 389-400.
Softky, W. R., and Koch, C. 1993. The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci. 13, 334-350.
Stevens, C. F. 1994. What form should a cortical theory take? In Large-Scale Neuronal Theories of the Brain, C. Koch and J. L. Davis, eds., pp. 239-255. MIT Press, Cambridge, MA.
Thomson, A. M., Deuchars, J., and West, D. C. 1993. Large, deep layer pyramid-pyramid single axon EPSPs in slices of rat motor cortex display paired-pulse and frequency-dependent depression, mediated presynaptically, and self-facilitation, mediated postsynaptically. J. Neurophysiol. 70, 2354-2369.
Tolhurst, D. J., Movshon, J. A., and Thompson, I. D. 1981. The dependence of response amplitude and variance of cat visual cortical neurones on stimulus contrast. Exp. Brain Res. 41, 414-419.
Tolhurst, D. J., Movshon, J. A., and Dean, A. F. 1983. The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Res. 23, 775-785.
Tomko, G. J., and Crapper, D. R. 1974. Neuronal variability: Non-stationary responses to identical visual stimuli. Brain Res. 79, 405-418.
Vogels, R., Spileers, W., and Orban, G. A. 1989. The response variability of striate cortical neurons in the behaving monkey. Exp. Brain Res. 77, 432-436.
Werner, G., and Mountcastle, V. B. 1963. The variability of central neural activity in a sensory system, and its implications for the central reflection of sensory events. J. Neurophysiol. 26, 958-977.
Received August 7, 1995; accepted February 2, 1996.
Neural activity in areas V1, V2 and V4 during free viewing of natural scenes compared to controlled viewing. NeuroReport 9:7, 1673-1678. [CrossRef] 36. Richard S. Zemel , Peter Dayan , Alexandre Pouget . 1998. Probabilistic Interpretation of Population CodesProbabilistic Interpretation of Population Codes. Neural Computation 10:2, 403-430. [Abstract] [PDF] [PDF Plus] 37. Jack L. Gallant, Charles E. Connor, David C. Van Essen. 1998. Neural activity in areas V1, V2 and V4 during free viewing of natural scenes compared to controlled viewing. NeuroReport 9:1, 85-89. [CrossRef] 38. S. Strong, Roland Koberle, Rob de Ruyter van Steveninck, William Bialek. 1998. Entropy and Information in Neural Spike Trains. Physical Review Letters 80:1, 197-200. [CrossRef]
39. Anne-Kathrin Warzecha, Martin Egelhaaf. 1997. How Reliably Does a Neuron in the Visual Motion Pathway of fhe Fly Encode Behaviourally Relevant Information?. European Journal of Neuroscience 9:7, 1365-1374. [CrossRef]
Communicated by Eric Mjolsness
Neural Network for Dynamic Binding with Graph Representation: Form, Linking, and Depth-from-Occlusion

James R. Williamson

Center for Adaptive Systems and Department of Cognitive and Neural Systems, Boston University, 677 Beacon Street, Boston, MA 02215 USA
A neural network is presented that explicitly represents form attributes and relations between them, thus solving the binding problem without temporal coding. Rather, the network creates a graph representation by dynamically allocating nodes to code local form attributes and establishing arcs to link them. With this representation, the network selectively groups and segments objects in depth based on line junction information, producing results consistent with those of several recent visual search experiments. In addition to depth-from-occlusion, the network provides a sufficient framework for local line-labeling processes to recover other three-dimensional (3-D) variables, such as edge/surface contiguity, edge slant, and edge convexity.

1 Introduction
Visual object recognition in humans is largely invariant to viewpoint. The two-dimensional (2-D) projections of local object features and their relations change dramatically with change in viewpoint. However, the three-dimensional (3-D) structural relationships that compose the object do not. Therefore, viewpoint-invariant form representations should explicitly encode the local form attributes and relations between them, so that the stable 3-D structural description can be recovered from the unstable 2-D information. These computational considerations, backed up by many psychophysical results, support structural description models of human visual shape classification, which have independent, explicit representations of form attributes and relations between these attributes. Alternative approaches, such as template matching and feature list matching, instead "trade off the capacity to represent attribute structures with the capacity to represent relations" (Hummel and Biederman 1992). Models of early vision typically use a topographic representation, in which several features that compose a local form attribute are coded at each position in a 2-D lattice. These features are explicitly bound by the architecture to their spatial position, but feature conjunctions at the same position, as well as structural relations between positions, such as

Neural Computation 8, 1203-1225 (1996) © 1996 Massachusetts Institute of Technology
"same edge," "on top of," or "belongs to," are left implicit. To explicitly represent these relations requires binding, which is problematic in a neural network architecture due to cross-talk between the highly interconnected representational units (Hummel and Biederman 1992; Finkel and Sajda 1992, 1994; Sajda and Finkel 1995; Barlow 1981; Feldman and Ballard 1982). One general binding approach is temporal coding, in which attribute conjunctions are represented by temporal correlations in the outputs of different neurons (Hummel and Biederman 1992; Engel et al. 1992). Support for this idea stems from roughly 50 Hz oscillations found in cortex, with high temporal correlations between neurons responding to spatially distant parts of the same edge of a moving bar, but low correlations between neurons responding to the edges of different bars moving in opposite directions (Engel et al. 1992). This approach suffers from intrinsic capacity limitations, however, due to the limited temporal resolution of neurons. Only a small number of different bindings can be simultaneously encoded, and the length of time required to measure neural temporal correlations may be too great to subserve real-time visual binding (Hummel and Biederman 1992). An alternative binding approach is spatial coding, in which each attribute conjunction is explicitly coded by a separate unit (Feldman and Ballard 1982; Hinton et al. 1986). However, using conjunctive codes at each lattice position results in an unacceptable combinatorial explosion of units. The combinatorial explosion can be alleviated with spatial coarse coding, although this is effective only given a spatially sparse distribution of attributes, since the degree of confusion between different attributes varies with the coarseness of the code, or inversely with the spacing between attributes (Hinton et al. 1986).
A spatial coding scheme (Feldman and Ballard 1982) that alleviates the combinatorial problem uses units with large dendritic fields and simple dendritic processing that make them receptive to local feature conjunctions, invariant to position in the sampling lattice. Here, a combinatorial explosion of conjunctive units is avoided, but at the cost of spatial uncertainty and an inability to distinguish the number of copies of the same feature conjunction at different spatial positions. One solution to this problem would be to dynamically assign each conjunctive unit to a different, but useful, lattice position. The Graph of Relations and Form (GRAF) model takes this approach by dynamically binding nodes coding local form attributes to critical image locations, and then binding the nodes into links with each other, thus producing a graph representation capable of coding 3-D structure. In contrast to the many recent models of dynamic binding that use temporal codes, the GRAF model creates bindings with a combination of dynamic gating of signals to large dendritic fields and competitive interactions. Therefore, the GRAF model suggests a completely new approach for solving the binding problem in distributed neural architectures.
The GRAF model uses local form attribute information to guide linking across gaps, linking behind objects (amodal linking), and depth segmentation based on occlusion, thereby producing results consistent with those of some recent visual search experiments (Enns and Rensink 1994; Rensink and Enns 1995). In addition, the GRAF model’s representation is sufficient to support local line-labeling processes for recovering 3-D variables such as surface contiguity, edge slant, and edge convexity (Rensink 1992; Enns and Rensink 1991).
1.1 Graph Representations. A graph representation of a visual scene can consist of graph nodes that code local scene attributes, and graph arcs that code relationships and control communication between them, useful for relaxation labeling (Hummel and Zucker 1983). The GRAF model approaches the binding problem by creating such a graph representation with a neural network. Note that we use the term node to refer to a graph node, and the term neuron to refer to the basic processing unit of the neural network. The GRAF model’s representation consists of groups of neurons that correspond to nodes and arcs. A node codes local form attributes and 3-D structural information where it is dynamically bound, while an arc links two nodes that represent different spatial locations, and controls all communication between them. Each node and each arc is made up of several neurons that serve a few different processing roles. For a node, these roles consist of coding where the node is dynamically bound, what form attributes the node represents there, in what directions the node should link with other nodes, and 3-D information, such as depth, that the node represents. For an arc, these roles consist of coding the angle and the distance between the locations where its two nodes are dynamically bound, as well as the strength and direction of a link between the two nodes.
1.2 Overview of GRAF Model. For a neural network to create such a graph representation, certain implementational constraints need to be realized. First, neurons coding local form attributes, which compose the nodes, should be spatially flexible. Static allocation of a node at each lattice position results in a combinatorial explosion of neurons, since each node requires many neurons to code a full conjunctive set of form features. Rather, nodes are dynamically allocated to spatial positions as a function of featural salience. Neurons composing a node are thus capable of coding local information from many possible spatial positions. Second, each node represents a unique spatial position, so that the mapping from coded spatial positions to nodes is one-to-one. Otherwise, a many-to-one mapping would entail losing spatial and featural identity, while a one-to-many mapping would result in an inefficient allocation of nodes.
Third, links between pairs of nodes are explicitly coded. The strength of a link is based on the spatial positions and attributes coded by its two nodes. Due to the spatial flexibility of nodes, their spatial relations are recovered only after the nodes are dynamically allocated to positions. After nodes are allocated and code local feature conjunctions, a competitive selection process establishes links (active arcs), thus binding nodes with each other. The GRAF model is illustrated in Figure 1. In Figure 1a, four "potential" nodes (circles) and six arcs (dotted lines) wait to code an input. Given a visual input of the triangle in Figure 1b, a saliency map of the important lattice positions (Fig. 1c) is activated, and three of the nodes are dynamically allocated to the positions of local maxima in the saliency map, coding the local form attributes (Fig. 1d). Based on spatial relations and form attributes of the nodes, the appropriate arcs become activated, as shown by bold dotted lines in Figure 1e. The information coded by the nodes and arcs is schematically shown in Figure 1f, demonstrating a compressed representation of the object with explicit bindings between line junctions. In the remainder of the paper, the GRAF model's equations are described and simulations of the model are shown. The GRAF model consists of four stages: (1) feature extraction, in which simple features are topographically represented; (2) feature abstraction, in which the topographic feature representation is spatially abstracted into nodes, which code form attributes and their spatial positions; (3) attribute linking, in which nodes establish pairwise links (activated arcs) based on their form attributes and spatial relations; and (4) depth segmentation, in which relative depth is initially estimated at each node, based on local form attributes, and estimates are subsequently refined by relaxation between linked nodes.

2 Feature Extraction
The first stage of the GRAF model is feature extraction within retinotopic coordinates. The output of this stage consists of local form measurements from oriented complex and end-stopped cells at each 2-D lattice position. Oriented complex cells represent smooth object boundaries, while end-stopped cells represent boundary discontinuities or segments of high curvature. Many types of boundaries, such as texture boundaries and illusory contours, are currently ignored for the sake of simplicity. The complex and single end-stopped cell responses are obtained using oriented filters and subsequent nonlinearities in a process adapted from Heitger et al. (1992). Complex cells represent boundaries, invariant to direction-of-contrast, in K orientations (K = 12). Single end-stopped cells represent boundary discontinuities in 2K directions. The complex and end-stopped responses are combined across orientation to produce two saliency maps, a continuation saliency map (CSM) and a junction saliency
Figure 1: Illustration of GRAF model: (a) four unallocated nodes (circles) and six arcs (dotted lines); (b) input image of triangle; (c) resulting activity of saliency map, with local maxima at line junctions; (d) three nodes are allocated to local maxima of the saliency map, coding position and local form attributes; (e) based on codes of graph nodes, arcs between appropriate pairs are activated; (f) information coded by the network is depicted by placing nodes at their coded positions, iconically representing form attributes coded by each node, and showing active arcs between nodes.
map (JSM). The CSM is computed from the maximum complex cell response following oriented contrast-enhancement,
CSM_{x,y} = max_k [(C_k ∗ B_k)_{x,y}]⁺   (2.1)

where [z]⁺ = max(z, 0), C_{x,y,k} denotes the complex cell response at position (x, y) and orientation k, and B_k is an oriented difference-of-offset-gaussian filter made up of three gaussian kernels with σ = 2: a positive center kernel and two (half-strength) inhibitory flanking kernels, perpendicularly offset by one standard deviation. The JSM is computed from end-stopped cells with shunting center-surround enhancement,
(2.2)

where E_{x,y} denotes the net end-stopped response (averaged across orientation) at position (x, y), and G_σ refers to a gaussian kernel with standard deviation σ. Parameters are α = 0.01, β = 10.0, γ = 1.25, and λ = 35.0. The feature extraction stage is schematically illustrated in Figure 2, which shows the CSM and JSM resulting from an example input image.

3 Feature Abstraction
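A minimal sketch of the CSM computation described above, assuming the oriented contrast enhancement with B_k has already been applied; the function and variable names are ours, not the paper's:

```python
K = 12  # number of orientations

def rect(z):
    """Half-wave rectification, [z]^+ = max(z, 0)."""
    return max(z, 0.0)

def csm_value(responses):
    """CSM value at one lattice position: maximum over the K orientations
    of the rectified, contrast-enhanced complex-cell responses."""
    return max(rect(r) for r in responses)

# One position with mixed-sign responses across orientations.
print(csm_value([-0.3, 0.8, 0.1] + [0.0] * 9))  # -> 0.8
```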
In the feature abstraction stage, nodes are allocated to lattice positions, where they code local form attributes, as illustrated in Figure 1d. This process occurs in parallel for two sets of nodes, continuation and junction nodes, which code smooth boundaries and boundary discontinuities, respectively. Continuation nodes are allocated to active locations of the CSM, while being prevented from coding the same locations as junction nodes, which are allocated to active locations of the JSM. Each node has several types of neurons: position (P), form attribute (FA), grouping (G), and depth (D) neurons. P neurons code the location that a node binds to, FA neurons code the form attributes at that location, G neurons code the directions in which a node tries to group with other nodes, and D neurons code depth.

3.1 Allocation of Nodes. Each CSM or JSM neuron, in retinotopic coordinates, sends output to a single P neuron of several different nodes. The P neurons in each node make up a P-map, which is topographic and in one-to-one correspondence with a spatially offset chunk of the CSM or JSM. For flexibility, P-maps of nearby nodes spatially overlap, as shown in Figure 3a. A node is allocated to an active saliency map location by choosing a single P neuron through two simultaneous competitive processes. The first is winner-take-all competition between different nodes for the same general location of the saliency map. This is accomplished by feedback suppression from P neurons to the nearby output signals from the saliency map that feed to the other nodes, as well as to nearby P neurons of the same node. The second is winner-take-all competition for different spatial locations between neurons of each P-map. These two inhibitory processes are illustrated in Figure 3a. The allocation process
[Figure 2 schematic: Original Image → Simple Cells; maximum across orientation → Continuation Saliency Map; sum across orientation and contrast-enhance → Junction Saliency Map]
Figure 2: Feature extraction example, given a 100 x 100 pixel input image of three overlapping bars. Activity in the CSM and JSM is shown.

establishes one-to-one mappings between salient locations and nodes. A node's chosen P neuron determines the image location where the "templates" of the form attribute neurons are centered, and thus the location coded by the node. P neurons are activated as follows. We use a geometry in which the coordinates (g, h) of nodes, (i, j) of P neurons, and (x, y) of the saliency and feature maps are related by

(x, y) = (10g + i, 10h + j)   (3.1)
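The lattice geometry of equation (3.1) can be sketched as a small helper; the function name and the overlap example are ours:

```python
# Our helper for the lattice geometry of equation (3.1): P neuron (i, j)
# of node (g, h) codes absolute position (10g + i, 10h + j).
def absolute_position(g, h, i, j, spacing=10):
    return (spacing * g + i, spacing * h + j)

# Because P-maps are larger than the node spacing, neighbouring nodes
# overlap: these two distinct (node, P neuron) pairs code the same point.
assert absolute_position(0, 0, 12, 3) == absolute_position(1, 0, 2, 3) == (12, 3)
```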
We have 7 x 7 junction nodes, each with a 40 x 40 P-map, and 9 x 9
Figure 3: (a) Competitive processes that ensure one-to-one mappings from local maxima in a saliency map to a single neuron in a position (P) map are shown. Inhibitory signals are depicted with bold lines. The active P neuron inhibits all other neurons in its map, and also suppresses output signals from nearby positions in the saliency map to its map and to nearby, overlapping, P-maps. (b) Positional gating of form attribute (FA) neurons by an active P neuron is shown. Bold lines indicate positional "where" signals, while thin lines indicate featural "what" signals.
continuation nodes, each with a 20 x 20 P-map. P neurons obey a competitive shunting equation,
where S_{g,h,i,j} is the strength of the input signal, and [P_{g,h,i,j} − Γ]⁺ is a P neuron's output signal. S_{g,h,i,j} is determined by three factors: (1) saliency map activation at (x, y), (2) intrinsic pathway strength, and (3) feedback suppression from other P neurons; continuation nodes and junction nodes use separate input equations. Parameters are α = 0.2, γ = 0.1, and λ = 10.0. The intrinsic pathway strengths E_{i,j} decrease gradually from the center of a P-map, and also have small random variations for symmetry breaking. Feedback suppression from continuation nodes and from junction nodes is a sum of the output signals of P neurons, modulated by a unit-height gaussian that falls off as a function of distance in absolute (x, y) coordinates.

3.2 Coding of Local Form Attributes. Each node has many FA neurons, each of which is receptive to a different conjunction of complex and end-stopped cells, and so represents a different precise form attribute. Each FA neuron applies a template to the input pattern made up of signals from complex and end-stopped cells. The FA neuron whose template best matches the input pattern inhibits all other FA neurons. The FA neurons thus function as a lookup table. A future modification of the model could be to code form attributes in a more distributed fashion, which would aid processing of noisy, real-world images. Each FA neuron can "apply" its template anywhere within a large spatial extent, because it has a large dendritic field that receives several spatially separated input copies. An FA neuron responds only where the node is dynamically allocated, however, because its dendritic field is gated by the active P neuron, as illustrated in Figure 3b. Figure 4a shows the feature abstraction result based on the feature extraction in Figure 2. Nodes are illustrated in their spatially allocated positions. Junction nodes are depicted with circles, with icons showing their coded form attributes. Continuation nodes only code boundary orientation, so they are depicted with oriented bars.
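A toy sketch of the outcome that the two winner-take-all processes of Section 3.1 enforce, namely a one-to-one assignment of nodes to the strongest salient locations; the model itself uses competitive shunting dynamics, not this greedy procedure, and all names here are ours:

```python
# Toy illustration (not the paper's dynamics): the two winner-take-all
# competitions jointly enforce a one-to-one mapping in which at most one
# node is bound to each salient location. A greedy stand-in for that
# outcome, assigning the strongest locations first:
def allocate(nodes, salient_points):
    """nodes: list of node ids; salient_points: list of ((x, y), strength)."""
    assignment = {}
    for point, _strength in sorted(salient_points, key=lambda p: -p[1]):
        if len(assignment) == len(nodes):
            break  # no free nodes left
        assignment[nodes[len(assignment)]] = point
    return assignment

pts = [((5, 5), 0.9), ((20, 7), 0.4), ((12, 30), 0.7)]
print(allocate(["n0", "n1"], pts))  # strongest two locations claimed
```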
Figure 4: (a) Feature abstraction example given the feature extraction result shown in Figure 2 as input. Junction nodes are depicted by circles with icons showing the coded form attribute. Continuation nodes are depicted by oriented bars. Nodes are shown in their bound positions. (b) Linking of nodes b
4 Linking Form Attributes
Nodes are linked together based on their form attribute codes and spatial relations. A node's chosen FA neuron determines the directions in which it can group. Each node has 2K G neurons that code the grouping strength in each of the possible 2K directions. Grouping strength in each direction is determined by modulating the featural signals of the complex and end-stopped cells by the magnitudes of the template elements of the chosen FA neuron. The complex and end-stopped cell activations are first passed through the compressive function f_G(x) = x/(0.1 + x) in order to make their strengths more equal. In addition, if the chosen FA neuron suggests an amodal grouping possibility (i.e., the neuron codes a T-junction or a termination), then the grouping strength in the direction of the "stem" is copied into the G neuron in the opposite, amodal direction. Arcs code the pairwise links between nodes. Arcs are composed of neurons coding the distance and angle between nodes, as well as neurons coding the strength and direction of internode linking. As soon as the two nodes are spatially allocated, the spatial relation between the nodes is recovered. The two nodes' chosen P neurons activate, via second-order (multiplicative) connections, distance and angle neurons of the internode arc, so that the angle θ and distance d between the nodes' loci are represented, as shown in Figure 5a. In our simulations, the activation of distance and angle neurons was approximated by directly calculating the distance and angle between nodes based on their maximally active P neurons, P_{g,h,i,j} and P_{g',h',i',j'},
Dist(g, h, g', h') = √[(x − x')² + (y − y')²]   (4.1)

Angle(g, h, g', h') = (K/π) arctan(y − y', x − x')   (4.2)
where, again, the variables x, y and g, h, i, j are related by (3.1).

4.1 Initial Estimate of Linking Strength. Once an arc codes the spatial relation between its two nodes, the strength of their link is initially estimated based on the amount of agreement between the grouping directions of the two nodes. This is determined by excitatory input to the arc's linking (L) neurons, each of which codes the link in a particular direction. First, the distance and angle neurons excitatorily gate L neurons for appropriate possible linking directions, thus forming a dynamic template (Fig. 5b). Next, each node sends excitatory grouping signals, based on its coded form attributes, to the L neurons. The degree to which these signals match the excitatory gating of the L neurons determines the strength of the net input signal (Figure 6a). Note that the nodes in Figure 6a can potentially link vertically and horizontally based on their form attributes, yet the dynamic template in the L neurons (indicated by shading) allows only horizontal linking.
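Equations (4.1)-(4.2) can be sketched directly; the names are ours, and the two-argument arctan of the text is taken to be atan2:

```python
import math

K = 12  # number of orientations, so angles are quantized in units of pi/K

def dist(p, q):
    """Euclidean distance between two bound positions, equation (4.1)."""
    (x, y), (xp, yp) = p, q
    return math.hypot(x - xp, y - yp)

def angle(p, q):
    """Relative angle in direction units, equation (4.2): (K/pi) arctan(y-y', x-x')."""
    (x, y), (xp, yp) = p, q
    return (K / math.pi) * math.atan2(y - yp, x - xp)

print(dist((3, 4), (0, 0)))  # -> 5.0
```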
Figure 5: (a) The spatial location coded by each node's active P neuron is combined, via second-order (multiplicative) connections, with that of the other node to activate distance and angle neurons of the connecting arc that specify the spatial relation between the nodes. The spatial relation consists of the distance d and angle θ between the nodes, as shown to the right. (b) Linking (L) neurons are gated by active distance and angle neurons, forming a dynamic template of possible linking directions, given the spatial relation of the nodes' loci, as shown to the right.
Linking is based on a measure of good continuation between the boundary inducers, using constraints similar to those of many previous models (Grossberg and Mingolla 1985; Parent and Zucker 1989; Kellman and
Figure 6: (a) Nodes send grouping signals based on their coded form attributes. If these signals match with the dynamic template established at the L neurons (indicated by shading), then the L neurons are activated. Thus, the nodes depicted above link horizontally but not vertically. (b) Modal grouping signals of continuation and junction nodes are colinear with spatial fuzziness. Amodal grouping signals only come from junction nodes coding T-junctions or terminations. These signals have an added long-range cocircular component.
Shipley 1991; Hummel and Biederman 1992). The initial estimate of linking strength in direction k is the input, I_{g,h,g',h',k}, to an L neuron, L_{g,h,g',h',k}. The magnitude of this input is determined by the angle and distance between the two nodes, indexed by (g, h) and (g', h'), as well as by the nodes' grouping signals. The nodes' grouping signals in opposite directions {k} and {(k + K) mod 2K} are multiplied together, and the result is modulated by a term with gaussian falloff as a function of the distance between the nodes, and of the difference between the relative angle of the nodes and the grouping angle k:

I_{g,h,g',h',k} = G_{g,h,k} G_{g',h',(k+K) mod 2K} exp{−[Dist(g, h, g', h')]²/(2σ_d²) − (Θ_diff[k, Angle(g, h, g', h')])²/(2σ_θ²)}   (4.3)
The function Θ_diff returns the absolute difference between angles (ranging from 0 to K − 1). The parameters are σ_d = 12.5 and σ_θ = 3.0. The form of this short-range grouping is illustrated in Figure 6b, which shows spatially "fuzzy" colinear grouping signals of continuation and junction nodes. In addition, if the two nodes have amodal grouping signals, then an additional amodal component is added. The amodal component is determined the same way as in equation (4.3), except that the grouping directions of the two nodes, k' and k'', are cocircularly related:

k'' = {k' + K + 2 Θ_diff[k, Angle(g, h, g', h')]} mod 2K.   (4.4)
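The short-range linking input of equation (4.3) can be sketched numerically; the parameter values σ_d = 12.5 and σ_θ = 3.0 are from the text, while the function and variable names, and the exact folding of Θ_diff, are our reading:

```python
import math

K = 12                         # orientations; 2K grouping directions
SIGMA_D, SIGMA_TH = 12.5, 3.0  # short-range (modal) falloff parameters from the text

def theta_diff(a, b):
    """Absolute circular difference between direction indices (our reading of Theta_diff)."""
    d = abs(a - b) % (2 * K)
    return min(d, 2 * K - d)

def linking_input(G1, G2, d, rel_angle, k):
    """Input I to linking neuron k, per our reading of (4.3): the two nodes'
    grouping signals in opposite directions, with gaussian falloff in
    distance and in the mismatch between k and the relative angle."""
    return (G1[k] * G2[(k + K) % (2 * K)]
            * math.exp(-d ** 2 / (2 * SIGMA_D ** 2)
                       - theta_diff(k, rel_angle) ** 2 / (2 * SIGMA_TH ** 2)))

# Two nodes 10 lattice units apart along direction k = 0, each grouping
# toward the other (opposite direction indices).
G1 = [0.0] * (2 * K); G2 = [0.0] * (2 * K)
G1[0] = 1.0; G2[K] = 1.0
print(round(linking_input(G1, G2, 10.0, 0, 0), 3))  # -> 0.726
```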
The amodal grouping component is longer-range with less orientational fuzziness, with parameters σ_d = 25.0 and σ_θ = 0.8. The form of this additional amodal component is illustrated in Figure 6b.

4.2 Competition between Links. The initial link estimates of different arcs must compete with each other so that, locally, the most likely links are selected. To appreciate the importance of this, consider two nearby parallel boundaries. Links should exist only along each of these boundaries, yet some initial activation of links between the different boundaries is inevitable. The existence of strong links along each boundary should thus suppress all links between the boundaries. How do two different arcs "know" if they should compete with each other? First, competition is possible only if the two arcs share a common node. Second, arcs compete only if they code links in similar directions. This constraint is enforced by direct competition between the L neurons of different arcs. Activation of an L neuron obeys a competitive shunting equation,
where the neuron is excited by its input and recurrent feedback,

Excite = I_{g,h,g',h',k} + (L_{g,h,g',h',k})²   (4.6)
and is inhibited by similarly oriented L neurons of other arcs that join one of its two nodes, (g, h) or (g', h'). Here, we sum over (g'', h'') ≠ (g, h), (g', h').
Inhibition falls off gaussianly (σ = 1.5) with orientational difference,

Θ̂_diff(k, k') = 2(√(2π)σ)⁻¹ exp{−[Θ_diff(k, k')]²/(2σ²)}   (4.8)
A simulated example of linking, based on the feature abstraction example shown in Figure 4, is shown in Figure 4b (before) and Figure 4c (after) interarc competition.

5 Depth Segmentation

Now that the nodes are linked together, a framework exists for information propagation to refine local 3-D estimates, based on more global information. Much work has been done on the use of local operations for line-labeling and occlusion-based depth segmentation (Rensink 1992; Finkel and Sajda 1992; Grossberg 1994, 1995). A key to this process is proper control of communication between local representations. The GRAF model achieves this control with explicitly coded arcs that bind the nodes into relations with each other. Currently, the only 3-D structure processing performed by the GRAF model is relative depth segmentation based on occlusion relationships, a process similar to that proposed by Finkel and Sajda (1992, 1994) and Sajda and Finkel (1995). The depth segmentation process is illustrative of how other 3-D variables could be recovered, such as surface contiguity, edge slant, and edge convexity. Such line-labeling processes would require explicit coding of L-, Y-, and arrow-junctions, and would use the same gating mechanism that controls depth segmentation. The GRAF model segments boundaries in depth at T-junctions. Communication between a pair of nodes, gated by their connecting arc, enforces the same depth, while communication between two depth representations at a T-junction enforces higher depth for the top bar, and lower depth for the bottom stem. This communication consists of excitatory and inhibitory interactions between coarse-coding D neurons. Each node contains N D neurons, which are excited and inhibited by the D neurons of other nodes it is linked to. Depth signals between two nodes, along their linked direction θ, are gated by the summed activity of L neurons at their connecting arc,
The linked direction θ in (5.1) is determined as follows. Each node has a small discrete set of grouping directions, {θ}, which is determined by the node's chosen FA neuron. The grouping direction in which any two nodes are linked is the θ that is closest to their relative angle,

θ = arg min_{θ' ∈ {θ}} Θ_diff[θ', Angle(g, h, g', h')]   (5.2)
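The arg-min rule of equation (5.2) for picking the linked direction can be sketched as follows; names are ours:

```python
K = 12  # orientations; direction indices run over 2K values

def theta_diff(a, b):
    """Absolute circular difference between direction indices."""
    d = abs(a - b) % (2 * K)
    return min(d, 2 * K - d)

def linked_direction(grouping_dirs, rel_angle):
    """Our reading of (5.2): of a node's discrete grouping directions,
    the linked direction is the one closest to the nodes' relative angle."""
    return min(grouping_dirs, key=lambda t: theta_diff(t, rel_angle))

print(linked_direction([0, 6, 12, 18], 5))  # -> 6
```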
All depth inputs in one linked direction are summed. Inputs consist of on-center excitation of neurons coding similar depths, and off-surround inhibition of neurons coding different depths. The net excitations or inhibitions from the different linking directions are then multiplied together:

dD_{g,h,n}/dt = -α D_{g,h,n} + (1 - D_{g,h,n}) Excite - D_{g,h,n} Inhib   (5.3)

Excite = Π_θ Σ_{g'h'} Σ_{n'=1}^{N} Link^θ_{g'h'} Cen(n - n') D_{g',h',n'}   (5.4)

Inhib = Π_θ Σ_{g'h'} Σ_{n'=1}^{N} Link^θ_{g'h'} Sur(n - n') D_{g',h',n'}   (5.5)

In (5.3)-(5.5), n indexes depth neurons, and θ indexes the grouping directions in which a node has links. The excitatory and inhibitory kernels are

Cen(n) = λ exp[-n²/(2N²σ²)]   (5.6)

Sur(n) = λ {1 - exp[-n²/(2N²σ²)]}   (5.7)
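As a numerical illustration, the excitatory and inhibitory depth kernels of (5.6) and (5.7) can be sketched as follows; this is a minimal sketch that uses the parameter values quoted below (N = 10, λ = 3.0, σ = 0.075), while the helper names `cen` and `sur` are illustrative, not the paper's.

```python
import numpy as np

# Sketch of the depth-interaction kernels (5.6)-(5.7).
# Parameter values follow the text (N = 10, lambda = 3.0, sigma = 0.075);
# the helper names cen/sur are illustrative, not taken from the paper.
N, lam, sigma = 10, 3.0, 0.075

def cen(n):
    # on-center excitation: strongest between neurons coding similar depths
    return lam * np.exp(-n**2 / (2 * N**2 * sigma**2))

def sur(n):
    # off-surround inhibition: complement of the same Gaussian
    return lam * (1.0 - np.exp(-n**2 / (2 * N**2 * sigma**2)))

# A linked neighbor coding the same depth gives pure excitation,
# while a neighbor coding a distant depth gives net inhibition.
same_depth = cen(0) - sur(0)          # = lam: pure excitation
far_depth = cen(N - 1) - sur(N - 1)   # negative: inhibition dominates
```

With these values, `cen` falls off rapidly with depth difference, so nodes linked along a boundary pull each other toward the same coarse-coded depth.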
Finally, if a node codes a T-junction, then two sets of D neurons are used (one for the top bar, and the other for the stem of the T). In this case, cross inhibition between the two sets of D neurons is added to the Inhib term in (5.5) to push the top bar to a higher depth, and the bottom stem to a lower depth, where

Up(n) = 1 + n/(ε|n|),   Down(n) = 1 - n/(ε|n|)   (5.9)

and D' denotes the other set of depth neurons at the same node. Parameters are N = 10, α = 0.1, β = 0.1, λ = 3.0, σ = 0.075, ε = 0.5. A simulated example of depth segmentation, given the final linked representation depicted in Figure 4c, is shown in Figure 4d, which shows junction nodes (points) and locally maximum arcs (lines) in a 3-D plot, in which the ordinate represents depth as coarse coded by the D neurons.
Neural Network for Dynamic Binding
6 Simulations
6.1 Grouping across Gaps. Rensink and Enns (1995) used visual search tasks to show that an apparent length illusion induced by Muller-Lyer stimuli is obtained rapidly. Thus, if a target Muller-Lyer figure is "wings-out" and the distractor figure is "wings-in," then search for the wings-out target among wings-in distractors is slow only if the entire figures, including the wings, are of the same length. They used this result to explore conditions under which rapid grouping across gaps binds contour fragments together. Figure 7a (left) shows example target/distractor pairs, in which the target is above the distractor, adapted from Rensink and Enns (1995). In this example, the entire distractor wings-in figure is shorter than the target wings-out figure, and each figure contains one or more gaps. In their experiment, if the gap was in the middle of the connecting bar, as shown in the top two cases, then search was equally fast, regardless of whether the gap was small or large. This result indicates that the pieces were bound together across the gaps. The GRAF model produces consistent results, shown to the right of the target/distractor pairs, in which the figures are linked across the gaps. On the other hand, if two gaps were placed so that the Y-junctions became L-junctions, as in the bottom two cases, then search was slow regardless of whether the gaps were large or small. This result indicates that the pieces were not bound together across the gaps. The GRAF model produces consistent results, in which the pieces are not linked across the gaps. Figure 7b shows variations involving a center gap. Here, the overall target and distractor figures are of the same length, so fast search indicates a lack of binding of the pieces, and slow search indicates binding of the pieces; see Rensink and Enns (1995) for details.
If the right-hand pieces were shifted vertically, so that cocircular interpolation of the segments across the gap was removed (top two cases), then binding took place only if the gap was small (as indicated by the search results). If both segments were bent so that cocircular interpolation still joined them (bottom two cases), the pieces were bound whether the gap was small or large. The GRAF linking results are consistent with these experimental results, as shown in Figure 7b. The GRAF model is able to produce the above results because (1) it employs spatial grouping rules that embody the principle of good continuation, and (2) linking is attempted only in directions that are appropriate given the explicitly coded line-junctions. Other vision models that employ the principle of good continuation may also be able to account for many of the same results (Grossberg and Mingolla 1985; Kellman and Shipley 1991; Hummel and Biederman 1992).
Figure 7: Simulation results for Muller-Lyer stimuli, adapted from Rensink and Enns (1995), are shown. Input stimuli are shown on the left of each column, and the linked GRAF representation is shown on the right, with junction nodes (points), and locally maximum arcs (lines) plotted. Linking across gaps obtained by the GRAF model is consistent in all of these cases with experimental results.

6.2 Amodal Completion and Depth Segmentation. The examples in Figure 4 show the GRAF model's "recovery" from occlusion, which consists of amodal completion and depth segmentation. Enns and Rensink (1994) showed that introducing small gaps between the occluding and occluded objects causes dramatic changes in visual search, presumably because the two segments of the occluded object are no longer bound together. Figure 8 illustrates this finding, where occlusion in Figure 8a results in amodal linking of the occluded object and subsequent depth segmentation of the objects. In Figure 8b, on the other hand, small gaps separating the objects result in no amodal linking and no depth segmentation. Although the GRAF model uses no surface representation, it still
demonstrates some rudimentary intelligence in its amodal linking, using only boundary information. Figure 8c shows an input image similar to that of Figure 8a, except that the right-hand segment is shifted vertically by half its height. The result is that before competition in the linking stage, three different amodal links are equally strong (middle). Due to interarc competition, however, the two correct links are chosen (right). The depth segmentation examples so far have been globally consistent. Given a scene that is globally inconsistent (assuming planar objects with no 3-D slant), the GRAF model finds the best compromise between local cues for different-depth at T-junctions, and same-depth along boundaries, producing figures bent in depth in a spline-like way (Fig. 8d). A complete model for amodal completion and depth from occlusion would require explicit representation of surfaces (Nakayama and Shimojo 1990). Future extensions to the GRAF model might represent each surface with a node in a ”surface” graph.
7 Discussion

7.1 Comparison to Other Models. Unlike the GRAF model, other vision models, such as Finkel and Sajda's model of intermediate-level visual representations (Finkel and Sajda 1992, 1994; Sajda and Finkel 1995), Hummel and Biederman's dynamic binding model (Hummel and Biederman 1992), and Grossberg's FACADE theory of 3-D vision (Grossberg 1994, 1995), bind their representations within a 2-D lattice representation. These models also differ from each other in how they establish their bindings. The models of Finkel and Sajda and Hummel and Biederman use temporal codes in their boundary representations. Grossberg's model primarily uses filling-in of surface representations within boundary compartments. As we have shown, the GRAF model uses a novel spatial binding mechanism that encodes explicit links between local form attributes. We believe that a primary criterion for evaluating vision models is to determine how well their binding mechanisms work.
7.2 Psychophysical Data. What is the psychophysical evidence that sophisticated form representations are obtained in a purely bottom-up ”preattentive” manner, as suggested by the GRAF model? Recent experiments using the visual search paradigm have shown that representations of 3-D structure are rapidly obtained, which requires integrating complex information from localized regions, such as line junctions, across objects (Enns and Rensink 1991). In addition, rapid grouping of disconnected figures is obtained where the grouping is again dependent on complex information at localized regions (Enns and Rensink 1994; Rensink and Enns 1995). Therefore, rapid, presumably preattentive, visual processes appear to possess sophisticated grouping and depth recovery capabilities.
Figure 8: Simulation results demonstrating amodal completion and depth segmentation are shown. The left column shows input images. (a) Amodal linking of occluded bar (middle), and depth segmentation (right) is shown. (b) No amodal linking (middle), or depth segmentation (right), occurs if there are gaps between the segments. (c) If the right-hand segment is shifted vertically by half its height, then initial amodal links are inconsistent (middle), but interarc competition produces correct amodal linking (right). (d) Depth segmentation bends in a spline-like way to satisfy local depth constraints for a globally inconsistent figure.
7.3 Physiological Data. What is the physiological support for the GRAF model? The model predicts that extrastriate cells should be found with classical receptive fields much larger than the size of their optimal stimulus. These cells should, at any one time, respond to only a portion of their receptive field, ignoring the rest. Many extrastriate cells in areas
V2 and V4 have been found with classical receptive fields much larger than the size of their optimal stimuli (Desimone and Schein 1987; Hubel and Livingstone 1985). Many V4 receptive fields can apparently be restricted to a subregion that corresponds to an attended location (Moran and Desimone 1985). Models to explain these data posit attentional gating of receptive fields (Van Essen and Anderson 1990; Desimone 1992), or a feature-based suppression mechanism (Desimone 1992). The GRAF model predicts that stimulus-driven receptive field restriction also occurs independently of focused attention. Another prediction is that many of the cells with these receptive field properties should be receptive to complex form features, such as line junctions.

7.4 Summary and Conclusions. With its novel binding architecture, the GRAF model provides a new interpretation for the role of large dendritic fields and fan-in/fan-out of connections besides mere filtering or feature detection. The main innovations of the GRAF model are thus (1) dynamically binding nodes to locations to explicitly code form attributes, and (2) competing between links that bind form attributes into explicit relationships to provide a framework for depth segmentation and line-labeling processes. These processes can indirectly aid the recovery of 3-D shape, and, ultimately, object recognition, by providing good estimates of several 3-D variables. Additionally, the model's graph representation may directly aid the recovery of 3-D shape through a graph matching process, such as that shown in Dickinson et al. (1992).
Acknowledgments

Supported in part by AFOSR (F49620-92-J-0225), ARPA (N00014-92-J-4015), NSF (IRI-90-24877 and IN-90-00530), and ONR (N00014-91-J-4100). The author wishes to thank three anonymous reviewers for their many helpful comments, and Steve Grossberg and Ennio Mingolla for their helpful comments on an earlier version of the paper.
References

Barlow, H. B. 1981. Critical limiting factors in the design of the eye and visual cortex. Proc. R. Soc. London B 212, 1-34.
Bergen, J. R. 1991. Theories of visual texture perception. In Spatial Vision, D. M. Regan, ed., pp. 114-134. Macmillan, New York.
Biederman, I. 1987. Recognition-by-components: A theory of human image understanding. Psychol. Rev. 94, 115-147.
Desimone, R. 1992. Neural circuits for visual attention in the primate brain. In Neural Networks for Vision and Image Processing, G. A. Carpenter, ed., pp. 343-364. MIT Press/Bradford Books, Cambridge, MA.
Desimone, R., and Schein, S. J. 1987. Visual properties of neurons in area V4 of the macaque: Sensitivity to stimulus form. J. Neurophysiol. 57, 835-868.
Dickinson, S. J., Pentland, A. P., and Rosenfeld, A. 1992. 3-D shape recovery using distributed aspect matching. IEEE Trans. PAMI 14, 174-198.
Donnelly, N., Humphreys, G., and Riddoch, M. 1991. Parallel computation of primitive shape descriptions. J. Exp. Psychol. 17, 561-570.
Engel, A., Konig, P., Kreiter, A., Schillen, T., and Singer, W. 1992. Temporal coding in the visual cortex: New vistas on integration in the nervous system. Trends Neurosci. 15, 218-226.
Enns, J. T., and Rensink, R. A. 1991. Preattentive recovery of three-dimensional orientation from line drawings. Psychol. Rev. 98, 335-351.
Enns, J. T., and Rensink, R. A. 1994. An object completion process in early vision. In Visual Search III, A. Gale, ed. Taylor & Francis, London.
Feldman, J. A., and Ballard, D. H. 1982. Connectionist models and their properties. Cog. Sci. 6, 205-254.
Finkel, L., and Sajda, P. 1992. Object discrimination based on depth-from-occlusion. Neural Comp. 4, 901-921.
Finkel, L., and Sajda, P. 1994. Constructing visual perception. Am. Sci. 82, 223-237.
Grossberg, S. 1973. Contour enhancement, short term memory, and constancies in reverberating neural networks. Studies Appl. Math. 52, 213-257.
Grossberg, S. 1994. 3-D vision and figure-ground separation by visual cortex. Percept. Psychophys. 55(1), 48-120.
Grossberg, S. 1995. Cortical dynamics of 3-D figure-ground and amodal completion: The tradeoff between geometry and contrast. Tech. Rep. CAS/CNS-95-013, Department of Cognitive and Neural Systems, Boston University, Boston, MA.
Grossberg, S., and Mingolla, E. 1985. Neural dynamics of perceptual grouping: Textures, boundaries, and emergent segmentations. Percept. Psychophys. 38, 141-171.
Heitger, F., Rosenthaler, L., von der Heydt, R., Peterhans, E., and Kubler, O. 1992. Simulation of neural contour mechanisms: From simple to end-stopped cells. Vision Res. 32(5), 963-981.
Hinton, G. E., McClelland, J. L., and Rumelhart, D. E. 1986. Distributed representations. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, eds., pp. 77-109. MIT Press/Bradford Books, Cambridge, MA.
Hubel, D. H., and Livingstone, M. S. 1985. Complex-unoriented cells in a subregion of primate area 18. Nature (London) 315, 325-327.
Hummel, J. E., and Biederman, I. 1992. Dynamic binding in a neural network for shape recognition. Psychol. Rev. 99, 480-517.
Hummel, R. A., and Zucker, S. W. 1983. On the foundations of relaxation labeling processes. IEEE PAMI 5, 267-287.
Kellman, P. J., and Shipley, T. F. 1991. A theory of visual interpolation in object perception. Cog. Psychol. 23, 141-221.
Lowe, D. 1987a. Three-dimensional object recognition from single two-dimensional images. Artificial Intelligence 31, 355-395.
Lowe, D. 1987b. The viewpoint consistency constraint. Int. J. Computer Vision 1, 57-72.
Mackworth, A. 1976. Model-driven interpretation in intelligent vision systems. Perception 5, 349-370.
Malik, J. 1987. Interpreting line drawings of curved objects. Int. J. Computer Vision 1, 73-103.
Mohan, R., and Nevatia, R. 1992. Perceptual organization for scene segmentation and description. IEEE PAMI 14, 616-635.
Moran, J., and Desimone, R. 1985. Selective attention gates visual processing in the extrastriate cortex. Science 229, 782-784.
Morrone, M., and Burr, D. 1988. Feature detection in human vision: A phase-dependent energy model. Proc. R. Soc. London B 235, 221-245.
Nakayama, K., and Shimojo, S. 1990. Toward a neural understanding of visual surface representation. Cold Spring Harbor Symp. Quant. Biol. 40, 911-924.
Parent, P., and Zucker, S. W. 1989. Trace inference, curvature consistency, and curve detection. IEEE PAMI 11, 823-839.
Rensink, R. A. 1992. The rapid recovery of three-dimensional orientation from line drawings. Ph.D. Thesis (also Tech. Rep. 92-25), Department of Computer Science, University of British Columbia, Vancouver, BC, Canada.
Rensink, R. A., and Enns, J. T. 1995. Pre-emption effects in visual search: Evidence for low-level grouping. Psychol. Rev. 102, 101-130.
Sajda, P., and Finkel, L. 1995. Intermediate-level visual representations and the construction of surface perception. J. Cog. Neurosci. 7, 267-291.
Treisman, A. 1985. Preattentive processing in vision. Comput. Vision Graphics Image Process. 31, 156-177.
Van Essen, D. C., and Anderson, C. H. 1990. Information processing strategies and pathways in the primate retina and visual cortex. In An Introduction to Neural and Electronic Networks, S. F. Zornetzer and L. Davis, eds., pp. 43-72. Academic Press, San Diego.
Wang, D., Buhmann, J., and von der Malsburg, C. 1990. Pattern segmentation in associative memory. Neural Comp. 2, 94-106.
Williamson, J. 1993. Dynamic binding of visual contours without temporal coding. Proc. WCNN-93, 1, 97-100.
Received November 7, 1994; accepted February 2, 1996.
Communicated by William Lytton
Neuronal-Based Synaptic Compensation: A Computational Study in Alzheimer's Disease

David Horn, Nir Levy
School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel-Aviv University, Tel Aviv 69978, Israel

Eytan Ruppin
Departments of Computer Science and Physiology, Tel-Aviv University, Tel Aviv 69978, Israel
In the framework of an associative memory model, we study the interplay between synaptic deletion and compensation, and memory deterioration, a clinical hallmark of Alzheimer's disease. Our study is motivated by experimental evidence that there are regulatory mechanisms that take part in the homeostasis of neuronal activity and act on the neuronal level. We show that following synaptic deletion, synaptic compensation can be carried out efficiently by a local, dynamic mechanism, where each neuron maintains the profile of its incoming postsynaptic current. Our results open up the possibility that the primary factor in the pathogenesis of cognitive deficiencies in Alzheimer's disease (AD) is the failure of local neuronal regulatory mechanisms. Allowing for neuronal death, we observe two pathological routes in AD, leading to different correlations between the levels of structural damage and functional decline.

1 Introduction
Alzheimer's disease (AD) is characterized by progressive deterioration of the patient's cognitive and social capabilities. Recent investigations have shown that in addition to the traditionally described plaques and tangles found in the AD brain, this disease is characterized by considerable synaptic pathology. There is significant synaptic loss in various cortical regions, termed synaptic deletion, accompanied by synaptic compensation, an increase of the synaptic size reflecting a functional compensatory increase of synaptic efficacy (Bertoni-Freddari et al. 1990; DeKosky and Scheff 1990; Scheff et al. 1993). The combined outcome of these counteracting synaptic degenerative and compensatory processes can be evaluated by measuring the total synaptic area per unit volume (TSA), Neural Computation 8, 1227-1243 (1996) © 1996 Massachusetts Institute of Technology
which is initially preserved but decreases as the disease progresses. The TSA has been shown to strongly correlate with the cognitive function of AD patients (DeKosky and Scheff 1990; Terry et al. 1991; Masliah et al. 1994; Masliah and Terry 1994), pointing to the important role that pathological synaptic changes play in the cognitive deterioration of AD patients. In this paper, we further develop our previous studies of the functional effects of these synaptic changes, and discuss their relation to more traditional neuropathological markers of AD. Since memory deterioration is a clinical hallmark of AD, it is natural to investigate the effects of synaptic deletion and compensation on the performance of an associative memory neural model. Motivated by the findings of synaptic pathology in AD, we previously studied (Horn et al. 1993; Ruppin and Reggia 1995) ways of modifying the remaining synapses of an associative memory network undergoing synaptic deletion, such that its performance will be maintained as much as possible. To this end, we used a biologically motivated variant of a Hopfield-like attractor neural network (Tsodyks and Feigel'man 1988): M memory patterns are stored in an N-neuron network via a Hebbian synaptic matrix, forming fixed points of the network dynamics. The synaptic efficacy J_ij between the jth (presynaptic) neuron and the ith (postsynaptic) neuron in this network is

J_ij = (1/N) Σ_{μ=1}^{M} (η_i^μ - p)(η_j^μ - p)   (1.1)

where η_i^μ are (0,1) binary variables representing the stored memories and
p is the activity level of each memory. The updating rule for the state V_i of the ith neuron is given by

V_i(t+1) = Θ( Σ_{j=1}^{N} J_ij V_j(t) - T )   (1.2)

where Θ is the step function and T, the threshold, is set to its optimal value T = ½ p(1-p)(1-2p). In the intact network, when memory retrieval is modeled by presenting an input cue that is sufficiently similar to one of the memory patterns, the network flows to a stable state identical with that memory. Performance of the network is defined by the average recall of all memories. The latter is measured by the overlap m^μ, which denotes the similarity between the final state V to which the network converges and the memory pattern η^μ that is cued in each trial (see Horn et al. 1993). Synaptic deletion has been carried out by randomly removing a fraction d of all synaptic weights, such that a fraction w = 1 - d of the synapses remains. Synaptic compensation was modeled by multiplying all remaining synaptic weights by a common factor c. Varying c as a function of d specifies a compensation strategy. Correlating synaptic size with synaptic strength, we have interpreted TSA in the model as proportional to c·w. In this framework we have previously shown that maintaining
the premorbid levels of TSA (that is, employing c = 1/w) constitutes an optimal compensation strategy, which maximally preserves performance as synapses are deleted. However, such uniform, global strategies suffer from two fundamental drawbacks:
1. They can work only when neurons in the network undergo a similar, uniform process of synaptic deletion. Otherwise, it is advantageous for the different neurons, which undergo different levels of synaptic deletion, to develop their own suitable compensation factors.
2. While global compensation mechanisms may be carried out in biological networks via the actions of neuromodulators, their biological realization remains problematic, since it requires explicit knowledge of the ongoing level of synaptic deletion in the whole network.
In this paper we present a solution to these problems by showing that synaptic compensation can be performed successfully by local mechanisms: a fraction d_i of the input synapses to each neuron i is deleted, and is compensated for by a factor c_i that each neuron adjusts individually. This is equivalent to performing the replacement J_ij → c_i w_ij J_ij, where w_ij is either 0 or 1, and w_i = 1 - d_i = Σ_j w_ij / N. Our method is based on the neuron's postsynaptic potential h_i, and does not require explicit knowledge of either global or local levels of synaptic deletion. The local compensatory factor c_i develops dynamically so as to keep the membrane potential and neural activity at their original, premorbid levels. The proposed neuronal activity-dependent compensation modifies all the synapses of the neuron concomitantly, in a similar manner, and thus differs fundamentally from conventional Hebbian synaptic activity-dependent modification paradigms like long-term potentiation and long-term depression, which modify each synapse individually.
Our proposal is that while synaptic activity-dependent modification plays a central role in memory storage and learning, neuronal-level synaptic modifications serve to maintain the functional integrity of memory retrieval in the network. Several biological mechanisms may take part in neural-level synaptic modifications that self-regulate neuronal activity (see van Ooyen 1994 for an extensive review). These include receptor up-regulation and down-regulation (Turrigiano et al. 1994), activity-dependent regulation of membranal ion channels (Armstrong and Montminy 1993; Abbott et al. 1994), and activity-dependent structural changes that reversibly enhance or suppress neuritic outgrowth (Mattson and Kater 1989; Schilling et al. 1991; Grumbacher-Reinert and Nicholls 1992). Interestingly, while neurotransmitters' application may act in isolation on individual dendrites, membrane depolarization simultaneously regulates the size of all growth cones and neurites of a given neuron (Stuart and Sakmann 1994). Taken together, these findings testify that there exist feedback mechanisms that act on the neuronal level, possibly via the expression of immediate early
genes (Morgan and Curran 1991), to ensure the homeostasis of neuronal activity. These mechanisms act on a slow time scale and are active also in the normal adult brain. These biological data have recently triggered the computational study of feedback regulation of neuronal dynamics (Abbott and LeMasson 1993; LeMasson et al. 1993), and activity-dependent network development (van Ooyen and van Pelt 1994). In this paper, we extend the study of neuronal activity regulation to investigate the role of local compensation mechanisms in the pathogenesis of AD. We raise the possibility that synaptic compensatory mechanisms that, in normal aging, succeed in preserving a considerable level of cognitive functioning are disrupted in AD. To study this further, we concentrate on synaptic weight modification, where each weight is taken to represent the functional efficacy of the synapse, i.e., its size and the activity of related receptors and ion channels.¹ In the next section, we present local compensation algorithms in two classes of associative memory models. First, in the framework of the Tsodyks-Feigel'man (TF) model, where we have previously studied global strategies, and then in the framework of the Willshaw model (Willshaw et al. 1969). These models are representatives of two fundamentally different classes of associative networks, differing in the characteristics of the neurons' mean postsynaptic potential and the level of competitiveness in the network. This distinction has important biological ramifications, but as the pertaining experimental data are currently insufficient to decide which class is biologically more plausible, we study local synaptic compensation in both frameworks. Computer simulations, and their possible clinical implications, are presented in Section 3. The synaptic changes are obviously only part of a complex and interrelated set of neuropathological changes that take place in AD.
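The TF model described above, with its Hebbian storage rule (1.1), thresholded updates (1.2), and random deletion followed by uniform compensation, can be put together in a short simulation. This is a minimal sketch: the network size, deletion level, and recall criterion are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, p = 300, 10, 0.1
T = 0.5 * p * (1 - p) * (1 - 2 * p)      # optimal threshold from the text

# Hebbian storage (1.1) of M sparse (0,1) memory patterns
eta = (rng.random((M, N)) < p).astype(float)
J = (eta - p).T @ (eta - p) / N
np.fill_diagonal(J, 0.0)

def retrieve(J, V, steps=10):
    # synchronous application of the update rule (1.2)
    for _ in range(steps):
        V = (J @ V - T > 0).astype(float)
    return V

# Delete a fraction d of all synapses, then compensate the survivors
# uniformly by c = 1/w (TSA conservation), as in the global strategy.
d = 0.3
mask = (rng.random((N, N)) >= d).astype(float)
J_lesioned = mask * J / (1.0 - d)

# Cue memory 0 with roughly 10% of its bits flipped; the lesioned but
# compensated network should still recall it almost perfectly.
cue = eta[0].copy()
flip = rng.random(N) < 0.1
cue[flip] = 1.0 - cue[flip]
recall = retrieve(J_lesioned, cue)
accuracy = (recall == eta[0]).mean()     # fraction of correctly recalled bits
```

At this loading (M/N = 0.033) the signal term dominates the crosstalk noise, so recall survives 30% deletion once the uniform compensation restores the mean field.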
In Section 4 we briefly discuss these alterations and their relations to the synaptic processes we model. Finally, in the last section we summarize our results and their relevance to understanding the pathogenesis and progression of Alzheimer's disease.

2 Locally-driven Synaptic Compensation
2.1 The Tsodyks-Feigel'man Model. Our local compensation method aims at maintaining the premorbid profile of the postsynaptic potential. In our previous work (Horn et al. 1993) it was shown that this profile can be maintained through TSA conservation, i.e., by using the compensation c = 1/w. Guided by this finding, we now set out to implement its local compensation version, c_i = 1/w_i. For this purpose we employ the differential equation

dc_i/dt = κ c_i [1 - ŵ_i c_i]   (2.1)
¹From a strict computational point of view, the synaptic modifications we study are equivalent to variations in the firing threshold of each neuron.
where κ is a rate parameter and ŵ_i is a field-dependent estimate of the local connectivity w_i. This equation is then transformed to a difference equation, which is used in the simulations,

c_i(t + Δt) = c_i(t) + r c_i(t) [1 - ŵ_i c_i(t)]   (2.2)

where r = κΔt. We are looking for an estimate ŵ_i that depends only on information that is available to the single neuron. We propose using moments of the neuronal input field (postsynaptic potential) h_i, after averaging over a set of retrieval trials, and comparing them with their values in the normal, premorbid state. From a biological perspective, such knowledge and computational algorithms may be prewired in the neuronal regulatory mechanisms reviewed in the previous section, which are responsible for homeostasis of neural activity. There are two possible measurements of the field h_i, either under conditions of random noise input or through a set of trials of memory retrieval from the existing repertoire of memories.² In the Tsodyks-Feigel'man model, the first moment of the field vanishes, ⟨h_i⟩ = 0, both for random inputs and memories. So we turn to the second moment, which can be calculated for random noise initial conditions in the premorbid state
When using a set of memories instead of random noise we obtain a different expression that separates into signal and noise terms with different power dependence on deletion,

⟨h_i²(ŵ_i)⟩ = c_i² ŵ_i² ⟨S_i²⟩ + c_i² ŵ_i ⟨R_i²⟩   (2.4)
Here ⟨S_i²⟩ is the signal term in the premorbid state (w = 1) and ⟨R_i²⟩ is the same noise term as in 2.3. Given these two equations one can solve for ŵ_i, either by using noise alone, or by using trials of memory retrieval and relying on the separate knowledge of the premorbid magnitude of the signal and noise terms. To perform local synaptic compensation in our simulations, we proceed in small steps of deletion Δd. At each deletion step the network is presented with all (slightly corrupted) memories and allowed to converge to its fixed points. By averaging the field strengths measured over all memory retrieval trials, we calculate ŵ_i via equation 2.4. Thereafter, synaptic compensation via algorithm 2.2 is applied. The resulting performance is evaluated by presenting all memory cues again. As demonstrated in Figure 1, dynamic local compensation via algorithm 2.4 works as well as or even better than the local TSA-conserving compensation c_i = 1/w_i.
²One may speculate that the biological realization of such field measurements occurs during dreaming.
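The compensation dynamics of equation 2.2 are easy to check numerically: starting from the premorbid value c_i = 1, the factor converges to the TSA-conserving fixed point 1/ŵ_i. The rate and connectivity values below are illustrative, not taken from the paper.

```python
# Numerical sketch of the local compensation update (2.2):
#   c(t + dt) = c(t) + r*c(t)*(1 - w_est*c(t)),
# whose stable fixed point is c = 1/w_est, i.e., local TSA conservation.
# The values of r and w_est are illustrative only.
def compensate(c, w_est, r=0.1, steps=200):
    for _ in range(steps):
        c = c + r * c * (1.0 - w_est * c)
    return c

# A neuron that lost 40% of its input synapses (w_est = 0.6) dynamically
# grows its compensation factor toward 1/0.6.
c_final = compensate(c=1.0, w_est=0.6)
```

Note that the update needs no knowledge of the deletion level itself; only the estimate ŵ_i (here `w_est`), obtained from the field moments, enters the rule.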
Figure 1: Performance versus deletion for a network that runs the local TF-compensation algorithm via 2.2 and 2.4. The result, presented by a solid line, is compared with the performance of the local TSA-conserving method (c_i = 1/w_i, dashed curve) and no compensation (c_i = 1, dot-dashed curve). The simulation parameters were N = 1000, M = 100, p = 0.1, r = 0.25, Δd = 0.01.

2.2 The Willshaw Model. A simpler, and perhaps more biologically plausible, mechanism of local synaptic compensation arises in the Willshaw model (Willshaw et al. 1969), where memory patterns are stored in excitatory synapses through the rule

J_ij = min(1, Σ_{μ=1}^{M} η_i^μ η_j^μ)   (2.5)

The updating rule is similar to equation 1.2 and each neuron has a uniform threshold T smaller than 1. In the Willshaw model, unlike the TF model, spurious states with high activity emerge as deletion proceeds. These deviations of the Willshaw network activity level from its premorbid values rule out the possibility of accurately estimating the connectivity w in a manner analogous to the way equations 2.3 and 2.4 were used in the TF model. However, in the Willshaw model ⟨h_i⟩ ≠ 0, so instead of estimating the connectivity from moments of the field and using it for the compensation algorithm as in equation 2.1, we can now use the changes in the field itself to correct for the effects of synaptic deletion directly,

dc_i/dt = κ c_i [1 - ⟨h_i⟩/⟨h_i⟩₀]   (2.6)

where ⟨h_i⟩₀ is the premorbid mean field.
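A minimal sketch of the Willshaw storage rule (2.5) and a single thresholded update follows. Generating each memory with exactly pN active units and normalizing the threshold by the cue activity are simplifying assumptions of this sketch, not specifications from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, p, T = 200, 20, 0.05, 0.8
k = int(p * N)                      # active units per memory (exactly pN here)

# Willshaw storage rule (2.5): a binary excitatory synapse J_ij is 1
# iff neurons i and j are co-active in at least one stored memory.
eta = np.zeros((M, N))
for mu in range(M):
    eta[mu, rng.permutation(N)[:k]] = 1.0
J = (eta.T @ eta > 0).astype(float)
np.fill_diagonal(J, 0.0)

# One synchronous update with the uniform threshold T < 1, scaled by
# the cue activity (this normalization is an assumption of the sketch).
V = eta[0]
V_next = (J @ V >= T * V.sum()).astype(float)
```

Each unit active in the cued memory receives input from all of its co-active partners, so it clears the threshold, while units outside the memory receive only sparse crosstalk from the other stored patterns.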
When the dynamics remain similar to those of the intact network, this method is close to the TSA-conserving strategy (c_i = 1/w_i). However,
Neuronal-Based Synaptic Compensation
1233
as demonstrated in Figure 2, the new, direct, field-dependent method (the discretized version of equation 2.6) is markedly superior to the TSA-conserving strategy (c_i = 1/w_i) once changes in the level of activity of the network occur; the high-activity spurious stable states that emerge as deletion proceeds are suppressed by using compensation values smaller than those dictated by a TSA-conserving algorithm (which is of course insensitive to the level of activity in the network), resulting in better performance. This simple and efficient algorithm is used in the next section to study the effect of synaptic deletion and compensation rates on network performance, and its relevance to AD progression.

Figure 2: Performance versus deletion for a network that runs the local Willshaw-compensation algorithm 2.6. The result, presented by a solid line, is compared with the performance of the local TSA-conserving method (c_i = 1/w_i, dashed curve) and no compensation (c_i = 1, dot-dashed curve). The simulation parameters were N = 1500, M = 75, p = 0.05, κ = 0.1, Δd = 0.009.

3 Results and Clinical Relevance

3.1 Compensation Rates and AD Progression. In this section we discuss results of simulations of a Willshaw network of N = 1500 neurons in which M = 75 randomly generated memory patterns with activity p = 0.05 are stored with T = 0.8. In every simulation run, a sequence of synaptic deletion and compensation steps is executed, and the performance of the network is traced as deletion progresses. In each simulation step (considered as one time unit) a fraction Δd of the remaining synapses is deleted. Synaptic compensation is performed via the discretized version of algorithm 2.6 by averaging the local field following the presentation of the stored memories. We first study the network's performance at various compensation rates κ, as presented in Figure 3a. The performance level is better maintained if the compensation rate is high. As reviewed in Horn et al. (1993), young and very old AD patients suffer from rapid clinical deterioration characterized by a sharp decline, while the majority of AD patients have a more gradual pattern of decline. These clinical patterns may arise because very old patients have almost no compensation resources (corresponding to the low compensation rates illustrated in the leftmost curve of Fig. 3a), while young patients still have potent synaptic compensation mechanisms (the rightmost curve). Interestingly, studies of reactive synaptogenesis following experimental hippocampal deafferentation lesions in rodents indeed show that the rate of compensatory synaptogenesis decreases as a function of age (Cotman and Anderson 1988, 1989). The dependence of performance on the compensation rate κ for a given deletion level d is demonstrated in Figure 3b; while the performance levels obtained in early stages (d = 0.4) are similar over a broad range of κ values, a more pronounced κ dependence is observed as deletion proceeds (d = 0.8). To examine how the rate of synaptic deletion affects performance, we kept the compensation rate κ constant and varied Δd. The results are basically similar to those displayed in Figure 3a; performance decreases as the rate of deletion increases. Hence, when there is a significant "mismatch" between synaptic deletion and compensation, whether its origin is increased synaptic deletion or decreased compensation rates, the network's performance degrades. In the intermediate range, the network's performance degrades in a fairly gradual manner.
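The discretized, field-dependent compensation step used in these simulations can be read schematically as nudging each compensation factor so that the neuron's measured mean field returns to its premorbid value. Since equation 2.6 is not reproduced above, the following is a schematic reading, with an illustrative update-rate constant kappa and names of ours.

```python
def compensate_field(c, h_measured, h_premorbid, kappa):
    """One discrete compensation step: scale each c_i up when the measured
    mean field has fallen below its premorbid value, and down when spurious
    high activity has pushed it above. Unlike a TSA-conserving rule, this
    update is sensitive to the actual activity level of the network."""
    out = []
    for ci, h, h0 in zip(c, h_measured, h_premorbid):
        out.append(ci * (1.0 + kappa * (h0 - h) / h0))
    return out
```

A neuron whose field has dropped is boosted, while a neuron caught in a high-activity spurious state is damped, which is the behavior credited with suppressing those spurious states in Figure 2.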
Thus, when local compensation is employed, a single compensation mechanism can give rise to a variety of clinically observed patterns of degradation.

3.2 Memory Vacillations. Similar simulations performed within the framework of the TF model yield qualitatively similar results. However, there are two important differences. First, in the TF model performance degradation is homogeneous, that is, all memories have similar retrieval acuity; in the Willshaw model, the retrieval of some memories may decline while others are preserved. Furthermore, while in the TF model a memory pattern that vanishes is lost forever, in the Willshaw model memory patterns that are lost may later be adequately retrieved due to ongoing compensation. Figure 4 traces the temporal evolution of the individual overlaps of four memory patterns in a Willshaw network during deletion and compensation. While some patterns may vanish forever (upper left panel), the retrieval of others may vacillate (upper right and lower left panels), and some may even show a late revival (lower right panel). These results demonstrate that computational studies of brain pathologies may potentially enable us to learn more about the
Figure 3: (a) Performance versus deletion for different compensation rates. κ is increased from left to right, with values 0.01, 0.025, 0.03, 0.05, 0.1; Δd = 0.009. (b) Performance versus κ for fixed d values.
Figure 4: Overlaps of individual patterns. The simulation parameters were Δd = 0.0115 and κ = 0.05.
working of the intact brain; in this example, the TF model and the Willshaw model give rise to inherently different patterns of individual memory decline as AD (and normal aging) progresses. A detailed prospective psychological study of individual long-term memory retrieval is called for.

An intuitive insight into the different behaviors of the TF and Willshaw models may be obtained by noting that in the Willshaw model all foreground neurons belonging to a given memory (those that should fire) have similar input fields, and hence tend to be activated together. In the TF model, on the other hand, the distribution of input fields of such foreground neurons is broad; hence memory patterns can still be partially retrieved even if some foreground neurons become quiescent.

3.3 Two Routes to Dementia. The pathological synaptic changes in Alzheimer's disease are accompanied by the eventual loss of about 10-20% of cortical neurons (Katzman 1986; Masliah 1995). Motivated by developmental studies showing death of hypoactive neurons [see van Ooyen (1994) for a review], and the assumed degeneration of hypoactive neurons in AD (Bowen et al. 1994), we incorporate a neuronal degeneration rule that kills neurons as their input field decreases below a viability-threshold (VT) value. In addition, we set upper bounds on the neuronal compensation factors. Both viability thresholds and compensation bounds may vary within certain limits over the neural population. Figure 5a and b illustrates the differential effect of high versus low neuronal viability thresholds. Obviously, different viability thresholds lead to distinct resiliency of the network to damage. More interesting, they also give rise to different relations between the level of neuronal death and network performance. In general, there are two principal pathological pathways by which performance collapses in the network as deletion proceeds:

1. Synaptic loss, i.e., a strong decrease in the synapse/neuron ratio. This requires low viability thresholds, and may lead to large cognitive deficits with little structural damage.

2. Neuronal loss, which is expected to occur for higher values of the viability thresholds. This will generally cause a faster avalanche of the disease once it starts to take place, and significant neural death.

A qualitatively similar effect may be seen with low versus high compensation rates. Since the primary factor responsible for cortical atrophy in AD is likely to be neural degeneration [synapses occupy a very small fraction of the cortical volume (Bourgeois and Rakic 1993)], the finding that the extent of neural damage depends on the pathological pathway may shed some light on the broad range of cortical atrophy levels observed in AD patients suffering from similar levels of cognitive deterioration (Wippold et al. 1991; Murphy et al. 1993).

4 Plaques, Tangles, and Synaptic Pathology
In addition to the neurodegeneration of the association and limbic cortices, the two main neuropathological alterations that accompany the progression of AD are neuritic plaques (composed mainly of degenerating neurites and amyloid) and neurofibrillary tangles (composed mainly of microtubule-associated tau protein) (Terry et al. 1994). Recent experimental observations offer new insights regarding the relationship between plaques and tangles and synaptic alterations in AD:

- Amyloid plaques: The extracellular deposition of amyloidogenic plaques is likely to play an important role in neural and synaptic degeneration [see Masliah and Terry (1994) and Masliah (1995) for a review]. However, the existence of widespread neuritic dystrophy that is not directly associated with amyloid deposits (Selkoe 1994), and the observation that plaque and tangle formation can account only partially for synaptic pathology, suggest that there is an additional, primary, synaptic pathogenic process in AD (Masliah and Terry 1994; Masliah 1995). This process may result from the
Figure 5: Performance versus time with (a) a low viability threshold (VT = 0.2) and (b) a high viability threshold (VT = 0.5). κ = 0.02 and the rest of the simulation parameters are as in Figure 4. The bottom figures trace the fraction of dead neurons as a function of time. Neuronal damage is traced until complete performance collapse (the patient's "death"), i.e., zero overlap.
dysfunction of several synaptic proteins, including impaired amyloid precursor protein (Roch et al. 1994; Masliah 1995). It remains to be determined whether these synaptic regulation abnormalities manifest themselves primarily in enhanced synaptic degeneration or in attenuated rates of synaptic regeneration and compensation.

- Tangles: The main common underlying pathology for a wide variety of neuropil abnormalities in AD is the accumulation of the microtubule-associated protein tau in neurofibrillary tangles (Perry et al. 1991; McKee et al. 1991). Neuronal-level synaptic compensation, involving early gene expression and protein transcription, probably requires intact cellular transport systems. Hence, disrupted microtubule systems may lead to deficient synaptic compensation. While neurofibrillary pathology is likely to contribute to dementia in AD (Arrigada et al. 1992; Samuel et al. 1994), there is a subgroup of AD patients that shows little neurofibrillary pathology and yet may suffer from considerable cognitive deterioration (Terry et al. 1987). If damaged synaptic compensation indeed arises from neurofibrillary pathology, then the dementia in these no-tangles subgroups is predicted to arise primarily from excessive synaptic pruning (i.e., markedly reduced synaptic density, which may be observed in morphometric studies) in the face of maintained TSA (which can be measured by both morphometric and immunohistochemical techniques).

Plaques and tangles may not only cause synaptic pathology; in turn, synaptic alterations may enhance plaque and tangle formation, yielding a "vicious cycle" of neural damage and death. Aberrant "compensatory" synaptic sprouting may enhance neurodegeneration (Masliah et al. 1991; Cotman et al. 1991) and plaque formation (Masliah et al. 1992). The functional effects of sprouting on network performance are important and interesting topics for future computational studies; interestingly, previous investigations in associative neural models have shown that increased synaptic noise may have beneficial as well as adverse effects (Horn and Ruppin 1995).
In summary, synaptic pathology may be either a direct result of an underlying molecular defect affecting synapses or a secondary result of neural loss and plaque and tangle formation. Given the current state of knowledge of the mechanisms underlying AD pathology, this issue remains open, and our model is not yet rich enough to address it. The development of combined neural/metabolic models to study the interplay between synaptic alterations and more conventional markers of AD pathology is called for. Such models may shed further light on the relative significance of the two routes to dementia delineated in this work.
5 Summary

We have shown that synaptic compensation, a process that appears to play an important role in attenuating the progression of AD, can be achieved in a stable manner via local, activity-driven mechanisms. The biologically motivated mechanisms introduced in this paper act to maintain neuronal homeostasis. Within our model, the variation of a single parameter, the compensation rate, describes the different progression rates of cognitive deterioration observed in AD. Our work points to the possible important role of synaptic compensation failure in the progression of Alzheimer's disease. This failure probably reflects a breakdown of regulatory mechanisms that play a part
in maintaining the functional integrity of the aging, nondemented brain (Buell and Coleman 1979; Flood and Coleman 1986; Bertoni-Freddari et al. 1988, 1990). We have based our compensation model on the observation of increased synapses in aged and demented brains. Considerable support for the functional significance of structural synaptic compensatory changes has been furnished by electrophysiological studies in the aging rodent hippocampus [see Barnes (1994) for a review], indicating that older rats have fewer, but structurally larger and functionally stronger, synapses. Recently, it was also shown that the infusion of nerve growth factor in aged rats causes a significant increase in the TSA per unit volume of cortex, which is correlated with improved cognitive performance (Chen et al. 1995). This gives further support to the main mechanism that we propose.

We have discussed the existence of two pathological routes of damage in AD, a "synaptic" one and a "neural" one. Severe cognitive deterioration may occur via either route, but the neural route leads to considerably more cortical atrophy than the synaptic route, while causing similar levels of cognitive deterioration. We hypothesize that the profile of pathological routes taken in a specific AD patient depends on the distribution of his neuronal viability thresholds.

In addition to the development of combined neural/metabolic models, our work can be extended in a few directions. So far we have studied global and local synaptic compensation methods; other, intermediate synaptic compensation methods may be of interest. The utilization of higher moments of the neuron's input field may enable the realization of more efficient synaptic compensation regimes that can counteract the pathological effects of nonuniform, biased synaptic deletion. We may even envisage using compensation algorithms to build "self-maintaining" associative memory network chips.
Finally, we wish to reiterate the point made in the introduction: our key idea is the existence of a neuronal activity-dependent compensation mechanism. This differs from Hebbian synaptic modification, which plays a central role in memory storage and learning. Our proposal is that neuronal-level synaptic modifications serve to maintain the functional integrity of memory retrieval in the network. Both processes are therefore needed for the proper functioning of the brain.
References
Abbott, L. F., and LeMasson, G. 1993. Analysis of neuron models with dynamically regulated conductances. Neural Comp. 5, 823-842.
Abbott, L. F., Turrigiano, G., LeMasson, G., and Marder, E. 1994. Activity-dependent conductances in model and biological neurons. In Natural and Artificial Parallel Computing: Proceedings of the Fifth Annual NEC Research Symposium, D. Waltz, ed. SIAM, New York.
Armstrong, R. C., and Montminy, M. R. 1993. Transsynaptic control of gene expression. Annu. Rev. Neurosci. 16, 17-29.
Arrigada, P. V., Growdon, J. H., Hedley-Whyte, E. T., and Hyman, B. T. 1992. Neurofibrillary tangles but not senile plaques parallel duration and severity of Alzheimer's disease. Neurology 42, 631-639.
Barnes, C. A. 1994. Normal aging: Regionally specific changes in hippocampal synaptic transmission. TINS 17(1), 13-18.
Bertoni-Freddari, C., Meier-Ruge, W., and Ulrich, J. 1988. Quantitative morphology of synaptic plasticity in the aging brain. Scanning Microsc. 2, 1027-1034.
Bertoni-Freddari, C., Fattoretti, P., Casoli, T., Meier-Ruge, W., and Ulrich, J. 1990. Morphological adaptive response of the synaptic junctional zones in the human dentate gyrus during aging and Alzheimer's disease. Brain Res. 517, 69-75.
Bourgeois, J. P., and Rakic, P. 1993. Changing of synaptic density in the primary visual cortex of the Rhesus monkey from fetal to adult age. J. Neurosci. 13, 2801-2820.
Bowen, D. M., Francis, P. T., Chessell, I. P., and Webster, M. T. 1994. Neurotransmission - the link integrating Alzheimer research? TINS 17(4), 149-150.
Buell, S. J., and Coleman, P. D. 1979. Dendritic growth in the aged human brain and failure of growth in senile dementia. Science 206, 854-856.
Chen, K. S., Masliah, E., Mallory, M., and Gage, F. H. 1995. Synaptic loss in cognitively impaired aged rats is ameliorated by chronic human nerve growth factor infusion. Neuroscience 68(1), 19-27.
Cotman, C. W., and Anderson, K. J. 1988. Synaptic plasticity and functional stabilization in the hippocampal formation: Possible role in Alzheimer's disease. Adv. Neurol. 47, 313-336.
Cotman, C. W., and Anderson, K. J. 1989. Neural plasticity and regeneration. In Basic Neurochemistry: Molecular, Cellular and Medical Aspects, G. J. Siegel et al., eds., pp. 507-522. Raven Press, New York.
Cotman, C. W., Cummings, B. J., and Whitson, J. S. 1991. The role of misdirected plasticity in plaque biogenesis and Alzheimer's disease pathology. In Growth Factors and Alzheimer's Disease, F. Hefti, P. Brachet, B. Will, and Y. Christen, eds., pp. 222-233. Springer-Verlag, New York.
DeKosky, S. T., and Scheff, S. W. 1990. Synapse loss in frontal cortex biopsies in Alzheimer's disease: Correlation with cognitive severity. Ann. Neurol. 27(5), 457-464.
Flood, D. G., and Coleman, P. D. 1986. Failed compensatory dendritic growth as a pathophysiological process in Alzheimer's disease. Can. J. Neurol. Sci. 13, 475-479.
Grumbacher-Reinert, S., and Nicholls, J. 1992. Influence of substrate on reduction of neurites following electrical activity of leech Retzius cells in culture. J. Exp. Biol. 167, 1-14.
Horn, D., and Ruppin, E. 1995. Compensatory mechanisms in an attractor neural network model of schizophrenia. Neural Comp. 7(1), 182-205.
Horn, D., Ruppin, E., Usher, M., and Herrmann, M. 1993. Neural network modeling of memory deterioration in Alzheimer's disease. Neural Comp. 5, 736-749.
Katzman, R. 1986. Alzheimer's disease. N. Engl. J. Med. 314, 964-973.
LeMasson, G., Marder, E., and Abbott, L. F. 1993. Activity-dependent regulation of conductances in model neurons. Science 259, 1915-1917.
Masliah, E. 1995. Mechanisms of synaptic dysfunction in Alzheimer's disease. Histol. Histopathol. 10, 509-519.
Masliah, E., and Terry, R. 1994. The role of synaptic pathology in the mechanisms of dementia in Alzheimer's disease. Clin. Neurosci. 1, 192-198.
Masliah, E., Mallory, M., Hansen, L., Alford, M., Terry, R., Baudier, J., and Saitoh, T. 1991. Patterns of aberrant sprouting in Alzheimer's disease. Neuron 6, 729-739.
Masliah, E., Ellisman, M., Carragher, B., Mallory, M., Young, S., Hansen, M. D., DeTeresa, R., and Terry, R. D. 1992. Three-dimensional analysis of the relationship between synaptic pathology and neuropil threads in Alzheimer disease. J. Neuropathol. Exp. Neurol. 51(4), 404-414.
Masliah, E., Mallory, M., Hansen, L., DeTeresa, R., Alford, M., and Terry, R. 1994. Synaptic and neuritic alterations during the progression of Alzheimer's disease. Neurosci. Lett. 174, 67-72.
Mattson, M. P., and Kater, S. B. 1989. Excitatory and inhibitory neurotransmitters in the generation and degeneration of hippocampal neuroarchitecture. Brain Res. 478, 337-348.
McKee, A. C., Kosik, K. S., and Kosik, N. W. 1991. Neuritic pathology and dementia in Alzheimer's disease. Ann. Neurol. 30(1), 56-65.
Morgan, J. I., and Curran, T. 1991. Stimulus-transcription coupling in the nervous system: Involvement of inducible proto-oncogenes fos and jun. Annu. Rev. Neurosci. 14, 421-451.
Murphy, D. G., DeCarli, C. D., Daly, E., Gillette, J. A., McIntosh, A. R., Haxby, J. V., Teichberg, D., Schapiro, M. B., Rapoport, S. I., and Horwitz, B. 1993. Volumetric magnetic resonance imaging in men with dementia of the Alzheimer type: Correlations with disease severity. Biol. Psychiat. 34(9), 612-621.
Perry, G., Kawai, M., and Tabaton, M. 1991. Neuropil threads of Alzheimer's disease show a marked alteration of normal cytoskeleton. J. Neurosci. 11, 1748-1755.
Roch, J. M., Masliah, E., Roch-Levecq, A.-C., Sundsmo, M. P., Otero, D. A. C., Veinberg, I., and Saitoh, T. 1994. Increase of synaptic density and memory retention by a peptide representing the trophic domain of the amyloid β/A4 protein precursor. Proc. Natl. Acad. Sci. U.S.A. 91, 7450-7454.
Ruppin, E., and Reggia, J. 1995. A neural model of memory impairment in diffuse cerebral atrophy. Br. J. Psychiat. 166(1), 19-28.
Samuel, W., Terry, R., DeTeresa, R., Butters, N., and Masliah, E. 1994. Clinical correlates of cortical and nucleus basalis pathology in Alzheimer dementia. Arch. Neurol. 51, 772-778.
Scheff, S. W., Sparks, D. L., and Price, D. A. 1993. Synapse loss in the temporal lobe in Alzheimer's disease. Ann. Neurol. 33, 190-199.
Schilling, K., Dickinson, M. H., Connor, J. A., and Morgan, J. I. 1991. Electrical activity in cerebellar cultures determines Purkinje cell dendritic growth patterns. Neuron 7, 891-902.
Selkoe, D. J. 1994. Normal and abnormal biology of the β-amyloid precursor protein. Annu. Rev. Neurosci. 17, 489-517.
Stuart, G. J., and Sakmann, B. 1994. Active propagation of somatic action potentials into neocortical pyramidal cell dendrites. Nature (London) 367, 69-72.
Terry, R. D., Hansen, L. A., DeTeresa, R., Davies, P., Tobias, H., and Katzman, R. 1987. Senile dementia of the Alzheimer type without neocortical neurofibrillary tangles. J. Neuropathol. Exp. Neurol. 46, 262-268.
Terry, R. D., Masliah, E., Salmon, D. P., Butters, N., DeTeresa, R., Hill, R., Hansen, L. A., and Katzman, R. 1991. Physical basis of cognitive alterations in Alzheimer's disease: Synapse loss is the major correlate of cognitive impairment. Ann. Neurol. 30, 572-580.
Terry, R. D., Hansen, L., and Masliah, E. 1994. Structural alterations in Alzheimer's disease. In Alzheimer's Disease, R. D. Terry, ed., pp. 179-196. Raven Press, New York.
Tsodyks, M. V., and Feigel'man, M. V. 1988. The enhanced storage capacity in neural networks with low activity level. Europhys. Lett. 6, 101-105.
Turrigiano, G., Abbott, L. F., and Marder, E. 1994. Activity-dependent changes in intrinsic properties of cultured neurons. Science 264, 974-977.
van Ooyen, A., and van Pelt, J. 1994. Activity-dependent outgrowth of neurons and overshoot phenomena in developing neural networks. J. Theor. Biol. 167, 27-43.
van Ooyen, A. 1994. Activity-dependent neural network development. Network 5, 401-423.
Willshaw, D. J., Buneman, O. P., and Longuet-Higgins, H. C. 1969. Non-holographic associative memory. Nature (London) 222, 960-962.
Wippold, F. J., Gado, M. H., Morris, J. C., Duchek, J. M., and Grant, E. A. 1991. Senile dementia and healthy aging: A longitudinal CT study. Radiology 179(1), 215-219.
Received August 28, 1995; accepted January 12, 1996.
Communicated by Misha Mahowald
Spike Train Processing by a Silicon Neuromorph: The Role of Sublinear Summation in Dendrites

David P. M. Northmore
John G. Elias
Departments of Psychology and Electrical Engineering, University of Delaware, Newark, DE 19716 USA

A dendritic tree, as part of a silicon neuromorph, was modeled in VLSI as a multibranched, passive cable structure with multiple synaptic sites that either depolarize or hyperpolarize local "membrane patches," thereby raising or lowering the probability of spike generation of an integrate-and-fire "soma." As expected from previous theoretical analyses, contemporaneous synaptic activation at widely separated sites on the artificial tree resulted in near-linear summation, as did neighboring excitatory and inhibitory activations. Activation of synapses of the same type close in time and space produced local saturation of potential, resulting in spike train processing capabilities not possible with linear summation alone. The resulting sublinear synaptic summation, as well as being physiologically plausible, is sufficient for a variety of spike train processing functions. With the appropriate arrangement of synaptic inputs on its dendritic tree, a neuromorph was shown to discriminate input pulse intervals and patterns and pulse train frequencies, and to detect correlation between input trains.

1 Introduction
The goal of constructing artificial nervous systems in hardware for generating behavior in real environments can best be approached at the present time by using the techniques of VLSI. Mead (1989) has made the point that certain aspects of CMOS transistor physics naturally lend themselves to the emulation of neural structures and functions. Current technologies allow the construction of high densities of neuron-like elements, or neuromorphs to use Mead's term, for use in large-scale working systems for robots. Our aim was first to create VLSI neuromorphs that incorporate some of the basic operational principles of real neurons, and second to develop a system for interconnecting thousands of neuromorphs to form networks (Elias 1993; Elias and Northmore 1995). This approach is providing a useful modeling medium for testing ideas about brain function, particularly at the microcircuit and small-system level (Elias and Northmore 1996), and at the single-neuron level, as in the work described
Neural Computation 8, 1245-1265 (1996) © 1996 Massachusetts Institute of Technology
1246
David P. M. Northmore and John G. Elias
here. Although computer modeling is supremely flexible, and can be made as realistic as needed (e.g., Koch and Segev 1989), hardware models have the advantage of speed and compactness, making them better suited for robots that interact with the real world. While a number of VLSI implementations of neurons and neuronal processing systems have been built (see Douglas et al. 1995 for a review), most have focused on processing specific forms of sensory information, usually visual or auditory. Our approach is to develop "general purpose" neuromorphs that can be readily used in large-scale networks. Signaling between our neuromorphs, as in much of the biological nervous system, is by means of impulses. Our electronic system takes full advantage of high-speed digital circuitry for multiplexing impulse signals on a limited number of wires and, importantly, makes it possible to route a neuromorph's output along pathways with programmable delays to an arbitrary set of synapses in a network. The utility of such a system, either as a modeling tool or for adapting a network to a practical task, requires that all these aspects of connectivity be easily and rapidly programmable. Neuromorph design is, of necessity, an exercise in the art of the possible. With the objective of large-scale systems in mind, each neuromorph should be compact, dissipate little power, and be insensitive to silicon process variations, yet perform with adequate reliability and precision. Its circuit components should be simple, yet allow as much programmability as possible. To meet these requirements, our present silicon neuromorphs (Elias 1993; Elias and Northmore 1995) were designed to embody the characteristics of the classical neuron (Eccles 1957; Rall et al. 1992). Each comprises a spatially extensive, branched dendritic tree that receives and integrates pulsatile inputs at a number of synaptic sites, both excitatory and inhibitory.
The resulting postsynaptic potentials diffuse passively to a "soma" that generates pulsatile outputs. Simplistic though it is, we show that the classical neuron, constructed for the most part with linear components, exhibits a number of useful signal-processing capabilities. Since Rall (1964) showed that a passive dendritic branch could perform spatiotemporal filtering, a body of mostly theoretical work has led to a better understanding of how a neuron's response to afferent impulse trains is affected by the branching structure of dendritic trees and the spatial disposition of synaptic inputs (for reviews see Rall et al. 1992; Mel 1994). Because the same principles apply to our artificial dendritic tree neuromorphs, we are able to demonstrate that these devices can respond selectively to input spike frequencies, discriminate temporal patterns of spikes, and detect correlations between input spike trains, given the appropriate arrangement of synaptic connections upon the dendritic tree. Because of recent interest in synchronous spiking activity in cortical neurons and the possibility that it may provide a solution to the binding problem (Singer and Gray 1995), we also examined the response of
Spike Train Processing
1247
our neuromorph to varying degrees of input synchrony. It might be assumed that a neuron could detect synchronous firing in a set of afferents simply by temporal summation: the more nearly coincident in time the afferent synaptic activations, the larger the summated EPSPs. However, a number of factors are likely to combine to make N highly synchronized input spikes less effective in exciting the recipient neuron than N spikes distributed over a longer time period. Recently, simulations (Murthy and Fetz 1994; Bernander et al. 1994) have examined the conditions under which a neuron's refractoriness following the firing of a spike could limit temporal summation of highly synchronized inputs, making them relatively less effective. In this study, we examine how sublinear synaptic summation in a neuromorph can influence its response to inputs with various degrees of synchrony, and compare these results with those produced by refractoriness alone. We also explore some synaptic arrangements that detect synchrony more efficiently than simple temporal summation.

2 Experimental Hardware System
Our silicon neuron, or neuromorph, comprises an artificial dendritic tree (ADT) (Elias 1993) and a leaky integrate-and-fire "soma" (Fig. 1b). The dendritic branches are composed of a series of identical compartments, each with a capacitor, Cm, representing a membrane capacitance, and two programmable resistors, Ra and Rm, representing the axial cytoplasmic and membrane resistances. Synapses at each compartment are emulated by a pair of MOS field-effect transistors. One of these, the excitatory synapse, enables inward current, moving the potential of the membrane capacitance in a depolarizing direction; the other, emulating an inhibitory synapse, enables outward current, hyperpolarizing the membrane capacitance. There is also a facility for converting alternate inhibitory synapses into shunting or "silent inhibitory" synapses that pull the membrane potential toward a potential close to the resting value. The potential appearing at the soma (Vs, see Fig. 1b) determines the output spike firing rate in conjunction with the integration time constant, RC, and the threshold, Vth. Synapses are activated by a brief impulse signal (50 nsec) applied to the gate terminals of the synapse transistors. Experiments were performed on a single neuromorph consisting of an 8-branch ADT fabricated with a 2-um CMOS double-poly n-well process on a 2 x 2-mm MOSIS Tiny Chip format. The dendrite dynamics were set by switched-capacitor circuits that emulated Ra and Rm, to give membrane time constants in the range of 5-20 msec. The dendritic trunk is terminated at a source follower to provide a low-impedance input to the on-chip integrate-and-fire spike generator, which has a programmable time constant (Fig. 1b). For most experiments, the spike generator had a relatively short RC time constant of 0.75 msec in order to produce reasonably high spike firing rates.

David P. M. Northmore and John G. Elias

Figure 1: (a) A five-compartment segment of artificial dendritic tree (ADT). Ra, axial resistance; Rm, membrane resistance; Cm, membrane capacitance. Upper row of synapse transistors is excitatory, lower row inhibitory. Alternate inhibitory transistors shunt membrane voltage to Vrest for a programmable time period. (b) A dendritic tree with spike-generating "soma" of the leaky integrate-and-fire type. Vs is the "membrane voltage" at the soma; R and C form the integrator; Vth is the spike-firing threshold.

The neuromorph under study was embedded in the "virtual wire" system previously described (Elias 1993). This system allows spikes generated by neuromorphs to be distributed with programmable delays to arbitrary synapses throughout a network. A host computer set the spike firing threshold, Vth (Fig. 1b), and generated the spatiotemporal patterns of spikes used as test stimuli for the neuromorph. It also recorded the membrane potential at the soma, Vs, and read the "spike history," which is part of the virtual wire system, to obtain the times of occurrence of the output spikes generated by the neuromorph. Calculations show that the conductance of the synapse transistors, when activated for 50 nsec, is sufficient to move the potential on the capacitor in the activated compartment almost instantaneously to a level close to the supply voltages: either 5 V for excitatory activation or 0 V for inhibitory activation. The charge delivered during a synaptic activation depends upon the conductance of the transistor in the on state and the potential on the compartmental capacitor. The diffusion of charge through the dendritic tree leads to a response at the soma that can outlast the activating pulse by up to nine orders of magnitude at the longest membrane time constants.
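The compartmental structure just described can be approximated in simulation. The sketch below is our own idealization, not the authors' design tools; all component values are illustrative assumptions. It models one dendritic branch as a chain of leaky RC compartments coupled by axial resistors, with a synaptic activation that instantaneously forces one compartment to the 5-V rail, as the 50-nsec transistor pulse nearly does:

```python
import numpy as np

def simulate_branch(n=5, Cm=1e-9, Rm=20e6, Ra=5e6,
                    syn_comp=3, v_syn=5.0, t_end=0.05, dt=1e-5):
    """Forward-Euler simulation of one passive branch; compartment 0 is the
    proximal (soma) end. Returns the voltage trace at compartment 0."""
    v = np.zeros(n)          # membrane voltages relative to rest (V)
    v[syn_comp] = v_syn      # excitatory activation saturates its compartment
    trace = np.empty(int(t_end / dt))
    for k in range(trace.size):
        i = -v / Rm                        # leak through membrane resistance
        i[:-1] += (v[1:] - v[:-1]) / Ra    # axial current from distal neighbor
        i[1:] += (v[:-1] - v[1:]) / Ra     # axial current from proximal neighbor
        v = v + dt * i / Cm
        trace[k] = v[0]
    return trace
```

Consistent with the behavior described below, the response at the proximal end of such a chain is smaller and later the farther the activated compartment lies from the soma.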
The shape and duration of the synaptic impulse response depend upon the dendrite dynamics, which, by means of the adjustable resistors Ra and Rm, are programmable over a range of six orders of magnitude (Elias and Northmore 1995). Excitatory impulse responses typical of those obtained under the prevailing experimental conditions are shown in Figure 2b and f. Inhibitory impulse responses have similar shape, but are negative-going. The amplitude of an impulse response appearing at the soma falls off exponentially with the distance of the activated synapse from the soma, while its latency to peak is approximately proportional to distance (Jack et al. 1975; Agmon-Snir and Segev 1993;
Spike Train Processing
Elias and Northmore 1995). Hence, the efficacy and delay accorded an input can be controlled to a certain extent by choosing the dendritic locus of synaptic activation. The overall effect at the soma, however, depends not only on the synapse's electrotonic distance from the soma but also on the local synapse voltage just prior to activation. This is the basis for the sublinear effect reported in this paper.

3 Results

3.1 Two-Pulse Responses. The classical modes of signal combination in neurons involve the spatial and temporal summation of postsynaptic potentials and are often thought of as linear processes. Near-linear summation of two inputs can be obtained with the synaptic connections shown in Figure 2a, in which two input spikes, A and B, activate excitatory synapses at two sites that are equidistant from the soma but on different branches of the dendritic tree. The smaller impulse responses in Figure 2b, measured at the soma, were evoked by spikes A or B delivered separately to their respective synapses. When both A and B were delivered with some intervening time interval (2 msec in Fig. 2b), the resulting combined response (large response in Fig. 2b) was very close to the sum of the individual impulse responses. Figure 2c shows that as the A-B interval lengthened, the integral of the combined response was always twice that of a single impulse response, while the peak amplitude of the combined response declined, an effect that was reflected in the mean rate of spiking of the soma (Fig. 2d). The precise form of the spike output function depends upon the setting of the spike generator threshold and its RC time constant. In the typical case shown in Figure 2d, the neuromorph fired 0-3 spikes on each trial, the number depending upon the A-B interval. This accounts for the tendency to quantization shown by the plateaus on the mean spike response curves.
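The near-linear regime can be illustrated with a toy calculation. Here we stand in for the measured somatic impulse responses of Figure 2b with idealized alpha functions (an assumption of ours, chosen only for its qualitative shape): summing two such responses on different branches leaves the integral exactly twice that of a single response at every A-B interval, while the peak declines as the interval grows.

```python
import numpy as np

def alpha_response(t, tau=3e-3):
    """Idealized somatic impulse response; tau is an illustrative time-to-peak."""
    return np.where(t > 0, (t / tau) * np.exp(1.0 - t / tau), 0.0)

t = np.arange(0.0, 0.1, 1e-5)           # 100 msec at 10-usec resolution
single = alpha_response(t)
peaks, integrals = [], []
for delta in (0.0, 2e-3, 5e-3, 10e-3):  # A-B intervals (sec)
    # different branches: combined response is the linear sum
    combined = alpha_response(t) + alpha_response(t - delta)
    peaks.append(combined.max())
    integrals.append(combined.sum() * 1e-5)   # rectangle-rule integral
```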
A departure from near-linear summation can be demonstrated by the synaptic connections shown in Figure 2e, where spikes A and B depolarize the same dendritic "membrane patch" via excitatory synapses. The combined response is significantly less than the sum of the individual responses for short A-B intervals, in terms of both the integral and the peak amplitude (Fig. 2g). As a consequence, the neuromorph's soma potential and its spike output become tuned to a range of input spike intervals (Fig. 2h). Here the neuromorph fired 0-2 spikes depending on the A-B interval. Raising the spike generator threshold reduces the number of output spikes and increases the sharpness of tuning.

3.2 Dendritic Saturating Sublinearity. Sublinear summation of the type just described is due to a reduction in the local synaptic driving potential in dendrites (Rall 1964; Shepherd and Koch 1990; Koch and Poggio 1992). In our present ADTs, each synaptic activation nearly saturates the potential of the activated compartment (Vtop in the case of an excitatory synapse, GND for an inhibitory; see Fig. 1b). If, before this potential has fully decayed, a subsequent activation occurs at the same compartment, it will deliver less charge than the first and add correspondingly smaller increments to the potential appearing at the soma. The resultant conditioning of an impulse response by prior activity can be influenced in various ways. The sublinear effects are promoted by any condition that slows the decay of potential from the activated dendritic compartment. For example, reducing the leakage of charge through the membrane resistance, by increasing Rm, lengthens the optimal interval of the two-spike response (e.g., Fig. 2g and h). Sublinear effects may also be prolonged by reducing the spread of charge along the dendrite by increasing the axial resistance, Ra. Thus, all spike-interval and frequency-selective effects due to sublinear summation can be readily tuned in our neuromorphs by varying the dendritic dynamics (Elias and Northmore 1995). Diffusion of charge from a compartment can also be influenced by contemporaneous synaptic activations in neighboring compartments, so that the spatiotemporal arrangement of synaptic activations on dendrites can lead to a number of interesting and useful effects. Some of these will be illustrated by the neuromorphic responses to input spike trains of different frequencies.

Figure 2: Measurements on a two-branched ADT neuromorph demonstrating near-linear and sublinear summation. Input pulses, A and B, activated an excitatory synapse (#6) located 38% along the branch from the soma, on different branches (a-d), or on the same branch (e-h) of the ADT. (b, f) Impulse responses measured at the soma evoked by pulses A and B, individually (smaller curves), and together (larger curve). In this example, B followed A by 2 msec. (c, g) Peak impulse response and integral as a function of A-B time interval. 100% denotes response peak or integral to a single input pulse. (d, h) Mean number of spikes generated by the soma as a function of A-B interval.

3.3 Spike-Frequency Selectivity. Figure 3 shows the results of experiments to demonstrate how an appropriate pattern of synaptic connections confers spike-frequency-selective properties on a single neuromorph, which, for this simple illustrative case, requires two dendritic branches. Figure 3a shows the synaptic connections made by two afferent trains of input spikes, A and B, the spikes being Poisson distributed with mean frequencies varying from 0 to 500 spikes per second. Poisson-distributed spike trains were used because they resemble natural spiking behavior more closely than constant-frequency trains, although the latter gave qualitatively similar results. Each input spike simultaneously activated the specified synapses on its corresponding dendritic branch.
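The saturating-compartment mechanism of Section 3.2 reduces to a simple rule: each activation drives the compartment to the rail, so the charge it injects is proportional to the driving potential remaining when it arrives. A minimal sketch follows (our abstraction; Vtop, C, and tau are illustrative values, not the chip's):

```python
import math

def delivered_charges(intervals, Vtop=5.0, C=1e-9, tau=10e-3):
    """Charge injected by successive activations of one excitatory synapse.
    `intervals` lists the waiting time before each activation; between
    activations the compartment voltage decays toward rest with tau."""
    v = 0.0                                # compartment voltage, rest = 0 V
    charges = []
    for dt in intervals:
        v *= math.exp(-dt / tau)           # passive decay since last event
        charges.append(C * (Vtop - v))     # charge ~ remaining driving potential
        v = Vtop                           # activation saturates the compartment
    return charges
```

A second activation 1 msec after the first injects far less charge than one arriving 100 msec later, which is the origin of the interval tuning of Figure 2g and h.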
Figure 3b shows the mean soma voltage, Vs, as a function of mean input spike frequency for different synaptic connection patterns. When a single excitatory synapse was driven at increasing frequency, Vs increased linearly at first and then with ever-decreasing rate (curve A1), demonstrating sublinear summation at one dendritic compartment. As neighboring synapses were recruited for simultaneous activation by the train, Vs increased in amplitude, but with more pronounced sublinearity (curves A2, A3, A4). This effect is attributable to a decreased potential gradient in the neighborhood of the activated region of dendrite. Similar effects on membrane hyperpolarization are obtained by activating a cluster of the hyperpolarizing inhibitory synapses. A more linear response from a synaptic site can be obtained by increasing the local potential gradient. This is achieved by activating neighboring synapses of opposite polarity, as is illustrated on the lower branch of Figure 3a. The synaptic interaction in this case leads to an essentially linear relationship between Vs and input frequency (Fig. 3b, curve B). Note that because the inhibitory synapse is closer to the soma than is
Figure 3: Measurements on a two-branched ADT neuromorph showing impulse and spike responses evoked by Poisson-distributed spike trains of different frequencies. (a) Synaptic connections of input trains A and B. Open circles are excitatory synapses, filled circles inhibitory. (b) Mean soma voltage vs. mean frequency of input trains. Curves A1-A4: simultaneous activation of 1, 2, 3, or 4 excitatory synapses (synapses 14-20) on one branch. Curve B: simultaneous activation of one inhibitory and one neighboring excitatory synapse (10, 9). Since the inhibitory synapse is the stronger, curve B is actually negative-going, but plotted positive-going for comparison.
the excitatory synapse, the simultaneous activation of the pair results in net inhibition, so that curve B is actually negative but plotted positive for comparison. (The pairing of opposite-type synapses could also be used to produce a near-linear increase in net excitation if the excitatory synapse were closer to the soma.) The effect is significant in that a synapse cluster
Figure 3 (continued): (c) and (d) show responses to simultaneous activation of all synapses shown in (a) by a single input train; (c) mean soma voltage relative to resting potential; (d) mean frequency of soma spike output.
of this type can generate a response that overtakes the sublinear response of a cluster of same-sign synapses, depending upon the two clusters' relative proximity to the soma. Activating two such clusters on different branches of the dendritic tree with the same input train results in nearly linear summation between the two, giving the peaked frequency response of Figure 3c. With the threshold of the spike generator set appropriately, the neuromorph's spike output exhibits a pronounced spike-frequency tuning (Fig. 3d). The tuning characteristics can be altered by rearranging the synaptic connections. For example, by subtracting curve B from curve A2, we can shift the tuning peak to lower frequencies. However, a wider range of tuning is available by adjustment of dendritic dynamics via Ra and Rm.

Figure 4: (a) Connections for coincidence detection. Two random Poisson trains (A and B) activate the excitatory (open circles) and inhibitory synapses (filled circles) on three dendritic branches as shown.

3.4 Interactions between Multiple Sources. With an understanding of synaptic interactions, it is possible to design input connections to the dendritic tree to make a neuromorph respond to particular combinations of input patterns in time and space. The connections shown in Figure 4a deliver two different afferent spike trains to excitatory synapses on separate branches, and to inhibitory synapses on a third, common branch. The neuromorph's response to either train alone will be essentially zero because the excitatory and inhibitory effects sum nearly linearly and cancel. However, if spikes from the two trains arrive synchronously, or nearly so, the inhibitory responses summate sublinearly, with the result that excitation prevails over inhibition, and the neuromorph responds by firing. Figure 4b shows examples of pairs of random input trains of equal mean frequency. Over the first half of the record, the trains were uncorrelated; over the second half they became correlated, and the neuromorph fired at an average rate of 42 spikes/sec. The membrane time constant, 20 msec in this case, determined the latency of the buildup in firing rate. The contour plot of Figure 5 shows another example of interactions between two sources. The synaptic connections of two Poisson-distributed spiking sources A and B shown in Figure 5a are similar to those of Figure 4 in that some of the inhibitory synapses share dendritic sites, whereas none of the excitatory synapses do. When one of the input sources is
Figure 4: (b) Four samples of trains A, B. For the first 60 msec (left of vertical dotted line), the two trains were uncorrelated; thereafter, train B was generated by randomly jittering the time of occurrence of each spike in train A by up to ±1.5 msec. (c) Histogram showing times of occurrence of the resulting neuromorph output spikes accumulated over 100 trials. On the right side of (c), where the inputs are correlated, the average output firing rate is about 2.5 spikes in 60 msec, or about 42 spikes/sec.
relatively inactive, activity of the other source generates approximately equal excitation and inhibition, and little output spike firing by the neuromorph. As both inputs simultaneously increase in frequency, the inhibitory synapses that they share in common sum their effects sublinearly, failing to oppose the total excitation generated, and the rate of output firing increases. With the synaptic arrangements shown and a dendritic membrane time constant of 7 msec, output firing reached a maximum when the mean frequencies of both inputs were about 170 spikes per second. As input frequencies increased still further, output firing declined. This effect was caused by excitation leveling off before inhibition. Because the excitatory synapses were located proximally (second compartment) on the dendritic tree, although on different branches, they were electrotonically closer to each other than were the inhibitory synapses. Consequently, the excitatory responses summed less effectively than the inhibitory responses at high input frequencies. The input spike frequency response shown in Figure 5b can be shaped by an appropriate selection of synaptic connections, dendritic tree dynamics, and output spike-firing threshold voltage. As an example, if the excitatory and inhibitory synapses shown in Figure 5a are switched, the highest response is evoked by A or B firing alone.

3.5 Pulse Pattern Discrimination. Similar principles of combining linear and sublinear summation can be applied to the discrimination of purely temporal patterns of input spikes. An illustration of how synaptic connections can be arranged to discriminate pairs of three-pulse patterns is shown in Figure 6. The problem posed was the discrimination of temporal patterns of pulses that differ only in the timing of the middle pulse, where the number of pulses and their overall duration are equal. The neuromorph was required to fire spikes following only one of the two input trains, that designated the positive stimulus.
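The coincidence-detecting arrangement of Figure 4a can be reduced to the same saturating rule. In the sketch below (our simplification, with unit synaptic effects and an assumed 20-msec time constant), each input spike adds one unit of excitation on its own branch, while all inhibitory activations share one saturating compartment; only near-synchronous spikes leave net excitation.

```python
import math

TAU = 0.020   # assumed membrane time constant (sec)

def net_drive(train_a, train_b):
    """Total excitation minus inhibition for two spike trains. Excitatory
    synapses sit on separate branches (linear summation); inhibitory synapses
    share a common compartment that saturates on each activation."""
    events = sorted(train_a + train_b)
    v_inh = 0.0              # shared inhibitory compartment voltage (0..1)
    t_last = None
    excitation = inhibition = 0.0
    for t in events:
        if t_last is not None:
            v_inh *= math.exp(-(t - t_last) / TAU)   # decay since last event
        excitation += 1.0                  # separate branches: full effect
        inhibition += 1.0 - v_inh          # shared compartment: saturating effect
        v_inh, t_last = 1.0, t
    return excitation - inhibition
```

A lone train cancels exactly; synchronous spikes on the two trains leave the second inhibitory activation largely ineffective, so excitation prevails.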
In the solution shown in Figure 6a, the input pulses were delivered directly to an excitatory synapse on one branch, and a delayed version to an excitatory synapse at the corresponding site on a separate branch. The requisite delay was generated by the delayed synaptic connection facility provided by the "virtual wire" system. Figure 6b shows the results of experiments in which the interval between the first and third spikes was fixed at 15 msec and the middle spike was stepped in 2-msec increments between them. The neuromorph output spikes accumulated over 10 trials are shown on the same time scale as the input pulses. In the case shown in Figure 6b, spikes were generated only when the second input pulse occurred at or before the midpoint of the first and third pulses. With the positive stimulus configuration (i.e., short intervals between first and second pulses), the delayed version of the second pulse coincided with the third input pulse on different branches, leading to near-linear summation. Coincidences that occurred between pulses on the same branches were subject to sublinear summation and were therefore
Figure 5: Neuromorphic response to combinations of two input train frequencies. (a) Synaptic connections of trains A and B. Open circles, excitatory synapses; filled circles, inhibitory synapses. (b) Contour plot showing neuromorph mean output spike firing frequency as a function of the mean frequencies of the Poisson-distributed trains, A and B. Contours are numbered in spikes/sec.
Figure 6: Discrimination of three-spike patterns differing in the timing of the middle spike. (a) The three-spike input train was connected to a two-branch neuromorph as shown. Activation on one branch was delayed by 10 msec. (b) The timing of the second input spike was varied. Neuromorph output spikes are shown superimposed from 10 repetitions of each input pattern. The neuromorph fired a single spike, with the probabilities shown, when the second input spike occurred in the first half of the 15-msec period.
less effective in firing the neuromorph. The coincidence between the delayed first pulse and the undelayed second pulse also took place on different branches, but it occurred before the membrane potential had risen sufficiently to reach the spike firing threshold. The spike generator threshold had to be adjusted appropriately to give good discrimination. Changing the delay to 5 msec reversed the discrimination, making the pattern with the later middle spike the positive stimulus.
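The Figure 6 discrimination can be caricatured with pure delay-and-sum arithmetic. In the sketch below (our simplification: unit EPSPs, purely linear summation, and an assumed 5-msec decay; the same-branch sublinear discount described above is ignored), the direct and 10-msec-delayed copies of the three-spike pattern are merged, and the positive stimulus is the one whose cross-branch coincidence arrives late enough to ride on the accumulated potential:

```python
import math

def peak_drive(pattern, delay=0.010, tau=0.005):
    """Peak of a leaky sum of unit EPSPs from the direct copy of `pattern`
    plus a copy delayed onto a second branch (evaluated at event times)."""
    events = sorted(pattern + [t + delay for t in pattern])
    return max(sum(math.exp(-(t - s) / tau) for s in events if s <= t)
               for t in events)

positive = [0.0, 0.005, 0.015]   # middle spike early
negative = [0.0, 0.010, 0.015]   # middle spike late
```

With a firing threshold between the two peaks (about 2.3 in this toy setting), only the early-middle-spike pattern fires the cell; changing the delay moves the discriminating coincidence, as in the experiment.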
3.6 Postspiking Refractoriness. A possible role for the refractory period in a neuromorph's response to synchrony was investigated in the manner of Bernander et al. (1994) by delivering trains of a fixed number of spikes over a variable time period, T. In the present experiments, each spike of the input train was delivered to an excitatory synapse on a different dendritic branch of an eight-branch neuromorph, all synapses being located at the same distance from the soma (Fig. 7a). The purpose of separating successively activated synapses on different branches was to assess the effects of refractoriness in the absence of sublinear summation. Figure 7b shows that this was successful. With eight synaptic activations on separate branches, the number of spikes generated by the neuromorph was maximal at the lowest train duration, T (equivalent to 800 spikes/sec), when there was maximal temporal summation of EPSPs. Spike output fell off monotonically as the train duration increased (i.e., as train frequency decreased), due to diminishing temporal summation. This is to be contrasted with activations on the same branch, when sublinear summation substantially reduced the response to the most synchronized trains, producing a tuned frequency response. We used two different methods of making the neuromorph refractory following spike firing. In the first, neuromorph output spikes triggered a one-shot that held the capacitor, C, of the spike generator at ground potential for a time that determined the absolute refractory period. Because the spike generator was isolated from the dendritic tree by a source follower, this procedure had no effect upon the integrating function of the dendrites. Figure 7c shows that introducing a refractory period by this means limited the maximum number of output spikes generated, without achieving any spike-frequency tuning.
The second method shunted the postsynaptic potentials within the dendrite by routing output spikes to proximal shunting inhibitory synapses on all eight branches. This had the effect of clamping the membrane potential to Vrest for a programmable time. Figure 7d shows that the maximum number of output spikes was again reduced, but the most synchronized trains (800 spikes/sec) were somewhat less effective than trains spread out over longer durations. This modest spike-frequency tuning effect may be contrasted with that produced by sublinear summation (Fig. 7b).
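The contrast between temporal summation and the first refractory mechanism can be seen even in a minimal integrate-and-fire sketch (our idealization: unit-weight EPSP increments and illustrative constants). Making the spike generator refractory merely caps the output count; it does not penalize synchronous input, because integration continues:

```python
import math

def count_spikes(input_times, a=0.4, tau=0.005, thresh=1.0, refrac=0.0):
    """Leaky integrate-and-fire 'soma' driven by EPSP increments of size a.
    With refrac > 0, spike generation is blocked for that long after each
    output spike, but the membrane keeps integrating (first method in text)."""
    v, t_prev, t_ok, n = 0.0, 0.0, 0.0, 0
    for t in sorted(input_times):
        v *= math.exp(-(t - t_prev) / tau)   # leaky decay between inputs
        v += a                               # unit synaptic increment
        t_prev = t
        if v >= thresh and t >= t_ok:        # fire only outside refractory period
            n += 1
            v = 0.0
            t_ok = t + refrac
    return n

def train(T, n=8):
    """n input spikes spread evenly over duration T."""
    return [i * T / (n - 1) for i in range(n)]
```

Synchronized trains (short T) drive more output spikes than spread trains, and a 2-msec refractory period lowers that maximum without creating frequency tuning, qualitatively as in Figure 7b and c.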
Figure 7: Comparison of sublinear synaptic summation with refractoriness. (a) To obviate the sublinear effect, each spike of an eight-spike input train was delivered to a different branch of an eight-branch ADT. To show sublinear summation effects, all eight input spikes activated the same synapse on one branch (not illustrated). The eight input spikes were distributed over a variable time period, T. (b) Mean number of spikes output by the neuromorph as a function of T, for synaptic activation on separate branches and on the same branch. (c) Mean spike output as a function of T when the soma was made refractory for 2 msec. (d) Mean spike output as a function of T when output spikes activated proximal shunting synapses for 2 msec on all branches.
4 Discussion

4.1 Dendrites, Real and Artificial. Dendrites of real neurons are much more complex than the electrically passive structure of the classical model. The present VLSI modeling of dendrites is further constrained by what can reasonably be done using integrated resistors, capacitors, and transistors. Despite the simplifications and expediences of their current design, our artificial dendrites have certain features, in addition to
those responsible for the classical cable properties, that help them mimic physiological processes. The use of switched capacitors allows a very compact VLSI implementation of resistors with extremely high values, so that in emulating the axial and membrane resistances of dendrites, it is possible to use membrane capacitances of small value, and consequently small area. An added bonus is the capability of changing the axial and membrane resistances very easily with the clock frequencies controlling the switched capacitors. The ability to make rapid adjustments to the membrane resistance, and therefore the time constant, provides a way of simulating the effects of background synaptic activity or modulatory influences that could affect excitability and temporal processing (Bernander et al. 1991; Agmon-Snir and Segev 1993). A similar capability to adjust axial resistances, while not actively used by neurons as far as we know, provides a means of readily setting the electrotonic length of the dendritic branches. While the phenomenon of sublinear synaptic summation has been fully described (e.g., Rall 1964; Shepherd and Koch 1990; Koch and Poggio 1992), its role in dendritic information processing has not been demonstrated. Ferster and Jagadeesh (1992) argued convincingly that sublinear summation occurs in pyramidal cells of visual cortex. Their measurements suggested that the dendritic membrane can be almost completely depolarized by normal visual stimulation, with a consequent reduction of individual EPSP amplitude. Simulations with anatomically characterized neurons also suggest that multiple excitatory activations result in a reduction of EPSP amplitude (Bush and Sejnowski 1993; Mel 1993). Sublinear effects, including saturation, are likely to be important at points of high input resistance, such as thin dendrites and dendritic spines, and for inhibitory synapses, whose reversal potential generally lies close to the average membrane potential.
The connections responsible for coincidence detection (Fig. 5), for example, depended upon sublinear summation of inhibitory inputs. Our fixed-weight synapses, after one or two closely spaced activations, produce a membrane polarization somewhat equivalent to that in a real dendritic branch after an intense afferent bombardment of many weak synapses. The intensity of the bombardment required to produce this state of polarization will depend upon the local input resistance. Given that such states are obtained in real neurons, the rules governing linear and sublinear summation listed in the next section would apply with the kind of consequences our results have already demonstrated.
4.2 Linearity and Nonlinearity of Synaptic Connection Patterns. Here we summarize the possible interactions involving synchronous depolarizing and hyperpolarizing synaptic inputs on the ADT. Synchronous synaptic activation is taken to mean that the activations take place within approximately one membrane time constant.
1. Near-linear summation is seen when synchronous synaptic activations occur at electrotonically distant sites, or when synchronous depolarizing and hyperpolarizing activations occur at electrotonically neighboring sites.

2. Sublinear summation occurs when a set of neighboring synapses of the same type, i.e., a cluster, is activated synchronously. This includes the case of a single synapse being activated at intervals less than about one membrane time constant.

These rules are useful in an intuitive approach to designing synaptic connectivities for various spike processing applications. Alternatively, systematic search procedures could be used (Elias 1992; Northmore and Elias 1994).

4.3 Temporal Processing of Spike Trains. Frequency-selective neural circuits play an important part in neural information processing, particularly in sensory systems where the temporal structure of environmental signals must be analyzed. For example, neurons in the auditory system of frogs fire preferentially to the frequency of amplitude modulation present in the mating calls of conspecifics (Rose 1986). The mechanism whereby these cells respond to specific spike train frequencies could involve the dynamics of certain membrane conductances (Zucker 1989; McCormick 1990; Midtgaard 1994), although the present work suggests that an appropriate pattern of excitatory and inhibitory synapses on passive dendrites is sufficient. The basic processes underlying temporal discriminations are best understood from the two-pulse experiments. At synaptic activation intervals much longer than the membrane time constant, a failure of temporal summation to exceed the spike firing threshold leads to "high-pass" characteristics with spike trains. As activation intervals shorten, the growth of the soma potential is limited by the saturation of the membrane potential in the neighborhood of the activated compartments, with a corresponding sublinearity in spike firing output.
Given an appropriate setting of the spike firing threshold, these two processes generate selectivity for input spike intervals; given control over axonal delays, discrimination of more complex temporal patterns of input spikes is possible. "Band-pass" characteristics of the kind required of modulation detectors in frogs depend upon the interplay of near-linear and sublinear summation across different regions of the dendritic tree. Opportunities for tuning the peaks include changing the dendrite dynamics (via Ra and Rm) (Elias and Northmore 1995) and adjusting the synaptic connectivity. An alternative mechanism for generating spike-frequency selectivity, refractoriness following spike firing, can also limit a neuron's responsiveness to high input frequencies if synaptic activations are wasted during the refractory period. Using a single-compartment model, Bernander et al. (1994) showed that refractoriness resulted in fewer output spikes
being generated in response to synchronized inputs than to relatively desynchronized inputs. Our neuromorph exhibited no such effect when its "soma" was made refractory (Fig. 7c) because it is isolated from the artificial dendritic tree, where most of the integration of postsynaptic potentials takes place. Refractoriness in this case merely reduces the gain of the process that translates the potential at the trunk of the dendritic tree into spikes, and does so regardless of input activation frequency. When, on the other hand, refractoriness was produced by shunting the dendritic tree after each output spike, a modest falloff in the response to high frequencies (i.e., synchronized inputs) was observed (Fig. 7d). Similar considerations are likely to apply to real neurons, in that refractoriness of the soma will be much less discriminatory against activations occurring distally on dendrites than proximally. If, however, somatic spikes propagate into the dendritic tree and activate conductances that shunt synaptic potentials there (Stuart and Sakmann 1994), some wastage of highly synchronous synaptic inputs will occur. Sublinear synaptic summation, on the other hand, operates locally, and therefore with greater specificity as to input source. Neurons possess a variety of membrane channels and intracellular mechanisms that confer highly nonlinear properties (for reviews see Mel 1994; Midtgaard 1994), some of which facilitate synaptic transmission after repetitive synaptic activation, and some of which, like sublinear summation, depress it. Modeling of facilitatory synaptic processes (e.g., Mel 1993; Buonomano and Merzenich 1995) has shown that neurons could perform temporal discriminations and correlations of the kind explored in this work.
That artificial dendritic trees, composed of simple, linear components, can also perform such discriminations, given the appropriate pattern of input connections, suggests that they reflect fundamental properties of dendritic structure, to which the complexities of active membrane channels may be considered functional add-ons.
Acknowledgments

Research supported by National Science Foundation Grant BCS-9315879, and by the University of Delaware Research Foundation. The authors wish to thank the reviewers for their comments and suggestions.
References

Agmon-Snir, H., and Segev, I. 1993. Signal delay and input synchronization in passive dendritic structures. J. Neurophysiol. 70, 2066-2085.
Bernander, O., Douglas, R., Martin, K., and Koch, C. 1991. Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. U.S.A. 88, 11569-11573.
Bernander, O., Koch, C., and Usher, M. 1994. The effect of synchronized inputs at the single neuron level. Neural Comp. 6, 622-641.
Buonomano, D. V., and Merzenich, M. M. 1995. Temporal information transformed into a spatial code by a neural network with realistic properties. Science 267, 1028-1030.
Bush, P. C., and Sejnowski, T. J. 1993. Simulations of synaptic integration in neocortical pyramidal cells. In Computation and Neural Systems, F. H. Eeckman and J. M. Bower, eds., pp. 97-101. Kluwer, Boston.
Douglas, R., Mahowald, M., and Mead, C. 1995. Neuromorphic analogue VLSI. Annu. Rev. Neurosci. 18, 255-281.
Eccles, J. C. 1957. The Physiology of Nerve Cells. Johns Hopkins Univ. Press, Baltimore, MD.
Elias, J. G. 1992. Genetic generation of connection patterns for a dynamic artificial neural network. In Combinations of Genetic Algorithms and Neural Networks, L. D. Whitley and J. D. Schaffer, eds., pp. 38-54. IEEE Computer Society Press, Los Alamitos, CA.
Elias, J. G. 1993. Artificial dendritic trees. Neural Comp. 5, 648-664.
Elias, J. G., and Northmore, D. P. M. 1995. Switched-capacitor neuromorphs with wide-range variable dynamics. IEEE Trans. Neural Networks 6(6), 1542-1548.
Elias, J. G., and Northmore, D. P. M. 1996. Oscillatory networks of silicon neurons. In preparation.
Ferster, D., and Jagadeesh, B. 1992. EPSP-IPSP interactions in cat visual cortex studied with in vivo whole-cell patch recording. J. Neurosci. 12, 1262-1274.
Jack, J. J. B., Noble, D., and Tsien, R. W. 1975. Electric Current Flow in Excitable Cells. Clarendon Press, Oxford.
Koch, C., and Poggio, T. 1992. Multiplying with synapses and neurons. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds., pp. 315-345. Academic Press, San Diego, CA.
Koch, C., and Segev, I. 1989. Methods in Neuronal Modeling. MIT Press, Cambridge, MA.
Mead, C. 1989. Analog VLSI and Neural Systems. Addison-Wesley, Reading, MA.
Mel, B. W. 1993. Synaptic integration in an excitable dendritic tree. J. Neurophysiol. 70, 1086-1101.
Mel, B. W. 1994. Information processing in dendritic trees. Neural Comp. 6, 1031-1085.
McCormick, D. A. 1990. Membrane properties and neurotransmitter actions. In Synaptic Organization of the Brain, G. M. Shepherd, ed., pp. 32-66. Oxford Univ. Press, New York.
Midtgaard, J. 1994. Processing of information from different sources: Spatial synaptic integration in the dendrites of vertebrate CNS neurons. Trends Neurosci. 17, 166-173.
Murthy, V. N., and Fetz, E. E. 1994. Effects of input synchrony on the firing rate of a three-conductance cortical neuron model. Neural Comp. 6, 1111-1126.
Northmore, D. P. M., and Elias, J. G. 1994. Evolving synaptic connections for a silicon neuromorph. Proc. IEEE Conf. Evolutionary Computation, Orlando, 2, 753-758.
Spike Train Processing
Received August 21, 1995; accepted December 18, 1995
Communicated by Haim Sompolinsky
A Large Committee Machine Learning Noisy Rules

R. Urbanczik
Nordita, Blegdamsvej 17, DK-2100 Copenhagen Ø, Denmark
Statistical mechanics is used to study generalization in a tree committee machine with K hidden units and continuous weights trained on examples generated by a teacher of the same structure but corrupted by noise. The corruption is due to additive gaussian noise applied in the input layer or the hidden layer of the teacher. In the large K limit the generalization error ε_g as function of α, the number of patterns per adjustable parameter, shows a qualitatively similar behavior for the two cases: It does not approach its optimal value and is nonmonotonic if training is done at zero temperature. This remains true even when replica symmetry breaking is taken into account. Training at a fixed positive temperature leads, within the replica symmetric theory, to an α^(−k) decay of ε_g toward its optimal value. The value of k is calculated and found to depend on the model of noise. By scaling the temperature with α, the value of k can be increased to an optimal value k_opt. However, at one step of replica symmetry breaking at a fixed positive temperature ε_g decays as α^(−k_opt). So, although ε_g will approach its optimal value with increasing sample size for any fixed K, the convergence is only uniform in K when training at a positive temperature.

1 Introduction
In many cases feedforward networks can be used to estimate an unknown function from examples, and much effort has been devoted to understanding this generalization ability. One approach is to characterize the computational power of a network. In statistical mechanics for networks with a binary output this leads to the concept of storage capacity, that is, the number of random input/output pairs that can be implemented by the net. This concept is closely related to the Vapnik-Chervonenkis dimension of the net (Opper 1995), and both quantities may be used to make general statements on the generalization ability of a network. In a slightly different approach, statistical mechanics can be applied to the study of a neural network learning a specific task; a review is given in Watkin et al. (1993). The focus in this approach has been mainly on single layer networks and, for more complex architectures, on realizable tasks where the function to be learned can be exactly implemented by the network. Here, we calculate the generalization behavior of a two-layer network, a tree committee machine with many hidden units, when confronted with specific instances of an unrealizable task.

Neural Computation 8, 1267-1276 (1996) © 1996 Massachusetts Institute of Technology

The realizable case for this architecture has been considered in Mato and Parga (1992) and Schwarze and Hertz (1992), and for the fully connected committee in Schwarze and Hertz (1993), Kang et al. (1993), and Schwarze (1993). The corresponding unrealizable task for the fully connected architecture is treated in Urbanczik (1995). By considering an unrealizable task, we address a learning problem that is perhaps more realistic. Further, this will allow us to interpolate between the realizable case and the task of implementing a random mapping considered in the capacity problem. For the former it was found that even in the limit of infinitely many hidden units the network is able to generalize when trained on αN patterns, where N is the number of weights in the network (Schwarze and Hertz 1992). Remarkably, the capacity calculations indicate that the number of patterns that can be stored per weight diverges with the number of hidden units (Barkai et al. 1992; Engel et al. 1992). The generalization behavior can be used to bound the capacity, and this yields good approximations to the capacity, as derived by replica methods, for the perceptron and the parity machine but not for the tree committee (Opper 1995).

The committee machine consists of K hidden units, each characterized by a weight vector J_k and receiving an input ξ_k. Both J_k and ξ_k are N/K dimensional real vectors, and J_k is normalized to 1. Presented with an input ξ = (ξ_k) the machine computes

σ(ξ) = sign[ Σ_{k=1..K} sign(J_k · ξ_k) ]    (1.1)

The teacher is characterized by the normalized weight vectors T_k and computes

τ(ξ, η) = sign[ Σ_{k=1..K} ( γ2 sign(γ1 T_k · ξ_k + η_k) + η̄_k ) ]    (1.2)

Here the η_k and η̄_k will be assumed to be independent gaussian random variables with variances

Var(η_k) = 1 − γ1²,  Var(η̄_k) = 1 − γ2²    (1.3)

If γ1 = γ2 = 1 the output of the teacher is deterministic, while it is independent of ξ and random if γ1γ2 = 0. For K = 1 and γ2 = 1 this is equivalent to the model studied in Györgyi and Tishby (1990) for the perceptron. We shall assume 0 < γ1γ2 < 1 throughout this paper. To simplify the calculations, the large K limit will be considered (K → ∞ but K ≪ N). The generalization error ε_g of the network characterized by J, the student, is defined as the average of Θ[−σ(ξ) τ(ξ, η)] over η and ξ, where the ξ_k are drawn independently from the normal distribution and Θ is the
Heaviside step function. If the overlap J_i · T_i does not depend on the site index i and is equal to R, we have for large K

ε_g = (1/π) arccos[ γ2 ψ(γ1 R) ],  where ψ(x) = 1 − (2/π) arccos x    (1.4)

For R = 1 the minimal value ε_min of ε_g is attained, which also quantifies the amount of noise in the output of the teacher. The optimization of ε_g with respect to J is based on a training set of αN examples generated by picking inputs ξ^μ and noise terms η^μ independently from the above distributions and assigning target outputs τ^μ by (1.2). The training error of a student is then given by

ε_t(J) = (1/(αN)) Σ_{μ=1..αN} Θ[ −τ^μ σ(ξ^μ) ]    (1.5)

This can be used to define a probability density p(J) with Boltzmann weight e^(−βαN ε_t(J)), where the inverse temperature β = 1/T controls the penalty assigned to students with positive training error. We study typical properties of a student picked from p(J), that is, properties that hold for almost all choices from p(J) and almost all training sets in the limit N → ∞. Our main interest is to calculate, as a function of α, the typical generalization error ε_g of a student picked from p(J). The replica formalism is used to determine this learning curve. We first assume replica symmetry (RS) and then discuss the corrections given by one step of replica symmetry breaking. The implications for finite K of the results derived in the large K limit are discussed in the concluding section.

2 Replica Symmetric Theory
For the RS-estimate F of the quenched average of the free energy per weight associated to p(J) one finds, by methods entirely analogous to Schwarze and Hertz (1992):

−βF = extr_{R,q} [ G_r(R, q, β) + G_s(R, q) ]

G_r = 2α ∫ Dt H( R_e t / √(q_e − R_e²) ) U( −√(q_e) t / √(1 − q_e) )

U(x) = ln[ e^(−β) + (1 − e^(−β)) H(x) ]

G_s = (1/2) [ (q − R²)/(1 − q) + ln(1 − q) ]    (2.1)

Here Dt denotes the gaussian measure and H(x) = ∫_x^∞ Dt. The effective values of the order parameters occurring in the energy term are given by q_e = ψ(q) and R_e = γ2 ψ(γ1 R) for ψ(x) = 1 − (2/π) arccos x. The value of q is the overlap between the J_i in different replicas and is assumed independent of the site index i. The interpretation of R is the one given
Figure 1: Learning curves at T = 0 and ε_min = 0.25. The top curve is for γ1 = 1 (γ2 = 1/√2), the lower one for γ2 = 1 (γ1 = 0.896).
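The quantities behind these curves can be estimated by direct Monte Carlo simulation of the model of Section 1. The sketch below is only an illustration (all function names are ours); it assumes the noise placement of equation 1.2, with input-layer noise controlled by γ1 inside the hidden-unit sign and hidden-layer noise controlled by γ2 outside it.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def student(J, xi):
    # sigma(xi): sign of the sum of the K hidden-unit signs (equation 1.1)
    s = sum(1 if dot(w, x) > 0 else -1 for w, x in zip(J, xi))
    return 1 if s > 0 else -1

def teacher(T, xi, g1, g2, rng):
    # tau(xi, eta): input-layer noise (variance 1 - g1^2) inside the hidden
    # sign, hidden-layer noise (variance 1 - g2^2) outside it (equation 1.2)
    s = 0.0
    for Tk, xk in zip(T, xi):
        u = g1 * dot(Tk, xk) + math.sqrt(1 - g1 ** 2) * rng.gauss(0, 1)
        s += g2 * (1 if u > 0 else -1) + math.sqrt(1 - g2 ** 2) * rng.gauss(0, 1)
    return 1 if s > 0 else -1

def eps_g_mc(J, T, g1, g2, samples=2000, seed=1):
    # Monte Carlo estimate of eps_g = <Theta[-sigma(xi) tau(xi, eta)]>
    rng = random.Random(seed)
    n = len(J[0])
    errs = 0
    for _ in range(samples):
        xi = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(len(J))]
        if student(J, xi) != teacher(T, xi, g1, g2, rng):
            errs += 1
    return errs / samples
```

With J equal to T and γ1 = γ2 = 1 the estimate is exactly 0; as γ1γ2 → 0 the teacher output becomes random and the estimate approaches 1/2.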
above, and thus the relationship between R and α given by the extremal condition is the central object of our study. The above expressions are independent of K and indeed only valid for 1 − q ≫ 1/K. We shall first consider learning at zero temperature, which corresponds to the strategy of allowing only those students that perform optimally on the training set. For finite α the equations have to be solved numerically, and typical learning curves are shown in Figure 1. The generalization error does not approach ε_min and is nonmonotonic. For q → 1 the integration in 2.1 can be done analytically, but the resulting equations for the order parameters are transcendental. The value of R approached as α → ∞ is plotted in Figure 2. One may also verify that for sufficiently high α the value of R decreases with α. So the form of the learning curves given in Figure 1 seems to be generic. In some ways these results are quite similar to those for the perceptron (Györgyi and Tishby 1990). There an increasing generalization error was found when training on corrupted outputs in an interval [α_0, α_c], where α_c corresponds to the maximal number of patterns for which a zero training error can be achieved. In the present case, as a consequence of the large K limit and in agreement with the capacity calculations, α_c = ∞. So it is not surprising that ε_g does not approach ε_min, since the teacher has positive error on the training set. The generalization error behaves very differently if training takes
Figure 2: Asymptotic value of R as function of ε_min at T = 0. The top curve is for γ2 = 1, the value of γ1 being implied by equation 1.4 for R = 1. The lower curve is for γ1 = 1.

place at T > 0. Learning curves for different temperatures are shown in Figure 3. They suggest that the student will approach optimal performance for any positive temperature. For q → 1 the asymptotic form of G_r is given by equation 2.2.
Here φ(β) is a function that increases faster than linearly with β and approaches zero; φ(β) ∝ β² as β → 0. For large α the value of R can be calculated analytically, and one finds to leading order

ε_g − ε_min ∝ α^(−k1)  if γ1 < 1
ε_g − ε_min ∝ α^(−k2),  1 − q ∝ α^(−1)  if γ1 = 1, γ2 < 1    (2.3)

with exponents k1 and k2 that depend on the model of noise. Formally, the difference between the two cases is due to the divergence at R = γ1 = 1 of the derivative R_e′ of R_e with respect to R. Note that the difference between the T = 0 and T > 0 asymptotics does not imply that ε_g is discontinuous at T = 0 for any fixed α. It does imply that the continuity in T is not uniform with respect to α. A similar, if less dramatic, improvement of ε_g by training at a positive temperature has already been observed in Seung et al. (1992) for a perceptron learning an unrealizable rule.
Figure 3: Learning curves at ε_min = 0.25, γ2 = 1 (γ1 = 0.896), and different temperatures.
The prefactors omitted in (2.3) depend on β, so we can speed up learning by scaling β with α. The fastest possible asymptotic decay of ε_g is then, to leading order,

ε_g − ε_min ∝ α^(−k_opt)    (2.4)

attained for a power-law scaling of β (and hence of 1 − q) with α; the optimal exponent k_opt again differs between the cases γ1 < 1 and γ1 = 1, γ2 < 1.

3 Replica Symmetry Breaking
Replica symmetry breaking was found in the capacity calculations, so we may expect it here too. By calculations similar to Barkai et al. (1992) and Györgyi and Tishby (1990), the AT-condition for the local instability of the replica symmetric saddlepoint is αΛ(1 − q)² ≥ 1, where Λ (equation 3.1) is a gaussian integral over squared second derivatives of the energy term, built from the effective order parameters q_e and R_e and from U, the function defined in 2.1. In Figure 4 the minimal α for which the condition holds is shown for T = 0 as function of the level of noise.
Figure 4: The value of α as function of ε_min for which the RS-saddlepoint becomes unstable at T = 0. The higher curve is for γ2 = 1, the lower for γ1 = 1.

It interpolates between the noiseless case, where the RS-saddlepoint is always stable, and the value found in the capacity problem. Thus a better estimate F̂ of the free energy should be obtained by one step of replica symmetry breaking. One finds:
−βF̂ = extr_{R, q0, q1, m} [ G_r(R, q0, q1, m, β) + G_s(R, q0, q1, m) ]    (3.2)

The relationship between the order parameters and their effective values in the energy term is the same as in the RS-theory. The equations 3.2 do not hold unless 1 − q1 ≫ 1/K.
At T = 0, the same scaling of the order parameters for α → ∞ may be used as in the capacity problem (Urbanczik 1994) and we find

G_r(R, q0, q1, m, β) + G_s(R, q0, q1, m) → G_r(R, q0, mβ) + G_s(R, q0)    (3.3)

Thus R is equal to its replica symmetric value and the generalization performance is unaffected by the breaking of replica symmetry. As long as the student space does not contain the teacher, its geometry is unimportant. The stationary values of the other order parameters scale as ln[(1 − q1)^(−1)] ∝ ln(m^(−1)) ∝ (1 − q0)^(−1) ∝ α². Thus βF̂ diverges very quickly, and the large K expansion used in deriving 3.2 is only self-consistent if α ≪ √K.

For the asymptotics at fixed positive T one first eliminates q1 by the extremal condition and finds 1 − q1 ∝ m^(−4) β^(−4). Furthermore, the values of R and q0 approach the corresponding ones of the RS-theory at temperature mβ (equation 3.4). In a formal similarity to the discrete case (Krauth and Mézard 1989), stationarity with respect to m leads to a condition on the RS-entropy S(β) of the system, namely

S(mβ) = −2 − ln(m⁵ α⁴)    (3.5)

For large α and the stationary values of R and q0 one finds, using 2.2, the scaling of S(mβ) (equation 3.6). If m and β are fixed, R and q0 behave as R and q in the RS-theory 2.3, and S(mβ) diverges as a power of α. So 3.5 implies m → 0. Further, 3.5 and the stationarity condition for q0 yield m ∝ (1 − q0)^(1/4). The analogous relation β ∝ (1 − q)^(−1/4) holds in the RS-theory when β is scaled to achieve optimal generalization. And indeed, the one step theory predicts that at fixed temperature ε_g decays to ε_min as the same power of α as we found in the RS-theory by optimally scaling T with α (2.4).
4 Discussion

The main implication of the above findings is that although ε_g does approach ε_min as α → ∞ for any fixed K, this convergence is only uniform in K if training takes place at T > 0. For T = 0 this can be understood by observing that the training error of the optimal student J* will stochastically converge to ε_min with increasing α. Consequently, at zero temperature J* will not even lie in the version
space [Boltzmann density p(J*) = 0] as long as students with lower training error exist, and good generalization will be achieved only when α has the magnitude of the critical capacity α_c in the random map problem (γ1γ2 = 0). Recently

α_c = (16/π) √(ln K)    (4.1)

has been found in Monasson and Zecchina (1995). Although replicas are used there as well, in contrast to this paper the space of interactions is decomposed into convex subsets of students by the criterion that students in one subset should give rise to the same internal representation. Note that this scaling is entirely compatible with the self-consistency condition α ≪ √K of the large K theory presented above. This suggests that one step of RSB is sufficient for large α.

Since √(ln K) N and not just O(N) patterns are needed for good generalization at T = 0, it is interesting to consider the shape of the rescaled learning curve ε_g(α̃), as function of α̃ = α/√(ln K), for K → ∞. For small α̃ the generalization error will be constant, the value of R being the one found in the limit α → ∞ (Fig. 2). As α̃ increases, ε_g(α̃) will approach ε_min. But it may be that ε_g(α̃) = ε_min holds already for some finite value of α̃.

At any positive temperature a competition between energy and entropy takes place, and the probability P(ε_t) that a student picked from p(J) has training error ε_t is given by

P(ε_t) ∝ e^(−βαN ε_t) V_K(ε_t)    (4.2)

Here V_K(ε_t) denotes the volume of students having training error ε_t. The most probable value of ε_t will approach ε_min if V_K(ε_t) shrinks quickly enough as α → ∞. A sufficient condition is that with increasing α

N^(−1) ln V_K(ε_t) ≪ −α    (4.3)

for a fixed ε_t < ε_min. So no change of scale will occur in the learning curve at T > 0 if 4.3 holds even when the large K limit is taken first.¹ In contrast to learning at zero temperature, we do not need that V_K(ε_t) actually vanish for sufficiently large α and ε_t < ε_min. By finding upper bounds on V_K(ε_t), it may be possible to show in the T > 0 case that the scale of the learning curve does not change with increasing K for a larger class of unrealizable rules than the ones considered here.

Acknowledgments
I thank J. Hertz for his valuable advice and direction and M. Sporre for many useful discussions. This work was supported by the E. Batschelet-Mader Foundation and the M. Geldner Foundation.

¹For ε_t = 0 and large K, equation 3.3 yields that typically ln[−N^(−1) ln V_K(0)] ∝ α².
References

Barkai, E., Hansel, D., and Sompolinsky, H. 1992. Broken symmetries in multilayered perceptrons. Phys. Rev. A 45, 4146-4161.
Engel, A., Kohler, H. M., Tschepke, F., Vollmayr, H., and Zippelius, A. 1992. Storage capacity and learning algorithms for two layer neural networks. Phys. Rev. A 45, 7590-7607.
Gyorgyi, G., and Tishby, N. 1990. Statistical theory of learning a rule. In Neural Networks and Spin Glasses, W. K. Theumann and R. Koberle, eds., pp. 3-36. World Scientific, Singapore.
Kang, K., Oh, J.-H., Kwon, C., and Park, Y. 1993. Generalization in a two layer network. Phys. Rev. E 48, 4805-4809.
Krauth, W., and Mezard, M. 1989. Storage capacity of memory networks with binary couplings. J. Phys. France 50, 3057-3066.
Mato, G., and Parga, N. 1992. Generalization properties of multilayered neural networks. J. Phys. A 25, 5047-5054.
Monasson, R., and Zecchina, R. 1995. Weight space structure and internal representations: A direct approach to learning and generalization in multilayer neural networks. Phys. Rev. Lett. 75, 2432-2435.
Opper, M. 1995. Statistical physics estimates for the complexity of feedforward neural networks. Phys. Rev. E 51, 3613-3618.
Schwarze, H. 1993. Learning a rule in a multilayer neural network. J. Phys. A 26, 5781-5794.
Schwarze, H., and Hertz, J. 1992. Generalization in a large committee machine. Europhys. Lett. 20, 375-380.
Schwarze, H., and Hertz, J. 1993. Generalization in fully connected committee machines. Europhys. Lett. 21, 785-790.
Seung, H. S., Sompolinsky, H., and Tishby, N. 1992. Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056-6091.
Urbanczik, R. 1994. Storage capacity of the tree-structured committee machine with discrete weights. Europhys. Lett. 26, 233-238.
Urbanczik, R. 1995. A fully connected committee machine learning unrealizable rules. J. Phys. A 28, 7097-7104.
Watkin, T., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Mod. Phys. 65, 499-556.
Received January 24, 1995; accepted January 8, 1996
Communicated by Peter Bartlett
Vapnik-Chervonenkis Generalization Bounds for Real Valued Neural Networks Arne Hole Department of Mathematics, University of Oslo, Box 2053, Blindern, N-0316 Oslo, Norway
We show how lower bounds on the generalization ability of feedforward neural nets with real outputs can be derived within a formalism based directly on the concept of VC dimension and Vapnik's theorem on uniform convergence of estimated probabilities. 1 Introduction
Concerning historical background on the probably approximately correct (PAC) learning model and related issues, I refer to Haussler (1992) and the references therein. The results we obtain in this paper are of the following format, roughly described: Let a neural net architecture and a learning algorithm be given. Suppose that you choose a training set x consisting of m examples at random, give it as input to the learning algorithm, and observe that the learned function f_x has (mean) error ≤ γε on the training set x. Then the probability that f_x has global mean error larger than ε is less than B, where B is a bound. The bound B will depend on m, and also on some other quantities. Note that the probability we want to bound is a conditional probability; it is the probability that f_x has global error larger than ε given that it has been observed to have error ≤ γε on the training set.

Comparing the results of this paper to the results on feedforward neural networks given in Haussler (1992), the main difference lies in the domain of applicability. The "sharp" learning criterions considered in the first part of this paper are not covered by Haussler's feedforward network results. On the other hand, Haussler's treatment is far more flexible than the formalism presented here, and it covers a vast number of situations where the results of this paper do not apply. However, concerning learning with respect to continuous "loss functions" (which is treated in the second part of this paper), some comparisons of results can be made. The bounds we obtain in Section 11 for the special classes of network models considered there are stronger than the bounds obtained for such networks in Haussler's paper (and related works). But then

Neural Computation 8, 1277-1299 (1996) © 1996 Massachusetts Institute of Technology
again, the domain of applicability for our results is much more limited. Among other things, we rely on special properties of sigmoid-shaped activation functions. Also, we treat only the case of one hidden layer and one output node. In contrast, Haussler's results are valid for almost any kind of activation functions and node types, and for any number of layers. The present work is also related to other recent developments in complexity based learning theory of real valued functions. In particular, slightly weaker and/or different versions of the bounds obtained in Lemmas 1, 2, and 3 follow from results in Goldberg and Jerrum (1993), Maass (1993), Karpinski and Macintyre (1995), and Anthony et al. (1995). Related material can also be found in Alon et al. (1993), Barron (1993), Bartlett et al. (1994), Bartlett and Long (1995), Anthony and Bartlett (1995), Gurvits and Koiran (1995), and Lee et al. (1995).

On notation: The set of real numbers is denoted R. If h is a set, then card(h) means the cardinality of h, and ℘(h) is the power set of h. The composition of two maps ψ and φ is denoted ψ ∘ φ, i.e., ψ ∘ φ(ξ) = ψ[φ(ξ)]. If A and B are sets, the set of functions f: A → B from A to B is denoted Map(A, B). The notation A^m means the m-fold Cartesian product of A with itself, for each integer m ≥ 1. If a = (a_1, …, a_m) ∈ A^m and b = (b_1, …, b_m) ∈ B^m, then by (a; b) we mean the element in (A × B)^m given by (a; b)_i = (a_i, b_i) for 1 ≤ i ≤ m. If α and β are events in some probability model with probability measure P, we write the conditional probability of β given α as P(β | α). Thus P(β | α) = P(α ∩ β)/P(α).

Concerning the organization of the article, I have chosen to treat learning with and without noise as two different cases, starting with the noiseless case. The paper is self-contained with respect to definitions and formalism. All the proofs given are "local," i.e., they can be skipped without losing the thread of the paper.
2 VC Dimension and Related Concepts
Let h be an arbitrary set, let m ≥ 1 be an integer, and let s = (s_1, …, s_m) be an arbitrary ordered sequence of m objects. We define

s ∩ h = {i | 1 ≤ i ≤ m and s_i ∈ h}

If H is a family of sets, we define s ∩ H = {s ∩ h | h ∈ H}, and put Δ_H(s) = card(s ∩ H). Note that s ∩ H ⊆ ℘({1, …, m}). If Δ_H(s) = 2^m, then H is said to shatter the sequence s. For each integer m ≥ 1, define

Δ_H(m) = max{Δ_H(s) | s is a sequence of m objects}

Let VCdim(H) be the greatest integer m such that Δ_H(m) = 2^m, if such an m exists. Otherwise, let VCdim(H) = +∞. It is known (see, e.g., Vapnik
1982) that if d = VCdim(H) is finite, then

Δ_H(m) ≤ Σ_{i=0..d} C(m, i) ≤ (em/d)^d    (2.1)

for all m ≥ d. Let A be a set, and let F ⊆ Map(A, R). For each f ∈ F, let

f⁺ = {(p, t) ∈ A × R | t > f(p)}

Put F⁺ = {f⁺ | f ∈ F}. Note that F⁺ ⊆ ℘(A × R). The quantities Δ_{F⁺}(m) and VCdim(F⁺) will play an important part in the following. It is easily seen that VCdim(F⁺) is equal to the so-called pseudo-dimension of F, as defined in Haussler (1992). The result below is shown, among other places, in that paper.

Observation. Let ψ: R → R be increasing [i.e., x > y ⇒ ψ(x) ≥ ψ(y)]. Let A be a set and G ⊆ Map(A, R). Define F ⊆ Map(A, R) by F = {ψ ∘ g | g ∈ G}. Then for all m we have Δ_{F⁺}(m) ≤ Δ_{G⁺}(m).
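The combinatorial definitions above can be exercised by brute force on finite families of sets; a minimal sketch (function names ours):

```python
from itertools import combinations

def trace(s, H):
    # s ∩ H = { s ∩ h : h in H }, encoded as sets of indices into s
    return {frozenset(i for i, si in enumerate(s) if si in h) for h in H}

def delta(s, H):
    # Delta_H(s) = card(s ∩ H)
    return len(trace(s, H))

def shatters(H, s):
    return delta(s, H) == 2 ** len(s)

def vcdim(H, points):
    # greatest m such that some length-m sequence from `points` is shattered
    d = 0
    for m in range(1, len(points) + 1):
        if any(shatters(H, s) for s in combinations(points, m)):
            d = m
    return d
```

For threshold sets (prefixes) on the line the dimension is 1; for intervals it is 2, since no interval can pick out the two endpoints of a triple without its middle point.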
3 Sharp, Noiseless Learning
By a noiseless learning situation (abbreviated "QL situation," Q for "quiet") we will mean a 9-tuple
Λ = (X, P, Y, F, f_0, S, m, λ, ν)

where

X is a set (called the input space)
P is a probability measure on X
Y is a set (called the output space)
F ⊆ Map(X, Y) is a function class
f_0: X → Y is a function (called the target function)
S is a set
m ≥ 1 is an integer
λ: X^m × S → F is a map (called the learning algorithm)
ν is a probability measure on X^m × S such that the marginal of ν on X^m is P^m

We will usually write λ(x, σ) as λ_{x,σ}, for each x ∈ X^m and σ ∈ S. By a criterion map (or simply a criterion) for the QL situation (X, P, Y, F, f_0, S, m, λ, ν) will be meant a map
θ: Map(X, Y) → ℘(X)
For each f ∈ Map(X, Y), the set θ(f) ⊆ X will be interpreted as the region of the input space where f "behaves well" relative to the target f_0. We assume in the following that all combinations of criterion maps with QL situations considered are such that the standard measurability condition assumed in connection with Vapnik's theorem (Theorem 1 below) is satisfied. This is a mild condition that one need not worry about in practice. Consult Pollard (1984). Given a QL situation Λ = (X, P, Y, F, f_0, S, m, λ, ν) and a criterion map θ for it, for each f ∈ Map(X, Y) and x ∈ X^m we define the error E(f, θ, x) of f with respect to θ on x by

E(f, θ, x) = (1/m) · card{i | 1 ≤ i ≤ m and x_i ∉ θ(f)}

For each f ∈ Map(X, Y) we define the (global) error E(f, θ) of f with respect to θ by

E(f, θ) = 1 − P[θ(f)]

Finally, for each t ∈ [0, 1] let

Π_Λ(t, θ) = ν{(x, σ) ∈ X^m × S | E(λ_{x,σ}, θ, x) ≤ t}
In the context of the formalism we will develop in Section 10, it is natural to refer to the error measures defined above as "sharp." Hence we may refer to learning with respect to criterions θ as defined in this section as "sharp" learning. We will use the following version of Vapnik's theorem.

Theorem 1 (Vapnik). Let Λ = (X, P, Y, F, f_0, S, m, λ, ν) be a QL situation, and let θ be a criterion for it. Let γ ∈ [0, 1), ε ∈ (0, 1) and m ≥ 4/[(1 − γ)²ε]. Then

ν{(x, σ) | E(λ_{x,σ}, θ, x) ≤ γε and E(λ_{x,σ}, θ) > ε} ≤ 2 Δ_{θ(F)}(2m) e^(−(1−γ)²εm/4)

In other words,

P[ E(λ_{x,σ}, θ) > ε | E(λ_{x,σ}, θ, x) ≤ γε ] ≤ 2 Δ_{θ(F)}(2m) e^(−(1−γ)²εm/4) / Π_Λ(γε, θ)

The above version of the theorem is shown in Hole (1995). The proof follows the original one given in Vapnik (1982) closely. The first statement of the theorem gives an improvement on the bound given in Anthony and Shawe-Taylor (1993) by a factor of two, and on the bound given in Vapnik (1982) by a factor of four. It may be remarked that if the additional assumption is made that εm is an integer, then the bound of Theorem 1 can (Hole 1995) be improved by an additional factor of two.
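Theorem 1 turns into a concrete sample-size estimate once the growth function is bounded via 2.1. The sketch below (function names ours) assumes the unconditional bound in the form 2Δ_{θ(F)}(2m)exp[−(1−γ)²εm/4] stated above, with Δ_{θ(F)}(2m) replaced by the Sauer-type estimate:

```python
import math

def theorem1_bound(m, d, gamma, eps):
    # Unconditional bound of Theorem 1, with Delta_{theta(F)}(2m) replaced
    # by the estimate (2em/d)^d from 2.1 (valid for 2m >= d)
    growth = (2 * math.e * m / d) ** d
    return 2 * growth * math.exp(-(1 - gamma) ** 2 * eps * m / 4)

def sample_size(d, gamma, eps, conf):
    # doubling search for an m with bound <= conf; the exponential term
    # eventually dominates the polynomial growth, so the loop terminates
    m = max(d, math.ceil(4 / ((1 - gamma) ** 2 * eps)))
    while theorem1_bound(m, d, gamma, eps) > conf:
        m *= 2
    return m
```

For d = 10, γ = 0.5, ε = 0.1, and confidence level 0.05 this yields an m of order 10^4, illustrating how the polynomial growth function is eventually beaten by the exponential factor.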
4 Interpretation
In this section I will discuss how the formalism of the preceding section can be interpreted in terms of neural networks. Let Λ = (X, P, Y, F, f_0, S, m, λ, ν) be a QL situation. Then X and Y can be taken as the input space and output space of a network architecture, respectively. The class F ⊆ Map(X, Y) can be viewed as the set of functions defined by the architecture (by varying weights and thresholds). The target f_0: X → Y is the (possibly unknown) function we want the network to learn. It is not necessary that f_0 ∈ F. The elements x ∈ X^m are training sequences of length m. The learning algorithm λ associates a function in F to each element (x, σ), where x ∈ X^m is a training sequence and σ ∈ S. The set S is included to model cases where the learning process used is not deterministic. In the deterministic case, we can take S = {0}. Then the probability measure ν on X^m × S reduces to the product measure P^m of P on X^m.

Now let us consider criterion maps θ. As hinted in the previous section, for each f ∈ Map(X, Y) the set θ(f) will be interpreted as the set of p ∈ X such that f(p) is "acceptable" when compared to f_0(p). If Y = {−1, 1} (the boolean case) the obvious choice for θ is the map θ_b given by

θ_b(f) = {p ∈ X | f(p) = f_0(p)}

for all f ∈ Map(X, Y). However, in the case Y = R corresponding to networks with real outputs, the criterion θ_b is too restrictive. Let κ > 0 be fixed. A natural criterion θ to consider in this context is the map θ_κ defined by

θ_κ(f) = {p ∈ X | |f(p) − f_0(p)| ≤ κ}

for all functions f ∈ Map(X, R). Given θ, the quantity E(f, θ, x) naturally plays the role of the (mean) error of f on the sequence x of m points in X, and E(f, θ) represents the global (mean) error of f. The quantity Π_Λ(t, θ) is the probability that the learned function λ(x, σ) has error less than or equal to t on the training sequence x when (x, σ) is drawn at random according to ν. Since the marginal of ν on X^m is assumed to be P^m, taking a random draw according to ν can be interpreted as taking a random draw of x ∈ X^m according to P^m and giving x as input to the (possibly stochastic) learning process. So whether or not the learning process is stochastic, we may conclude that the quantity Π_Λ(t, θ) is the probability that the function resulting from the learning process has error ≤ t on the training set, when the training set x ∈ X^m is drawn at random according to P^m. Note that we are considering noiseless learning here; we assume that we have access to the function values f_0(x_i) for all elements x_1, …, x_m in the training sequence. On the other hand, function values of f_0 on
training sequences are the only information aboutf” we need. The second statement in Theorem 1 now says the following: Suppose that you choose a training sequence x E X“‘ at random according to P”’, give it as input to the learning process, and observe that the resulting learned function fr has error less than or equal to Y E on the training sequence x, i. e., EV;. 0. x) 5 yf. Then the probability (with respect to choice of x) that EV;. 0) > c is less than
If Θ is the boolean criterion Θ_b defined above and F is the function class implemented by a feedforward neural network architecture with linear threshold units, the quantity Δ_{Θb(F)}(2m) appearing in Theorem 1 can be estimated as in Baum and Haussler (1989). We will see in the following sections how bounds on Δ_{Θ(F)}(m) can be obtained. However, to apply Theorem 1 we also need an estimate of the probability Ω_λ(γε, Θ) of "success" on the training set. In some practical cases, it will be possible to estimate this in advance by trying out a number of training sets x and observing for how many of them we get training error ≤ γε. In other cases, one may be able to prove (or feel reasonably sure) that the probability is close to one, or at least not smaller than 1/2. In the following sections we will derive several results having essentially the same form as Theorem 1. The above remarks on interpretation are relevant for these results as well.

5 Reduction to the VC Dimension of F+
To obtain generalization bounds valid for the Θ_κ criterions defined in the previous section, we need the following lemma. Note that the lemma implies the inequality VCdim[Θ_κ(F)] ≤ 2 VCdim(F+). It is easily seen that VCdim[Θ_κ(F)] is equal to the so-called band dimension of F as defined in Anthony et al. (1995). Translated to our setting, that paper gives an upper bound on VCdim[Θ_κ(F)] in terms of VCdim(F+) which is slightly weaker than the one above. (The two bounds differ by a constant factor.)

Lemma 1. Let κ > 0, and let F ⊆ Map(X, R) be a function class. Then

Δ_{Θκ(F)}(m) ≤ [Δ_{F+}(m)]².
Proof. Define the maps θ1, θ2: F → 𝒫(X) by

θ1(f) = {p ∈ X | f(p) ≤ f0(p) + κ} and θ2(f) = {p ∈ X | f(p) ≥ f0(p) − κ}.

Then Θ_κ(f) = θ1(f) ∩ θ2(f) for each f ∈ F, and therefore for each x ∈ X^m

Δ_{Θκ(F)}(x) = card{x ∩ θ1(f) ∩ θ2(f) | f ∈ F}
VC Generalization Bounds
1283
≤ card{x ∩ θ1(f) | f ∈ F} · card{x ∩ θ2(f) | f ∈ F} = Δ_{θ1(F)}(x) · Δ_{θ2(F)}(x).

To complete the proof, it is now sufficient to show that Δ_{θj(F)}(m) ≤ Δ_{F+}(m) for j = 1, 2. We will first show that Δ_{θ1(F)}(m) ≤ Δ_{F+}(m). Let x ∈ X^m be fixed, and choose a finite set ξ ⊆ F such that Δ_{θ1(ξ)}(x) = Δ_{θ1(F)}(x). Let

d0 = min{f(x_i) − f0(x_i) − κ | f ∈ ξ, 1 ≤ i ≤ m and f(x_i) − f0(x_i) − κ > 0}

(with d0 = 1, say, if this set is empty). Define the injection φ: X^m → (X × R)^m by φ(x)_i = (x_i, f0(x_i) + κ + d0). For each f ∈ ξ and 1 ≤ i ≤ m, we then have

x_i ∈ θ1(f) ⟺ f(x_i) ≤ f0(x_i) + κ ⟺ φ(x)_i ∉ f+.

It follows that card{φ(x) ∩ f+ | f ∈ ξ} = Δ_{θ1(ξ)}(x). So Δ_{F+}(φ(x)) ≥ Δ_{θ1(F)}(x). Since x ∈ X^m was arbitrary, it follows immediately that Δ_{θ1(F)}(m) ≤ Δ_{F+}(m).

The proof that Δ_{θ2(F)}(m) ≤ Δ_{F+}(m) is similar. Let x ∈ X^m be fixed, and choose a finite set ξ ⊆ F such that Δ_{θ2(ξ)}(x) = Δ_{θ2(F)}(x). This time, define φ: X^m → (X × R)^m by φ(x)_i = (x_i, f0(x_i) − κ). Then for each f ∈ ξ and 1 ≤ i ≤ m,

x_i ∈ θ2(f) ⟺ f(x_i) ≥ f0(x_i) − κ ⟺ φ(x)_i ∈ f+.

Thus φ(x) ∩ f+ determines {i | x_i ∈ θ2(f)}. Again it follows that card{φ(x) ∩ f+ | f ∈ ξ} = Δ_{θ2(ξ)}(x). The rest is similar to the case of Δ_{θ1(F)}(m). □
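The counting step at the start of the proof, that the trace patterns of Θ_κ(f) on a sample are bounded by the product of the θ1 and θ2 pattern counts, can be checked by brute force on a small finite class. The affine family, target, and sample below are stand-ins chosen only for illustration.

```python
# Verify that the number of distinct traces {x ∩ Theta_kappa(f)} is at most
# (number of theta1 traces) * (number of theta2 traces) for a tiny class.
f0 = lambda p: 0.0          # hypothetical target
kappa = 0.5
x = [-1.0, 0.0, 1.0]
F = [lambda p, w=w, c=c: w * p + c
     for w in (-1.0, 0.0, 1.0) for c in (-1.0, -0.25, 0.25, 1.0)]

def patterns(member):
    # distinct 0/1 traces {(member(f, p))_{p in x} : f in F}
    return {tuple(member(f, p) for p in x) for f in F}

n_theta1 = len(patterns(lambda f, p: f(p) <= f0(p) + kappa))
n_theta2 = len(patterns(lambda f, p: f(p) >= f0(p) - kappa))
n_crit = len(patterns(lambda f, p: abs(f(p) - f0(p)) <= kappa))

print(n_crit, n_theta1, n_theta2)
assert n_crit <= n_theta1 * n_theta2   # the product bound from the proof
```

The assertion holds for any choice of class and sample, since the trace of the intersection criterion is determined by the pair of θ1 and θ2 traces.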
To use Lemma 1, we need bounds on Δ_{F+}(m). The simplest case is when F is a vector space of dimension d. Then VCdim(F+) = d, as is essentially shown in Cover (1965). A proof is also given in Haussler (1992).

6 First Example
In this section, I will derive an upper bound on Δ_{F+}(m) in the case where F represents the function class defined by a network architecture with the following properties:

1. The architecture has a single input node, one hidden layer with n nodes, and a single output node.
2. The activation function in each computation node is h(t) = erf(t), where erf denotes the error function [i.e., the integral of the normal distribution N(0, 1), with h(0) = 0]. The hidden nodes have no thresholds.
To be precise, we let F ⊆ Map(R, R) be the class of all functions f on the form

f(p) = a + Σ_{i=1}^{n} b_i erf(c_i p),

where a, b_1, c_1, …, b_n, c_n ∈ R. Thus the elements in F are analytic functions from X = R to Y = R.
Lemma 2. For m ≥ 4n, the function class F defined in this section satisfies

Δ_{F+}(m) ≤ (em/4n)^{4n}.

This lemma is proved in Appendix A. Note that the total number of parameters in the architecture defining F is W = 2n + 1.
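The zero-counting argument behind Lemma 2 (carried out in Appendix A) shows that the difference of two networks of this form has at most 4n − 1 zeros, so it can change sign at most 4n − 1 times. The sketch below, with invented random parameters and an invented sampling grid, probes this bound numerically.

```python
import math
import random

random.seed(0)
n = 2   # hidden units per network

def net():
    # f(t) = a + sum_i b_i * erf(c_i * t), with random hypothetical parameters
    a = random.uniform(-2, 2)
    bc = [(random.uniform(-2, 2), random.uniform(-3, 3)) for _ in range(n)]
    return lambda t: a + sum(b * math.erf(c * t) for b, c in bc)

for _ in range(200):
    f1, f2 = net(), net()
    vals = [f1(i / 50) - f2(i / 50) for i in range(-300, 301)]
    # each strict sign change on the grid marks a zero of g = f1 - f2
    sign_changes = sum(1 for u, v in zip(vals, vals[1:]) if u * v < 0)
    assert sign_changes <= 4 * n - 1
print("no counterexample found")
```

Since a sign change between grid points forces a zero in between, the observed count can never exceed the true number of zeros, so the assertion is guaranteed by the appendix argument.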
7 Second Example

In this section, I will estimate Δ_{F+}(m) in the case where F is the function class defined by another special kind of network with one hidden layer. Let n, k ≥ 1 be integers, let σ: R → R be increasing, and let h_1, …, h_n: R → R be piecewise linear functions with s knots each. For fixed n, k, σ, and h_1, …, h_n, let F ⊆ Map(R^k, R) be the class of all functions f: R^k → R on the form

f(p) = σ( w_0 + Σ_{i=1}^{n} w_i h_i( w_{i0} + Σ_{j=1}^{k} w_{ij} p_j ) ),

where w_0, w_i, w_{ij} ∈ R for all i and j. The class F can be interpreted as the function class defined by a layered network architecture with the following characteristics:

1. The architecture has k input nodes, one hidden layer with n nodes, and a single output node.
2. The activation function in hidden node number i is h_i, for 1 ≤ i ≤ n.
3. The activation function in the output node is σ.
Lemma 3. For m ≥ W, the function class F described in this section satisfies

Δ_{F+}(m) ≤ (em/k)^{(s+1)W},

where W = nk + 2n + 1 is the number of parameters in F.
This lemma is proved in Appendix B. It may be remarked that the proof of Lemma 3 can quite easily be generalized to the case where the activation functions h_i of the hidden nodes are piecewise polynomial functions of degree ≤ d, where d ≥ 1. The details are omitted. Maass (1993) gives bounds quite similar to Lemma 3, which are valid for more than one hidden layer as well.
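As a concrete instance of the Section 7 class, the sketch below builds such a network with the saturating piecewise-linear activation (the s = 2 example of Section 11, Case 1) and checks the parameter count W = nk + 2n + 1. All numerical values are hypothetical.

```python
k, n = 3, 4
# hypothetical parameter values: input weights w_ij (n*k of them), hidden
# thresholds w_i0 (n), output weights w_i (n), output threshold w_0 (1)
w_in = [[0.1] * k for _ in range(n)]
w_i0 = [0.0] * n
w_out = [0.5] * n
w_0 = 0.0

h = lambda t: max(-1.0, min(1.0, t))   # piecewise linear with s = 2 knots
sigma = lambda t: t                    # increasing output activation

def f(p):
    hidden = [h(b + sum(w * pj for w, pj in zip(ws, p)))
              for ws, b in zip(w_in, w_i0)]
    return sigma(w_0 + sum(w * v for w, v in zip(w_out, hidden)))

W = n * k + 2 * n + 1
assert W == sum([n * k, n, n, 1])   # the parameter count used in Lemma 3
print(f([1.0, -1.0, 0.5]))
```

The four groups of parameters account exactly for the total W that enters the exponent of the growth-function bound.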
8 VCδ-Dimension
Note that the method used to prove Lemma 2 in Appendix A depends strongly on the properties of the particular "sigmoid" activation function h(t) = erf(t) considered. There exist other sigmoid-looking functions for which the bound of the lemma is utterly false. In Sontag (1992), there is even constructed an analytic, sigmoid-shaped, strictly increasing function h: R → R such that the class F ⊆ Map(R, R) of functions f of the form

f(t) = h(wt) + h(−wt),

where w ∈ R is the only parameter, satisfies VCdim(F+) = ∞. Examples of this type indicate that in order to obtain VC generalization bounds valid for real-valued networks using (say) general "sigmoid-shaped" activation functions, we must change our setup somewhat. To this end, we will now define a more "rigid" version of the VC dimension concept for function classes. Let F, H ⊆ Map(X, R), and let δ ≥ 0. The class H is said to be δ-dense in F if for every f ∈ F there is an f′ ∈ H such that

|f′(p) − f(p)| ≤ δ for all p ∈ X.
Define

Δ^δ_{F+}(m) = inf{Δ_{H+}(m) | H is δ-dense in F}.

Note that Δ^0_{F+}(m) = Δ_{F+}(m). We define VCδdim(F+) to be the largest integer m such that Δ^δ_{F+}(m) = 2^m. If no such m exists, VCδdim(F+) = ∞. The quantity VCδdim(F+) bears some relations to the so-called fat-shattering function fat_F(δ) considered in Bartlett et al. (1994), Bartlett and Long (1995), and some of the references therein. The definition of fat_F(δ) is as follows. Let us say that x ∈ X^m is δ-shattered by F if there exists r ∈ R^m such that for each b ∈ {0, 1}^m there is an f ∈ F with the following property: We have f(x_i) ≥ r_i + δ if b_i = 1, and f(x_i) ≤ r_i − δ if b_i = 0. Now fat_F(δ) is defined as the largest integer m such that F δ-shatters some x ∈ X^m. From this definition, it is clear that VCδdim(F+) ≥ fat_F(γ) for all 0 ≤ δ < γ. For if x ∈ X^m is γ-shattered by F with respect to r ∈ R^m (cf. the definition), then every class H that is δ-dense in F must satisfy Δ_{H+}((x; r)) = 2^m, where (x; r) denotes the sequence ((x_1, r_1), …, (x_m, r_m)). Hence VCδdim(F+) ≥ m. There may be a similar bound in the opposite direction as well, but I leave this open here.
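The γ-shattering condition can be tested exhaustively for small samples. The sketch below uses an invented finite grid of affine functions standing in for F; with witness r = 0 it finds that two points are γ-shattered while three are not. For three points the failure is genuine and not an artifact of the grid: an affine function takes the value at the middle point between its values at the endpoints, so the alternating patterns cannot both be realized around any witness.

```python
import itertools

def gamma_shattered(points, r, gamma, funcs):
    # x is gamma-shattered w.r.t. witness r if every b in {0,1}^m is realized:
    # f(x_i) >= r_i + gamma when b_i = 1, and f(x_i) <= r_i - gamma when b_i = 0
    for b in itertools.product([0, 1], repeat=len(points)):
        if not any(all((f(p) >= ri + gamma) if bi else (f(p) <= ri - gamma)
                       for p, ri, bi in zip(points, r, b))
                   for f in funcs):
            return False
    return True

# hypothetical finite search grid of affine functions f(t) = w*t + c
grid = [lambda t, w=w, c=c: w * t + c
        for w in range(-10, 11) for c in range(-10, 11)]

two = gamma_shattered([0.0, 1.0], [0.0, 0.0], 0.5, grid)
three = gamma_shattered([0.0, 0.5, 1.0], [0.0, 0.0, 0.0], 0.5, grid)
print(two, three)   # True False
```

This is consistent with the fat-shattering function of the affine class on R being 2 for every positive scale.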
Theorem 2 (Sharp, noiseless learning). Let A = (X, P, R, F, f0, S, m, λ, ν) be a QL situation. Let γ ∈ [0, 1), ε ∈ (0, 1), κ > 0, δ ≥ 0, and m ≥ 4/((1 − γ)²ε). Then

ν{ E(λ_{x,σ}, Θ_{κ−2δ}, x) ≤ γε and E(λ_{x,σ}, Θ_κ) > ε } ≤ [Δ^δ_{F+}(2m)]² e^{−(1−γ)²εm/4}.

In other words, given that E(λ_{x,σ}, Θ_{κ−2δ}, x) ≤ γε, the conditional ν-probability that E(λ_{x,σ}, Θ_κ) > ε is at most [Δ^δ_{F+}(2m)]² e^{−(1−γ)²εm/4} / Ω_λ(γε, Θ_{κ−2δ}).
Proof. Choose H such that H is δ-dense in F and Δ_{H+}(2m) = Δ^δ_{F+}(2m). For given f ∈ F, choose f′ ∈ H such that |f′(p) − f(p)| ≤ δ for all p ∈ X. Then for all p ∈ X we have

|f′(p) − f0(p)| > κ − δ ⟹ |f(p) − f0(p)| ≥ |f′(p) − f0(p)| − δ > κ − 2δ ⟹ p ∉ Θ_{κ−2δ}(f),

that is, p ∉ Θ_{κ−δ}(f′) implies p ∉ Θ_{κ−2δ}(f). Hence E(f′, Θ_{κ−δ}, x) ≤ E(f, Θ_{κ−2δ}, x). In the same manner, one can show that E(f, Θ_κ) ≤ E(f′, Θ_{κ−δ}). So if E(f, Θ_{κ−2δ}, x) ≤ γε and E(f, Θ_κ) > ε, then E(f′, Θ_{κ−δ}, x) ≤ γε and E(f′, Θ_{κ−δ}) > ε. The result now follows by applying Theorem 1, under the criterion Θ_{κ−δ}, to the QL situation A′ obtained from A by replacing F by H and λ by the algorithm returning the approximation f′ of f = λ(x, σ). □

9 Third Example
To estimate Δ^δ_{F+}(m) for a given function class F, a natural strategy is to find a class H such that (1) H is δ-dense in F, and (2) we are able to bound Δ_{H+}(m). In this section, we will estimate Δ^δ_{F+}(m) in the case where F is the function class defined by a quite general network architecture with one hidden layer, using a class covered by Lemma 3 as H. Let F be defined as in Section 7, except that now (1) we allow the activation functions h_i in the hidden nodes to be arbitrary functions, (2) we assume that there is a real constant M such that

Σ_{i=1}^{n} |w_i| ≤ M

for all f ∈ F, and (3) we assume that the activation function σ of the output node satisfies the Lipschitz bound |σ(t_1) − σ(t_2)| ≤ |t_1 − t_2| for all t_1, t_2 ∈ R. Note that as in Section 7, the total number W of parameters in F is given by W = nk + 2n + 1. Combined with Theorem 2, the following lemma yields a generalization bound valid for the class F.
Lemma 4. Let F be a function class of the type described above. Suppose that for each 1 ≤ i ≤ n there is a piecewise linear function g_i with s knots such that |g_i(t) − h_i(t)| ≤ δ/M for all t ∈ R. Then for m ≥ W we have

Δ^δ_{F+}(m) ≤ (em/k)^{(s+1)W}.
Proof. Define ψ: F → Map(X, R) by letting ψ(f) be the function obtained from f by replacing h_i with g_i for all i. Put H = ψ(F). Let f ∈ F have parameter values w, and put

A_i(p) = w_{i0} + Σ_{j=1}^{k} w_{ij} p_j.

By utilizing the Lipschitz bound on σ, we see that for all p ∈ X

|ψ(f)(p) − f(p)| ≤ Σ_{i=1}^{n} |w_i| · |g_i(A_i(p)) − h_i(A_i(p))| ≤ M · (δ/M) = δ.

Hence H is δ-dense in F. By Lemma 3, Δ_{H+}(m) ≤ (em/k)^{(s+1)W}. The result follows by the definition of Δ^δ_{F+}(m). □

10 Noise and General Loss Functions
In this section I will describe how the preceding results can be adapted to situations where a fixed, noiseless target function f0 is not given, or where one works with a "nonsharp" learning criterion that cannot be expressed in terms of a map Θ of the type we have been considering. By a noisy learning situation (NL situation) we will mean an 8-tuple

A = (X, Y, P, F, S, m, λ, ν),

where

X is a set (called the input space),
Y ⊆ R is a set (called the output space),
P is a probability measure on X × Y,
F ⊆ Map(X, Y) is a function class,
S is a set,
m ≥ 1 is an integer,
λ: (X × Y)^m × S → F is a map (called the learning algorithm), and
ν is a probability measure on (X × Y)^m × S such that the marginal of ν on (X × Y)^m is P^m.

We use the letter Z to denote the product X × Y, and we denote the image λ(Z^m × S) by F_λ. Note that F_λ ⊆ F. As before, we write λ(z, σ) as λ_{z,σ}. A map L: R × R → [0, ∞) will be called a loss function provided there is an increasing map μ_L: [0, ∞) → [0, ∞) such that

L(a, b) = μ_L(|a − b|).

Typical examples are L(a, b) = (a − b)² (quadratic loss) and L(a, b) = |a − b| (standard distance loss). We assume in the following that all combinations of loss functions L with NL situations considered are such that for all f ∈ F the function (p, t) ↦ L(t, f(p)) defined on X × Y is measurable, and such that the standard measurability condition needed for the use of Vapnik's theorem below is satisfied (cf. the comments in Section 3). Again, these are mild conditions that can be ignored in practice. Given an NL situation A = (X, Y, P, F, S, m, λ, ν) and a loss function L, for each f ∈ Map(X, Y) and z = (x; y) ∈ Z^m we define the error E(f, L, z) of f with respect to L on z by

E(f, L, z) = (1/m) Σ_{i=1}^{m} L(y_i, f(x_i)).
For each f ∈ Map(X, Y) we define the (global) error E(f, L) of f with respect to L by

E(f, L) = ∫_Z L(t, f(p)) dP(p, t).

Finally, for each t ∈ [0, 1], let

Ω_λ(t, L) = ν{E(λ_{z,σ}, L, z) ≤ t}.

The main difference between a QL situation and an NL situation is that in the latter case the probability distribution P is defined on X × Y instead of on X only. We do not have access to any particular target function f0; instead we are trying to learn an input-output relation on X × Y. Thus the probability distribution P itself plays the role of "target" in an NL situation. The "sharp" loss function L_κ, defined by

L_κ(a, b) = 1 if |a − b| > κ and L_κ(a, b) = 0 if |a − b| ≤ κ,

where κ > 0 is fixed, corresponds to the Θ_κ learning criterion considered in the previous sections. The only difference between the previous setup and the present one is that now the model is designed to treat noisy situations. However, our main result goes through exactly as before:
Theorem 3 (Sharp, noisy learning). Let A = (X, R, P, F, S, m, λ, ν) be an NL situation. Let γ ∈ [0, 1), ε ∈ (0, 1), κ > 0, and δ ≥ 0. Then

ν{ E(λ_{z,σ}, L_{κ−2δ}, z) ≤ γε and E(λ_{z,σ}, L_κ) > ε } ≤ [Δ^δ_{F+}(2m)]² e^{−(1−γ)²εm/4}.

In other words, given that E(λ_{z,σ}, L_{κ−2δ}, z) ≤ γε, the conditional ν-probability that E(λ_{z,σ}, L_κ) > ε is at most [Δ^δ_{F+}(2m)]² e^{−(1−γ)²εm/4} / Ω_λ(γε, L_{κ−2δ}).
Proof. We have Y = R. Assume first δ = 0. Define Θ_κ: F → 𝒫(Z) by Θ_κ(f) = {(p, t) | L_κ(t, f(p)) = 0}. Then a variant of Theorem 1 yields the formulas of Theorem 3 with Δ_{Θκ(F)}(2m) instead of [Δ_{F+}(2m)]² (consult Hole 1995 for details). The proof that Δ_{Θκ(F)}(2m) ≤ [Δ_{F+}(2m)]² in this situation is analogous to the proof of Lemma 1. Define maps θ1, θ2: F → 𝒫(Z) by θ1(f) = {(p, t) ∈ X × Y | t ≥ f(p) − κ} and θ2(f) = {(p, t) ∈ X × Y | t ≤ f(p) + κ}. Then

Θ_κ(f) = θ1(f) ∩ θ2(f), and hence Δ_{Θκ(F)}(x; y) ≤ Δ_{θ1(F)}(x; y) · Δ_{θ2(F)}(x; y),

for all (x; y) ∈ (X × Y)^m. To show that Δ_{θ1(F)}(m) ≤ Δ_{F+}(m), let (x; y) ∈ (X × Y)^m be fixed, and choose a finite set ξ ⊆ F such that Δ_{θ1(ξ)}(x; y) = Δ_{θ1(F)}(x; y). Let

d0 = min{f(x_i) − κ − y_i | f ∈ ξ, 1 ≤ i ≤ m and f(x_i) − κ − y_i > 0}

(with d0 = 1, say, if this set is empty). Define the injection φ: (X × Y)^m → (X × R)^m by φ(x; y)_i = (x_i, y_i + κ + d0). For each f ∈ ξ and 1 ≤ i ≤ m, we then have

z_i ∈ θ1(f) ⟺ y_i ≥ f(x_i) − κ ⟺ f(x_i) < y_i + κ + d0 ⟺ φ(x; y)_i ∉ f+.

So card{φ(x; y) ∩ f+ | f ∈ ξ} = Δ_{θ1(ξ)}(x; y). Thus, since (x; y) was arbitrary, Δ_{θ1(F)}(m) ≤ Δ_{F+}(m). To show Δ_{θ2(F)}(m) ≤ Δ_{F+}(m), let again (x; y) ∈ (X × Y)^m be fixed, and choose a finite set ξ ⊆ F such that Δ_{θ2(ξ)}(x; y) = Δ_{θ2(F)}(x; y). Define the injection φ: (X × Y)^m → (X × R)^m by φ(x; y)_i = (x_i, y_i − κ). For each f ∈ ξ and 1 ≤ i ≤ m, we then have

z_i ∈ θ2(f) ⟺ y_i ≤ f(x_i) + κ ⟺ φ(x; y)_i ∈ f+.

The conclusion Δ_{θ2(F)}(m) ≤ Δ_{F+}(m) follows. Then consider the case δ > 0. Expressing things in terms of the map Θ_κ introduced above, this follows from the case δ = 0 of the theorem by an argument very similar to the proof of Theorem 2. The details are omitted. □

Now let f: X → R be a function, and let L be an arbitrary loss function. For each T ∈ [0, ∞), let f_T^L = {(p, t) ∈ X × R | L(t, f(p)) > T}. If F ⊆ Map(X, R) is a function class, let us define F_L = {f_T^L | f ∈ F and T ∈ [0, ∞)}. We say that a loss function L is c-Lipschitz if there is a c ∈ R such that |μ_L(a) − μ_L(b)| ≤ c|a − b| for all a, b ∈ [0, ∞). If the map μ_L is continuous and strictly increasing, then we call L continuous and strictly increasing (abbreviated CASI) as well.
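For concreteness, the error of a hypothesis under the quadratic, distance, and sharp losses of this section can be computed side by side on a small sample. The pairs (p, t), the hypothesis, and the value of κ below are all invented for illustration.

```python
# empirical error E(f, L, z) = (1/m) * sum_i L(y_i, f(x_i)) on a noisy sample
z = [(0.0, 0.1), (1.0, 0.4), (2.0, 1.2)]   # hypothetical pairs (p, t)
f = lambda p: 0.5 * p

quadratic = lambda a, b: (a - b) ** 2       # CASI, not globally Lipschitz
distance = lambda a, b: abs(a - b)          # CASI and 1-Lipschitz
kappa = 0.15
sharp = lambda a, b: 1.0 if abs(a - b) > kappa else 0.0   # L_kappa, not CASI

def E(f, L, z):
    return sum(L(t, f(p)) for p, t in z) / len(z)

print(E(f, quadratic, z), E(f, distance, z), E(f, sharp, z))
```

The sharp loss records only whether a deviation exceeds κ, while the two CASI losses vary continuously with the size of the deviation; this difference is what makes balancers (introduced below) available for the latter but not, in general, for the former.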
Lemma 5. Let F ⊆ Map(X, R). Assume that F is closed under addition of constant functions, and that L is CASI. Then

Δ_{F_L}(m) ≤ [Δ_{F+}(m)]².

Proof. Write Z = X × R. Let z = (x; y) ∈ Z^m be arbitrary, and choose a finite set ξ ⊆ F such that Δ_{ξ_L}(z) = Δ_{F_L}(z). Let T be a point in the range of μ_L (clearly it is enough to consider such T), let a = μ_L^{−1}(T), and put

b = min{ |y_i − f(x_i)| − a | f ∈ ξ, 1 ≤ i ≤ m and (x_i, y_i) ∈ f_T^L }

(with b = 1, say, if this set is empty), so that b > 0. Put γ_T = a + b/2, and let f^up = f + γ_T and f^down = f − γ_T denote the functions obtained from f by adding the constants γ_T and −γ_T, respectively. For a sample point z_i = (x_i, y_i) we have z_i ∈ f_T^L iff |y_i − f(x_i)| > a, that is, iff y_i > f(x_i) + a or y_i < f(x_i) − a; by the choice of b, this happens iff z_i ∉ (f^up)+ or z_i ∈ (f^down)+. The set z ∩ f_T^L is therefore determined by the pair (z ∩ (f^up)+, z ∩ (f^down)+), so

Δ_{ξ_L}(z) ≤ card{z ∩ f+ | f ∈ F} · card{z ∩ f+ | f ∈ F} ≤ Δ_{F+}(z) · Δ_{F+}(z),

where the first inequality uses that F is closed under addition of constant functions. □
Let A = (X, R, P, F, S, m, λ, ν) be an NL situation. For each P-measurable function φ on Z = X × R, we may consider the L¹ and L² norms of φ, given by

‖φ‖₁ = ∫_Z |φ| dP and ‖φ‖₂ = ( ∫_Z φ² dP )^{1/2}

(allowing the possibility that these might be +∞). If L is a loss function and f ∈ F, let L_f: Z → R denote the function given by L_f(p, t) = L(t, f(p)). Then ‖L_f‖₁ = E(f, L). If G ⊆ Map(X, R) is a function class, let

τ_G = sup{ ‖L_f‖₂ / ‖L_f‖₁ | f ∈ G }.

Note that by Jensen's inequality, τ_G ≥ 1. Let δ ≥ 0. A map ψ: F → Map(X, R) will be called a (δ, L)-balancer for A if (1) |ψ(f)(p) − f(p)| ≤ δ for all f ∈ F and p ∈ X, and (2) τ_{ψ(F_λ)} is finite.

Theorem 4 (Smooth, noisy learning). Let A = (X, R, P, F, S, m, λ, ν) be an NL situation, let L be a CASI loss function, and let ψ be a (δ, L)-balancer for A, where δ ≥ 0. Write τ = τ_{ψ(F_λ)}. Assume that F is closed under addition of constants. Let γ ∈ [0, 1), ε ∈ (0, 1), and m ≥ 4τ⁴/((1 − γ)⁴ε). If δ > 0, assume that L is c-Lipschitz. Then

ν{ E(λ_{z,σ}, L, z) ≤ γε − 2cδ and E(λ_{z,σ}, L) > ε } ≤ 2[Δ_{ψ(F)+}(2m)]² e^{−(1−γ)⁴εm/(4τ⁴)}.

In other words, given that E(λ_{z,σ}, L, z) ≤ γε − 2cδ, the conditional ν-probability that E(λ_{z,σ}, L) > ε is at most 2[Δ_{ψ(F)+}(2m)]² e^{−(1−γ)⁴εm/(4τ⁴)} / Ω_λ(γε − 2cδ, L).
This theorem is proved in Appendix C. The bound of Theorem 4 has the advantage over our previous bounds that the expression in the exponent does not depend on F. What gives Theorem 4 its extra strength is the assumption

There exists a (δ, L)-balancer for A.   (10.1)

It is easy to see that if L is a "sharp" loss function L_κ of the type considered in Theorem 3, then 10.1 does not hold under any reasonably general conditions. For CASI loss functions such as L(a, b) = (a − b)² or L(a, b) = |a − b|, however, the assumption is not so unreasonable. For example, if there is a topology on a class S ⊆ Map(X, R) such that ψ(F_λ) ⊆ S, S is compact, and the real-valued functional

f ↦ ‖L_f‖₂ / ‖L_f‖₁

is continuous on S, then 10.1 holds trivially. We will consider a situation of this kind in the next section. Concerning explicit estimates of τ_{ψ(F_λ)}, assume that we have a map ψ: F → Map(X, R) such that |ψ(f)(p) − f(p)| ≤ δ for all f ∈ F and p ∈ X. For example, if

the random variable X_f(p, t) = t − ψ(f)(p) is normally distributed with zero mean under P on X × R for all f ∈ F_λ,   (10.2)

then easy calculations show that for the loss functions L(a, b) = (a − b)² and L(a, b) = |a − b| we have τ_{ψ(F_λ)} = √3 and τ_{ψ(F_λ)} = √(π/2), respectively. The condition 10.2 may often be a good approximation in practice. Note that 10.2 speaks only about functions f ∈ F_λ, i.e., about functions f ∈ F actually chosen by our learning algorithm λ as a result of a "training process." To assume the condition in 10.2 for all f ∈ F would be very unreasonable. Similar bounds on τ_{ψ(F_λ)} assuming other common distributions for X_f(p, t) (normal with nonzero mean, uniform, Laplacian) can be found in Vapnik (1982). In all these cases the bounds [which are valid for L(a, b) = (a − b)²] are smaller than 5/2, independently of the parameters of the distribution. Comments on these matters may also be found in Vapnik and Bottou (1993).

11 Fourth Example

Consider an NL situation A = (X, Y, P, F, S, m, λ, ν) where Z = X × Y = R^k × R, and where we assume that P is defined by a continuous, nonnegative density function p(p, t) from Z to R with compact support. Then for all P-measurable functions φ we have
∫_Z φ dP = ∫∫ φ(p, t) p(p, t) dp dt,

where dp dt denotes ordinary Lebesgue integration. Let F ⊆ Map(R^k, R) be defined as in Section 7, except for the following: (1) the activation functions h_i of the hidden nodes are allowed to be arbitrary functions, (2) the activation function σ of the output node is the identity, and (3) as parameter space we take a compact subset 𝒲 of R^{nk+2n+1}. Then

M = sup{ Σ_{i=1}^{n} |w_i| | f ∈ F }

is finite. Let L(a, b) = |a − b| be the standard distance loss function, and let the integer s be such that for each 1 ≤ i ≤ n there is a piecewise linear function g_i: R → R having s knots with |g_i(t) − h_i(t)| ≤ δ/M for all t ∈ R, where δ ≥ 0. Define ψ: F → Map(X, Y) by letting ψ(f) be the function obtained from f by replacing h_i by its approximation g_i for 1 ≤ i ≤ n. Then we have the following result:

Corollary 1. Consider the setup described in this section. Then τ := τ_{ψ(F_λ)} is finite. (It depends on P, but P is fixed here.) Let ε ∈ (0, 1), δ ≥ 0, m ≥ 64τ⁴/ε, m ≥ W, where W = nk + 2n + 1 is the total number of parameters in F. Then

ν{ E(λ_{z,σ}, L, z) ≤ ε/2 − 2δ and E(λ_{z,σ}, L) > ε } ≤ 2(2em/k)^{2(s+1)W} e^{−εm/(64τ⁴)}.

In other words, given that E(λ_{z,σ}, L, z) ≤ ε/2 − 2δ, the conditional ν-probability that E(λ_{z,σ}, L) > ε is at most this bound divided by Ω_λ(ε/2 − 2δ, L).

Proof. Let f[w] denote the function in F obtained by using w ∈ 𝒲 as parameter vector. The finiteness of τ follows because the map from 𝒲 to R given by

w ↦ ‖L_{ψ(f[w])}‖₂ / ‖L_{ψ(f[w])}‖₁

is continuous on the compact space 𝒲. Reasoning as in the proof of Lemma 4, we then see that ψ is a (δ, L)-balancer for A. Note that L is CASI and 1-Lipschitz, and that F is closed under addition of constants. Combine Theorem 4 with the bound on Δ_{ψ(F)+}(m) given by Lemma 3, taking γ = 1/2. □
Let us consider the following three special cases. In all three cases we take h_1 = h_2 = … = h_n, i.e., the hidden nodes have a common activation function, which will be denoted by h.

Case 1: h is piecewise linear. In this case, we can take δ as zero and s as the number of knots in h. An easy calculation shows that if

m ≥ [256(s + 1)τ⁴ε⁻¹W] ln[256(s + 1)τ⁴ε⁻¹W],

then the bound in the first statement of the corollary is less than e^{−(s+1)W}. As a particular example, we may take the popular activation function h defined by h(t) = −1 for t < −1, h(t) = t for t ∈ [−1, 1], and h(t) = 1 for t > 1. In this case s = 2.

Case 2: h is the truncated sigmoid given by h(t) = tanh(−a) for t < −a, h(t) = tanh t for t ∈ [−a, a], and h(t) = tanh a for t > a, where a > 0 is fixed. To simplify some estimates, assume ε ≤ M/2, and choose δ = ε/8. Let t_1, t_2 ∈ R with t_2 > t_1, and let ℓ: R → R be the linear function passing through the two points p_i = (t_i, tanh t_i) for i = 1, 2. Assume that the length of the straight line segment of ℓ between p_1 and p_2 is less than or equal to 3√(δ/M). It is easy to check that the graph of tanh t (considered as a curve in R²) has curvature less than 1/2 for all t. Then by comparing to a circular arc of radius 2 (which has constant curvature 1/2), and remembering that (d/dt) tanh t ≤ 1 for all t, it follows that

|ℓ(t) − tanh t| < δ/M

for all t ∈ [t_1, t_2]. Using line segments of this type, we can construct a piecewise linear g such that |g(t) − h(t)| < δ/M for all t. We take g continuous, place all the s knots of g on the graph of h, and put g(t) = tanh a for t ≥ a and g(t) = tanh(−a) for t ≤ −a. The arc length along the graph of h from (−a, tanh(−a)) to (a, tanh a) is clearly less than 2a + 2, so

s + 1 = 2 + (2a + 2)/(3√(δ/M))

is sufficient. Substituting δ = ε/8, the number s + 1 of linear pieces is thus of order (a + 1)√(M/ε), and the inequality of the corollary shows that a training sequence length m of order (a + 1)√(M/ε) ε⁻¹ τ⁴ W, up to logarithmic factors, makes the bound in the first statement of the corollary small.

Case 3: h is the standard sigmoid given by h(t) = tanh t for all t. Again assume ε ≤ M/2, and choose δ = ε/8. As before, we choose g constant on each side of an interval of the type [−a, a]. The only difference between this case and the previous one is that now we need 1 − tanh a = δ/M, i.e., a = (1/2) ln(2M/δ − 1). Thus

s + 1 = 2 + [ln(2M/δ − 1) + 2]/(3√(δ/M))

is sufficient. (Remember that δ/M ≤ 1/16.) In this case a training sequence length m of order ln(M/ε) √(M/ε) ε⁻¹ τ⁴ W, up to logarithmic factors, suffices.

To sum up, the length m of the training sequence x needed to have "high" probability of global error E(λ_{z,σ}, L) < ε, given that E(λ_{z,σ}, L, z) ≤ ε/4, scales at least as well as in the estimates above in each of the three cases. (Note that the estimates depend on P. Note also that in fact we worked with ε/2 instead of ε/4 in case 1.) Thus the scaling laws obtained in the three cases above are all better than the corresponding ones obtained in Haussler (1992).
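The chord construction of Cases 2 and 3 is easy to check numerically. The sketch below interpolates tanh between knots placed on its graph, extends it by constants outside [−a, a], and measures the resulting uniform error. The particular a, s, and grid are invented for illustration; the tolerance 0.02 is safe because the chord error is at most max|tanh″| · h²/8 ≈ 0.01 for this knot spacing h, and the tail error is 1 − tanh(3) < 0.005.

```python
import math

def pl_tanh(a, s):
    # piecewise-linear g with s knots on the graph of tanh over [-a, a],
    # constant tanh(+/- a) outside, as in Cases 2 and 3
    knots = [-a + 2 * a * i / (s - 1) for i in range(s)]
    def g(t):
        if t <= knots[0]:
            return math.tanh(-a)
        if t >= knots[-1]:
            return math.tanh(a)
        for t0, t1 in zip(knots, knots[1:]):
            if t <= t1:
                lam = (t - t0) / (t1 - t0)
                return (1 - lam) * math.tanh(t0) + lam * math.tanh(t1)
    return g

a, s = 3.0, 20
g = pl_tanh(a, s)
err = max(abs(g(i / 100) - math.tanh(i / 100)) for i in range(-600, 601))
print(err < 0.02)
```

Doubling the number of interior knots roughly quarters the interior error, in line with the h² behavior of chord interpolation that drives the √(M/ε) factor in the knot counts above.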
Appendix A: Proof of Lemma 2
By 2.1 of Section 2, it is enough to show that VCdim(F+) ≤ 4n. Since erf is strictly increasing, by the observation of Section 2 it is enough to consider the class G of all functions f: R → R of the form

f(p) = a + Σ_{i=1}^{n} b_i erf(c_i p),

where a, b_1, c_1, …, b_n, c_n ∈ R. Assume that VCdim(G+) = N. Then there exist a sequence (x; y) ∈ (R × R)^N with x_1 < x_2 < … < x_N, and functions f_1, f_2 ∈ G, such that for all 1 ≤ i ≤ N we have f_1(x_i) > y_i for i odd and f_1(x_i) ≤ y_i for i even, while f_2(x_i) ≤ y_i for i odd and f_2(x_i) > y_i for i even. Let g = f_1 − f_2. Then g(x_i) > 0 for i odd, and g(x_i) < 0 for i even. It follows that g has at least N − 1 zeros in the interval (x_1, x_N) ⊆ R. But then the derivative g′ must have at least N − 2 zeros in the same interval. Further, g′ can be written in the form

g′(p) = Σ_{i=1}^{2n} α_i e^{−β_i p²}

with α_i, β_i ∈ R. It is known [see Braess (1986), chapter IV, for instance] that exponential sums of the type

Σ_{i=1}^{r} a_i e^{α_i u}

(with a_i, α_i ∈ R for all i) have at most r − 1 zeros for u ∈ R. Since the map p ↦ p² is at most two to one, it follows that N − 2 ≤ 2(2n − 1), or N ≤ 4n. Thus VCdim(F+) ≤ 4n. □
Appendix B: Proof of Lemma 3

By the observation of Section 2, we may take σ to be the identity. Let (x; y) ∈ (X × R)^m be given, and let β_{i1}, …, β_{is} be the knots of h_i, for each i. For each f ∈ F, consider the sn associated half-spaces H_{f,ir} in R^k consisting of those p ∈ R^k satisfying

w_{i0} + Σ_{j=1}^{k} w_{ij} p_j ≤ β_{ir}

for 1 ≤ i ≤ n and 1 ≤ r ≤ s, where the parameters w correspond to f. Let

Θ_{f,ir} = {μ | x_μ ∈ H_{f,ir}} and Θ_f = {Θ_{f,ir} | 1 ≤ i ≤ n and 1 ≤ r ≤ s}.

Define an equivalence relation ~ on F by f ~ g ⟺ Θ_f = Θ_g. Since for each combination of i and r we have (Cover 1965)

card{Θ_{f,ir} | f ∈ F} ≤ 2(em/k)^k,

it follows that the number K of equivalence classes under ~ satisfies

K ≤ [2(em/k)^k]^{sn}.   (B.1)

Let F_0 be one of the equivalence classes, and pick f_0 ∈ F_0. Define an equivalence relation ≈ on {x_1, …, x_m} by letting x_μ ≈ x_ν iff

{(i, r) | x_μ ∈ H_{f0,ir}} = {(i, r) | x_ν ∈ H_{f0,ir}}.

Let D_1, …, D_N be the equivalence classes under ≈, and let C_ℓ be the convex hull in R^k of the set of points in D_ℓ, for 1 ≤ ℓ ≤ N. Then for each ℓ and i, there exist real numbers a_i^ℓ and b_i^ℓ such that

h_i( w_{i0}⁰ + Σ_j w_{ij}⁰ p_j ) = a_i^ℓ ( w_{i0}⁰ + Σ_j w_{ij}⁰ p_j ) + b_i^ℓ

for all p ∈ C_ℓ, where the parameters w⁰ correspond to f_0. Moreover, the restriction of an arbitrary f ∈ F_0 to C_ℓ can be written

f(p) = w_0 + Σ_{i=1}^{n} w_i [ a_i^ℓ ( w_{i0} + Σ_j w_{ij} p_j ) + b_i^ℓ ]   (B.2)

for the same numbers a_i^ℓ and b_i^ℓ. Equation B.2 can be rewritten in the form

f(p) = Σ_{t=1}^{W} u_t ι_t^ℓ(p),

where the W = nk + 2n + 1 coefficients u_t (namely w_0, the w_i, the products w_i w_{i0}, and the products w_i w_{ij}) depend only on the parameters of f, and where each map ι_t^ℓ: R^k → R (built from the constants a_i^ℓ, b_i^ℓ and the coordinates p_j) depends only on ℓ and t. Define c: C_1 ∪ … ∪ C_N → R^W by putting c(p) = (ι_1^ℓ(p), …, ι_W^ℓ(p)) for p ∈ C_ℓ. Now let f ∈ F_0 have coefficient vector u as above. Then

f(x_μ) ≤ y_μ ⟺ ⟨u, c(x_μ)⟩ ≤ y_μ

for 1 ≤ μ ≤ m. Hence we see that f induces a particular homogeneously linearly separable dichotomy on the set {(c(x_1), y_1), …, (c(x_m), y_m)}. Moreover, this dichotomy uniquely determines (x; y) ∩ f+. Thus, since m ≥ W, it follows from Cover's formula (Cover 1965) that

card{(x; y) ∩ f+ | f ∈ F_0} ≤ (em/W)^W.

But since this estimate is valid for all the equivalence classes, it follows that Δ_{F+}(x; y) ≤ K · (em/W)^W. Substituting B.1, the result follows easily. □
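Cover's counting formula, used twice in the proof above, is simple to state and check directly. The comparison with the (em/W)^W form below uses the standard binomial-sum estimate; the particular (m, W) pairs are arbitrary.

```python
from math import comb, e

def cover(m, d):
    # Cover (1965): number of dichotomies of m points in general position
    # induced by homogeneous linear separations in R^d
    return 2 * sum(comb(m - 1, j) for j in range(d))

for m, W in [(10, 3), (20, 5), (50, 4)]:
    assert cover(m, W) <= (e * m / W) ** W   # consistent with (em/W)^W

print(cover(10, 3))   # 2 * (1 + 9 + 36) = 92
```

The exact count grows only polynomially in m once m exceeds the dimension, which is what makes the per-class estimate in the proof so much smaller than 2^m.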
Appendix C: Proof of Theorem 4

First assume δ = 0. Then ψ is the identity. Using Lemma 5 to replace the VC dimension bounds, it follows from the proof of Theorem 7.6 (last part) in Vapnik (1982) that for all a ≤ 1

ν{ sup_{f ∈ F_λ} [E(f, L) − E(f, L, z)] / ‖L_f‖₂ > a } ≤ r,   (C.1)

where r = c₀[Δ_{F+}(2m)]² e^{−a²m/4} and c₀ is a constant. Vapnik uses the value 8 for c₀; however, Theorem 1 of Section 2 improves the bound of the underlying theorem on uniform convergence by a factor of 4, so we may take c₀ = 2. Assume f ∈ F_λ is such that E(f, L) > ε and E(f, L, z) ≤ γε. Then E(f, L, z) ≤ γ E(f, L), so

[E(f, L) − E(f, L, z)] / ‖L_f‖₂ ≥ (1 − γ) E(f, L) / (τ E(f, L)) = (1 − γ)/τ.

Now use C.1 with a = (1 − γ)²/τ²; the condition τ ≥ 1 ensures that a ≤ 1. To obtain the second statement of the theorem, apply the formula ν(A | B) = ν(A ∩ B)/ν(B) for conditional probabilities, with the obvious choices for A and B.

Assume now δ > 0. Then for all f ∈ F_λ

E(ψ(f), L, z) ≤ E(f, L, z) + cδ.

In the same manner, one can show that E(f, L) ≤ E(ψ(f), L) + cδ. So if E(f, L) > ε and E(f, L, z) ≤ γε − 2cδ, then E(ψ(f), L) > ε − cδ and E(ψ(f), L, z) ≤ γε − cδ ≤ γ(ε − cδ). The result follows by applying the case δ = 0 of the theorem to the NL situation A′ obtained from A by replacing F by ψ(F) and λ by ψ ∘ λ, substituting ε − cδ for ε. □
References
Alon, N., Cesa-Bianchi, N., Ben-David, S., and Haussler, D. 1993. Scale-sensitive dimensions, uniform convergence, and learnability. Proc. of the 34th IEEE Symp. Found. Comp. Sci. 292-301.
Anthony, M., and Bartlett, P. 1995. Function learning from interpolation. In Proceedings of Computational Learning Theory: EUROCOLT 95. Oxford University Press, New York.
Anthony, M., and Shawe-Taylor, J. 1993. A result of Vapnik with applications. Discrete Appl. Math. 47, 207-217.
Anthony, M., Bartlett, P., Ishai, Y., and Shawe-Taylor, J. 1995. Valid generalisation from approximate interpolation. Combinatorics, Probability Comput., in press.
Barron, A. 1993. Approximation and estimation bounds for artificial neural networks. IEEE Trans. Inform. Theory 39, 930-945.
Bartlett, P., and Long, P. 1995. More theorems about scale-sensitive dimensions and learning. In Proceedings of the 8th Annual ACM Conference on Computational Learning Theory. ACM Press, New York.
Bartlett, P., Long, P., and Williamson, R. 1994. Fat-shattering and the learnability of real-valued functions. In Proceedings of the 7th Annual ACM Conference on Computational Learning Theory. ACM Press, New York.
Baum, E., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Braess, D. 1986. Nonlinear Approximation Theory. Springer-Verlag, Berlin.
Cover, T. 1965. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. EC-14, 326-334.
Goldberg, P., and Jerrum, M. 1993. Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. In Proceedings of the 6th Annual ACM Conference on Computational Learning Theory. ACM Press, New York.
Gurvits, L., and Koiran, P. 1995. Approximation and learning of convex superpositions. In Proceedings of Computational Learning Theory: EUROCOLT 95. Oxford University Press, New York.
Haussler, D. 1992. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inform. Comput. 100, 78-150.
Hole, A. 1995. Two variants on a theorem by Vapnik. Preprint Series, Institute of Mathematics, University of Oslo.
Karpinski, M., and Macintyre, A. 1995. Polynomial bounds for VC dimension of sigmoidal neural networks. In Proceedings of the 27th ACM Symp. on Theory of Computing. ACM Press, New York.
Lee, W., Bartlett, P., and Williamson, R. 1995. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Trans. Inform. Theory, in press.
Maass, W. 1993. Bounds for the computational power and learning complexity of analog neural nets. In Proceedings of the 25th ACM Symp. on Theory of Computing. ACM Press, New York.
Pollard, D. 1984. Convergence of Stochastic Processes. Springer-Verlag, New York.
Sontag, E. 1992. Feedforward nets for interpolation and classification. J. Comp. Syst. Sci. 45, 20-48.
Vapnik, V. 1982. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York.
Vapnik, V., and Bottou, L. 1993. Local algorithms for pattern recognition and dependencies estimation. Neural Comp. 5(6), 893-909.
Received July 6, 1995; accepted February 6, 1996.
Communicated by Scott A. Markel
The Error Surface of the Simplest XOR Network Has Only Global Minima
Ida G. Sprinkhuizen-Kuyper and Egbert J. W. Boers
Department of Computer Science, Leiden University, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands
The artificial neural network with one hidden unit and the input units connected to the output unit is considered. It is proven that the error surface of this network for the patterns of the XOR problem has minimum values with zero error and that all other stationary points of the error surface are saddle points. Also, the volume of the regions in weight space with saddle points is zero; hence, when training this network on the four patterns of the XOR problem using, e.g., backpropagation with momentum, the correct solution with error zero will be reached in the limit with probability one.

1 Introduction
This paper studies the representation and learning aspects of the simplest feedforward artificial neural network with sigmoid transfer functions that can represent the logical exclusive OR (XOR) function. This network consists of one hidden unit and has connections from the input units to the output unit (see Fig. 1a). The motivation to study this simple network and the XOR function is partly historical. Since Minsky and Papert (1969) wrote their Perceptrons, the XOR problem has continued to be one of the most frequently used examples of a function needing a hidden layer for representation. Given the number of papers using the XOR problem as an example, it may seem strange that up to now no complete analytical treatise has appeared. The complexity of even the smallest network capable of representing this function is so large that we suspect that analytical solutions of larger, more complex networks will not be feasible. Some general conclusions can, however, be drawn, generalizing the results of this small example. The error of a network is here defined as the difference, in a least squares sense, between the output calculated by the network and the desired output. The error of a network depends on its weights and the training patterns. With a fixed training set the error is a function of the weights: the error surface. The backpropagation algorithm reduces the error in the output by changing the weights, which are randomly

Neural Computation 8, 1301-1320 (1996) © 1996 Massachusetts Institute of Technology
Figure 1: The simplest XOR network (a) and one with two hidden nodes (b).

initialized, in the direction opposite to the gradient of the error with respect to the weights, and it stops when the gradient is zero. A distinction can be made between batch learning and on-line learning. During batch learning, the weights are updated after seeing the whole training set. The errors of the individual training samples are summed to the total error. During on-line learning, the weights are corrected after each sample, with respect to the error for the sample just seen by the network.

1.1 Representation. First we looked at the representational power of the simplest XOR network. It is well known that this network with a threshold transfer function can represent the XOR function and that such a network with a sigmoid transfer function can approximate a solution of the XOR function. In this paper (Section 3) we will show that such a network with a sigmoid transfer function can represent the XOR function exactly if TRUE = 0.9 and FALSE = 0.1 for the output unit.¹ This result is not trivial, since for a one-layer network² for the AND function, it is possible to find an approximate representation, but it is not possible to solve the AND function exactly, using a sigmoid transfer function.
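The batch and on-line update rules just described can be sketched in a few lines of Python. This is an illustrative sketch of ours, not code from the paper; it uses the weight names u0, u1, u2, v, w0, w1, w2 that Section 2 introduces, the sigmoid transfer function, and the targets 0.1/0.9.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w, x1, x2):
    """Output of the simplest XOR network: one hidden unit plus direct
    input-to-output connections (cf. equation 2.1)."""
    u0, u1, u2, v, w0, w1, w2 = w
    h = sigmoid(w0 + w1 * x1 + w2 * x2)                 # hidden unit
    return sigmoid(u0 + u1 * x1 + u2 * x2 + v * h)      # output unit

# the four XOR patterns with targets 0.1 / 0.9 as in the paper
PATTERNS = [((0, 0), 0.1), ((0, 1), 0.9), ((1, 0), 0.9), ((1, 1), 0.1)]

def batch_error(w):
    """Quadratic error E of equation 2.2 with all a_ij = 1."""
    return sum(0.5 * (forward(w, x1, x2) - t) ** 2 for (x1, x2), t in PATTERNS)

def pattern_grad(w, x1, x2, t):
    """Gradient of the single-pattern error (backpropagation by hand)."""
    u0, u1, u2, v, w0, w1, w2 = w
    h = sigmoid(w0 + w1 * x1 + w2 * x2)
    y = sigmoid(u0 + u1 * x1 + u2 * x2 + v * h)
    dy = (y - t) * y * (1 - y)          # delta at the output unit
    dh = dy * v * h * (1 - h)           # delta at the hidden unit
    return [dy, dy * x1, dy * x2, dy * h, dh, dh * x1, dh * x2]

def batch_step(w, lr):
    """Batch learning: sum the gradients over the whole training set,
    then change the weights once."""
    g = [0.0] * 7
    for (x1, x2), t in PATTERNS:
        for i, gi in enumerate(pattern_grad(w, x1, x2, t)):
            g[i] += gi
    return [wi - lr * gi for wi, gi in zip(w, g)]

def online_epoch(w, lr):
    """On-line learning: change the weights after every single pattern."""
    for (x1, x2), t in PATTERNS:
        w = [wi - lr * gi for wi, gi in zip(w, pattern_grad(w, x1, x2, t))]
    return w
```

With a small learning rate, one batch step and one on-line epoch both decrease the total error at a generic point in weight space; the difference between the two regimes matters only near stationary points, as discussed later in the paper.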
1.2 Learning. When we assume that some kind of gradient-based learning algorithm is used, the shape of the error surface is very important for the ability of the network to learn the desired function. The ideal error surface has one minimum corresponding to an acceptable solution with error zero, and in each other point in weight space a nonzero gradient. With such an error surface each gradient-based learning algorithm will approximate the minimum and find a reasonable solution. However, if the error surface has so-called local minima,³ then the learning algorithm can wind up in such a local minimum and reach a suboptimal solution. From experiments by Rumelhart et al. (1986) it seems that the simplest XOR network does not have local minima, in contrast to the XOR network with two hidden units (see Fig. 1b). The problem of whether an error surface for a certain network that has to solve a certain problem has local minima or not, and if they exist, how to avoid them, has been investigated by many researchers (e.g., Gorse et al. 1993; Lisboa and Perantonis 1991; Rumelhart et al. 1986). Most researchers did numerical experiments, which gave a strong intuitive feeling of the existence of local minima, but not a real proof. Lisboa and Perantonis (1991), for example, found analytically all stationary points for both networks of Figure 1, using a logarithmic error function. They claimed a local minimum for the XOR network of Figure 1b, with the weights from the hidden units to the output unit equal to zero, while in the Appendix we provide a proof that such a point is a saddlepoint and not a local minimum. Blum (1989) states that the same network with the weights restricted to be symmetrical has a manifold of local minima. The same techniques as used in this paper prove that the points of the given manifold are saddlepoints and not local minima.⁴ In contrast to Lisboa and Perantonis (1991), who suggest that the simplest XOR network has local minima, this paper will analytically prove that the error surface of the simplest XOR network has no local minima. The global minimum, with zero error, is not a strict minimum, since three-dimensional regions in weight space exist with zero error. All points in a neighborhood of each point in this region have error values that are not less than the error in that point. In a strict minimum, however, all points in a neighborhood should give error values larger than the error value in that point.

¹The values 0.9 and 0.1 are used, but all values 1 − ε and ε, for some small positive number ε, can also be used.
²We do not count the input as a layer of the network.
There exist more stationary points (i.e., points where the gradient of the error is zero), but we were able to prove that these points are saddlepoints. Saddlepoints are stationary points where in each neighborhood both points with larger error values and points with smaller error values can be found. We also proved that the global minimum contains the only points with a gradient equal to zero for the error of all patterns individually. We call such a point a stable stationary point. The saddlepoints have a zero gradient for the error of a fixed training set of patterns, but not for the error of the patterns individually, so on-line learning can probably escape from these points. The results of the analysis of the neighborhood of these saddlepoints give valuable information that can be used to explain the behavior of
³By definition, a global minimum is also a local minimum. However, when speaking about local minima we mean here and in the rest of the paper those local minima that are not global minima.
⁴Even with the symmetric restrictions! A sketch of the proof is given in the Appendix.
Figure 2: The simplest XOR network.

learning algorithms and to design learning algorithms that can escape from these saddlepoints. The remainder of the paper consists of the following sections: In Section 2 the XOR problem and the network that is used to implement it are given. In Section 3 it is proven that three-dimensional regions in weight space exist with zero error. In Section 4 it is proven that all stable stationary points with nonzero error are saddlepoints for finite values of the weights and are either saddlepoints or local maxima for infinite values of the weights. Section 5 consists of the proof that all unstable stationary points are saddlepoints. Finally, Section 6 contains our conclusions. An Appendix is added with some more theorems and proofs used in the paper.

2 The XOR Problem and the Simplest Network Solving It
The network in Figure 2 with one hidden unit H is studied. This network consists of one threshold unit X0, with constant value 1, two inputs X1 and X2, one hidden unit H, and the output unit Y. The seven weights are labeled w0, w1, w2, u0, u1, u2, and v (see Fig. 2). If each unit uses a sigmoid transfer function f (the commonly used transfer function f(x) = 1/(1 + e^(−x)) is discussed at the end of this section), the output of this network is, as a function of the inputs X1 and X2:

y(X1, X2) = f[u0 + u1 X1 + u2 X2 + v f(w0 + w1 X1 + w2 X2)]   (2.1)
The patterns for the XOR problem that have to be learned are P_X1X2 = [(X1, X2), t_X1X2] with input (X1, X2) and desired output t_X1X2. For X1, X2 ∈ {0, 1} the desired outputs are t00 = t11 = 0.1 and t01 = t10 = 0.9. The error E of the network when training on a training set containing a_ij times
Figure 3: The transfer function and its derivatives.

the pattern P_ij, a_ij > 0, i, j ∈ {0, 1}, is

E = (1/2) a00 [y(0, 0) − 0.1]² + (1/2) a01 [y(0, 1) − 0.9]²
  + (1/2) a10 [y(1, 0) − 0.9]² + (1/2) a11 [y(1, 1) − 0.1]²   (2.2)
In the remainder of this paper it is assumed that a_ij = 1, i, j ∈ {0, 1}. All proofs that stationary points are saddlepoints do not depend on the values of a_ij. Only the error levels corresponding to these stationary points depend on the explicit values of the a_ij's. The transfer function used is f(x) = 1/(1 + e^(−x)). Figure 3 shows the shape of f, f', and f''. On the interval (−∞, ∞) this function is strictly monotonously increasing from 0 to 1. In this paper we will use that 0 < f'(x), lim_{x→±∞} f'(x) = 0, f''(x) = 0 ⇔ x = 0, and f'''(0) ≠ 0, and the properties

f(−x) = 1 − f(x)   (2.3)
f'(−x) = f'(x)   (2.4)
f'(x) = f(x)[1 − f(x)]   (2.5)

and that the function f has an inverse function:

f⁻¹(x) = ln[x / (1 − x)]   if 0 < x < 1   (2.6)
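The properties of f listed above are standard sigmoid identities and can be checked numerically. The following sketch (ours, not from the paper) verifies them by finite differences, together with the inverse function of equation 2.6.

```python
import math

def f(x):
    # the sigmoid transfer function used throughout the paper
    return 1.0 / (1.0 + math.exp(-x))

def f_inv(x):
    # inverse of the sigmoid (equation 2.6), defined for 0 < x < 1
    return math.log(x / (1.0 - x))

def num_diff(g, x, h=1e-6):
    # symmetric finite difference, used only to cross-check the identities
    return (g(x + h) - g(x - h)) / (2.0 * h)

for x in [-2.0, -0.5, 0.0, 0.7, 3.0]:
    assert abs(f(-x) - (1.0 - f(x))) < 1e-12                  # f(-x) = 1 - f(x)
    assert abs(num_diff(f, -x) - num_diff(f, x)) < 1e-6       # f'(-x) = f'(x)
    assert abs(num_diff(f, x) - f(x) * (1.0 - f(x))) < 1e-6   # f' = f(1 - f)
    assert abs(f_inv(f(x)) - x) < 1e-9                        # inverse relation

# the two constants used repeatedly in Section 3
assert abs(f_inv(0.1) + 2.197) < 1e-3    # f^-1(0.1) = ln(1/9) = -2.197...
assert abs(f_inv(0.9) - 2.197) < 1e-3
```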
3 The Minimum E = 0 Can Occur
The error E consists of four quadratic terms, so E = 0 holds only if all terms are zero. We will distinguish two kinds of minima for the error E:
- Minima remaining stable during on-line learning independent of the chosen training sequence; these minima have the property that no pattern will lead to an error that can be decreased by a local change of the weights. These minima will be called stable minima.
- Minima that are not stable during on-line learning, but are minima for batch learning. During on-line learning the weights will continue to change in a neighborhood of such a minimum, since it is not a minimum for all patterns separately. These minima will be called unstable minima.
If E is equal to zero for all patterns that are in the training set, given a certain set of weights, a stable minimum is found. E can become equal to zero if and only if values of the weights exist such that all four terms in equation 2.2 are equal to zero, resulting in four equations. Application of the inverse function f⁻¹ (see 2.6) on both sides of these equations leads to

u0 + v f(w0) = f⁻¹(0.1) ≈ −2.197
u0 + u2 + v f(w0 + w2) = f⁻¹(0.9) ≈ 2.197
u0 + u1 + v f(w0 + w1) = f⁻¹(0.9) ≈ 2.197
u0 + u1 + u2 + v f(w0 + w1 + w2) = f⁻¹(0.1) ≈ −2.197   (3.1)

The equations 3.1 are linear in the variables u0, u1, u2, and v. The determinant of this set of equations is equal to

f(w0) − f(w0 + w1) − f(w0 + w2) + f(w0 + w1 + w2)   (3.2)
So for each combination of values of the weights w0, w1, and w2 with this determinant unequal to zero, unique values of the other weights u0, u1, u2, and v can be found such that all equations of 3.1 hold. Investigation of the equation

f(w0) − f(w0 + w1) − f(w0 + w2) + f(w0 + w1 + w2) = 0   (3.3)
shows that this equation is equivalent to

e^(w0) (1 − e^(w1)) (1 − e^(w2)) (1 − e^(2w0 + w1 + w2)) / [(1 + e^(w0))(1 + e^(w0 + w1))(1 + e^(w0 + w2))(1 + e^(w0 + w1 + w2))] = 0   (3.4)

Since e^(w0) > 0 and the denominator is positive, equation 3.4 and thus 3.3 have the solutions

w1 = 0   or   w2 = 0   or   2w0 + w1 + w2 = 0   (3.5)
Since equations 3.1 are uniquely solvable for all values of w0, w1, and w2 that are not on the hyperplanes given in 3.5, we will find three-dimensional regions in the seven-dimensional weight space where E = 0. The region where E = 0 is a global minimum, since E ≥ 0 holds for all points, E being a sum of nonnegative quadratic terms. Since the dimension of the region where E = 0 is higher than zero, it is clear that the minimum value E = 0 cannot be a strict minimum, and there are always points in a neighborhood of a point with E = 0 where the error is also equal to zero.
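This construction is easy to verify numerically. The sketch below (our own illustration, with the generic weight choice w0 = 0.5, w1 = 1.0, w2 = −1.0 as an arbitrary off-hyperplane point) solves the linear system 3.1 by elimination, and checks both that the determinant 3.2 vanishes exactly on the hyperplanes 3.5 and that the resulting seven weights give all four outputs their target values.

```python
import math

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

def f_inv(x):
    return math.log(x / (1.0 - x))

def det32(w0, w1, w2):
    """Determinant (3.2) of the linear system (3.1) in u0, u1, u2, v."""
    return f(w0) - f(w0 + w1) - f(w0 + w2) + f(w0 + w1 + w2)

def solve_31(w0, w1, w2):
    """Solve equations (3.1) for u0, u1, u2, v by elimination,
    assuming the determinant (3.2) is nonzero."""
    c1, c2 = f_inv(0.1), f_inv(0.9)           # right-hand sides of (3.1)
    A, B = f(w0), f(w0 + w1)
    C, D = f(w0 + w2), f(w0 + w1 + w2)
    v = 2.0 * (c1 - c2) / (A - B - C + D)     # from (4) - (3) - (2) + (1)
    u1 = (c2 - c1) - v * (B - A)              # from (3) - (1)
    u2 = (c2 - c1) - v * (C - A)              # from (2) - (1)
    u0 = c1 - v * A                           # from (1)
    return u0, u1, u2, v

def output(u0, u1, u2, v, w0, w1, w2, x1, x2):
    # equation (2.1)
    return f(u0 + u1 * x1 + u2 * x2 + v * f(w0 + w1 * x1 + w2 * x2))

# the determinant vanishes exactly on the hyperplanes (3.5) ...
assert abs(det32(0.7, 0.0, 1.3)) < 1e-12      # w1 = 0
assert abs(det32(0.7, 1.3, 0.0)) < 1e-12      # w2 = 0
assert abs(det32(0.5, -0.3, -0.7)) < 1e-12    # 2*w0 + w1 + w2 = 0
# ... and away from them the system (3.1) yields weights with E = 0
w0, w1, w2 = 0.5, 1.0, -1.0
assert abs(det32(w0, w1, w2)) > 1e-3
u0, u1, u2, v = solve_31(w0, w1, w2)
targets = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.1}
for (x1, x2), t in targets.items():
    assert abs(output(u0, u1, u2, v, w0, w1, w2, x1, x2) - t) < 1e-9
```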
4 The Minimum E = 0 Is the Unique Stable Minimum
To obtain a stable minimum, it is necessary that the gradient of the error for each pattern is zero. Writing R_X1X2 for the terms depending on pattern P_X1X2, we obtain

∂E/∂u0 = R00 + R01 + R10 + R11   (4.1)

with

R_X1X2 = (f[u0 + u1 X1 + u2 X2 + v f(w0 + w1 X1 + w2 X2)] − t_X1X2)
         × f'[u0 + u1 X1 + u2 X2 + v f(w0 + w1 X1 + w2 X2)]   (4.2)
The derivative ∂E/∂u0 is equal to zero for each pattern in the training set if

R00 = R01 = R10 = R11 = 0   (4.3)
So all stable stationary points satisfy 4.3. The condition 4.3 is also a sufficient condition for a stable stationary point, since if it holds then the partial derivatives of E with respect to the other weights will be zero too. Clearly the points satisfying 3.1, i.e., the points with E = 0, are stable stationary points. Other stable stationary points can be found when one or more of the arguments of the derivatives of the transfer function in 4.3 (see also 4.2) approach ±∞. The corresponding outputs go to zero or one. Let us consider

y(X1, X2) = f(q_X1X2),   q_X1X2 = u0 + u1 X1 + u2 X2 + v f(w0 + w1 X1 + w2 X2)   (4.4)

with one or more of the terms q_X1X2 in the neighborhood of plus or minus infinity. From these equations it follows that

f⁻¹[y(X1, X2)] = u0 + u1 X1 + u2 X2 + v f(w0 + w1 X1 + w2 X2)   (4.5)

If the determinant 3.2 is unequal to zero, it is possible to move the patterns one by one to their desired value by altering v such that the right-hand side of 4.5 moves in the right direction for the considered pattern, while altering u0, u1, and u2 correspondingly to keep the output of the other three patterns constant. If the determinant 3.2 is equal to zero, then q00 − q01 − q10 + q11 = 0 and at least two patterns have a q_ij in the neighborhood of plus or minus infinity for the considered stationary points. The weights u0, u1, and u2 can be used to decrease the error of two patterns at the same time. For example, if q00 and q01 are both in the neighborhood of plus infinity, decreasing u0 and increasing u1, keeping u2 and u0 + u1 constant, results
in a path with decreasing error, moving q00 and q01 away from infinity. Other combinations can be treated similarly. So all stationary points with one or more patterns having an output equal to 0 or 1 are not local minima. The stationary points where all four patterns give an output 0 or 1 can be reached only via a path with increasing error: these points are (local) maxima. Conclusion: The unique stable minima for the considered network for the XOR problem are the three-dimensional regions in weight space with E = 0. The unstable stationary points are treated in the next section.

5 All Unstable Stationary Points Are Saddlepoints
Here all points in weight space with ∇E = 0 not treated in the previous section are investigated. The components of ∇E are 4.1 and

∂E/∂u1 = R10 + R11   (5.1)
∂E/∂u2 = R01 + R11   (5.2)
∂E/∂v = R00 f(w0) + R01 f(w0 + w2) + R10 f(w0 + w1) + R11 f(w0 + w1 + w2)   (5.3)
∂E/∂w1 = v [R10 f'(w0 + w1) + R11 f'(w0 + w1 + w2)]   (5.4)
∂E/∂w2 = v [R01 f'(w0 + w2) + R11 f'(w0 + w1 + w2)]   (5.5)
∂E/∂w0 = v [R00 f'(w0) + R01 f'(w0 + w2) + R10 f'(w0 + w1) + R11 f'(w0 + w1 + w2)]   (5.6)
5.1 The Case v = 0. If v = 0 and R00 ≠ 0, equations 4.1 and 5.1 to 5.6 are equivalent to

R00 = −R01 = −R10 = R11 ≠ 0   (5.7)

and 3.3, and it follows from equation 2.3 that equation 5.7 is equivalent to

R00 = [f(u0) − 0.1] f'(u0)
    = [f(−u0 − u1) − 0.1] f'(−u0 − u1)
    = [f(−u0 − u2) − 0.1] f'(−u0 − u2)
    = [f(u0 + u1 + u2) − 0.1] f'(u0 + u1 + u2) ≠ 0
In Theorem A.1 of the Appendix we derive that this equation has exactly nine solutions for u0, u1, and u2. There are three possible error levels: 0.32, 0.407392, and 0.403321. From Theorem A.1 it is also clear that R00 > 0, and from 3.1 it follows that E = 0 cannot occur if v = 0. Let us consider the partial derivatives of the error with respect to v, w1, and w2 in the stationary points with v = 0. Considering ∂E/∂w1 and ∂E/∂w2 (equations 5.4 and 5.5), it is clear that each term contains a factor v, which will not disappear by taking the partial derivative with respect to w1 or w2 again. Thus also ∂^(i+j)E/∂w1^i ∂w2^j = 0 if i + j > 0. Computation of some partial derivatives of E, using equation 5.7, shows that at least one of these terms is unequal to zero, so Theorems A.2, A.3, and A.4 prove that all stationary points with v = 0 are saddlepoints. Figure 4 shows that indeed the error surface behaves as a saddlepoint when, in a neighborhood of the point with all weights zero, the weights w0, w1, w2, and v are varied such that Δw0 = Δw1 = Δw2 and Δv is very small with respect to Δw. Thus we have proven the following theorem:
Theorem 5.1. If v = 0, then all points where ∇E = 0 are saddlepoints.

5.2 The Case v Unequal to Zero. If v ≠ 0, equations 4.1 and 5.1 to 5.6 are equivalent to 5.7, 3.3, and 5.9.
Substituting the solutions of equation 3.3, given by 3.5, in equation 5.9 and applying the relation 2.5 results in the following four cases satisfying both 5.9 and 3.3:

Case 1: w0 = w1 = w2 = 0
Case 2: w1 = w2 = 0, w0 ≠ 0
Case 3: w2 = 0, w1 = −2w0, w0 ≠ 0
Case 4: w1 = 0, w2 = −2w0, w0 ≠ 0
Figure 4: The error surface in the neighborhood of u0 = u1 = u2 = w0 = w1 = w2 = v = 0. This picture is obtained by varying w0, w1, and w2 equally from −0.5 to 0.5 and v from −0.0005 to 0.0005.
We will show that the stationary points of the first three cases are saddlepoints. The fourth case follows directly from the third case by using the symmetry in w1 and w2. For the points with w0 = w1 = w2 = u1 = u2 = 0 and u0 = −v f(0) belonging to case 1, the second-order part of the Taylor series expansion of the error E is given by equation 5.10. This second-order part contains three quadratic terms, but the Hessian is not positive definite. Inspired by 5.10, we investigated the error surface for all stationary points of cases 1 to 3 in directions such that Δu1 + v f'(w0) Δw1 = 0, Δu2 + v f'(w0) Δw2 = 0, and Δu0 + v f'(w0) Δw0 = 0. Parameterizing these directions with x, y, and z such that Δw0 = x, Δw1 = y + z, Δw2 = −x − y + z, Δu0 = αx, Δu1 = αy + αz, and Δu2 = −αx − αy + αz, with α = −v f'(w0), gives the following expression for the error:

E = (1/2) {f[u0 + αx + v f(w0 + x)] − 0.1}²
  + (1/2) {f[u0 + u2 − αy + αz + v f(w0 + w2 − y + z)] − 0.9}²
  + (1/2) {f[u0 + u1 + αx + αy + αz + v f(w0 + w1 + x + y + z)] − 0.9}²
  + (1/2) {f[u0 + u1 + u2 + 2αz + v f(w0 + w1 + w2 + 2z)] − 0.1}²   (5.11)
For case 1 (w0 = w1 = w2 = 0), calculation of partial derivatives of E with respect to z using equation 5.7 leads to ∂²E/∂z² = 0 and ∂³E/∂z³ = 6 R00 v f'''(0) ≠ 0 for x = y = 0, and thus the stationary points of case 1 are saddlepoints. One of the saddlepoints of case 1 is shown in Figure 5. Figure 6 shows that consideration of the error surface in the direction of each of the weights could suggest that such a point is a local minimum. So it is essential to vary the weights in the right combination in order to be able to visualize that this point is a saddlepoint. For case 2 (w1 = w2 = 0, w0 ≠ 0), calculation of partial derivatives of E with respect to y and z leads to ∂²E/∂y² = −∂²E/∂z² = −2 R00 v f''(w0) for x = y = z = 0, which is unequal to zero, so either the second-order partial derivative with respect to y or that with respect to z is negative, while the other is positive, and thus also the points of case 2 are saddlepoints. For case 3 (w2 = 0, w1 = −2w0, w0 ≠ 0), calculation of partial derivatives of E with respect to x and z leads to ∂²E/∂z² = −2 ∂²E/∂x² = −4 R00 v f''(w0) for x = y = z = 0, and thus also the points of case 3 are saddlepoints. Thus also the case v ≠ 0 will not result in local minima, and we have proven the following theorem:
Theorem 5.2. If E ≠ 0 and v ≠ 0, then all points where ∇E = 0 are saddlepoints.
6 Conclusions
The error surface of the network with one hidden unit for the XOR function has no local minima, only one global minimum with zero error. This minimum value is realized in three-dimensional regions of the seven-dimensional weight space. Also a number of two-dimensional regions exist where the error surface behaves as a saddlepoint. The levels of the error surface in the saddlepoints are 0.32, 0.407392, and 0.403321, respectively, for a training set with exactly one example of each pattern. When training is started with small weights, only a saddlepoint with error level 0.32 can possibly be reached. The probability that a learning process will start in a saddlepoint is (theoretically) zero, since the dimension of
Figure 5: The saddlepoint in the neighborhood of u0 = −f(0), u1 = u2 = 0, w0 = w1 = w2 = 0, and v = 1. This picture is obtained by plotting the error against Δu1 = Δu2 = −f'(0) Δw1 = −f'(0) Δw2 and Δu0 = −f'(0) Δw0. The weight w1 runs from −0.1 to 0.1 and the weight w0 runs from −0.02 to 0.02.
the region consisting of saddlepoints is 2, so its volume as part of the seven-dimensional weight space is zero. A "clever" learning algorithm can escape from a saddlepoint by considering the higher-order partial derivatives. Considering the backpropagation learning algorithm, we remark that a batch learning process without momentum can wind up in a saddlepoint, but an on-line learning process can probably escape from such a point, since the error surface is not horizontal for each individual pattern; only the average error surface for all patterns is horizontal. So a small change of the weights in the right direction will decrease the error, moving away from the saddlepoint. We did some experiments starting on-line learning exactly in the saddlepoint with all weights equal to zero and found that even with a small value of the learning parameter (0.01) and no momentum term, the learning algorithm escaped from the saddlepoint and reached a solution with (almost) zero error in finite time. We also started an on-line learning process many times with the weights randomly chosen from the interval [−3, 3] with learning parameter 0.8 and momentum 0.9 and always found convergence to a solution with error zero. We repeated this experiment
Figure 6: The error as function of each of the weights in the neighborhood of u0 = −0.5, u1 = u2 = 0, w0 = w1 = w2 = 0, and v = 1. This picture gives the (false) impression that the error has a local minimum if u0 = −0.5, u1 = u2 = 0, w0 = w1 = w2 = 0, and v = 1. Figure 5 showed already that this point is a saddlepoint.

starting from random points chosen in the regions of saddlepoints as found in this paper. In this experiment as well, the algorithm converged in all cases. So we believe that on-line backpropagation is able to escape from the saddlepoints. It is a matter for further research to see if it can be proven that (or when) on-line backpropagation can escape from saddlepoints. In this paper a distinction is made between stable minima (minima for each pattern) and unstable minima (minima for a training set of patterns, but not for each pattern separately). This distinction is relevant, since if an exact solution can be represented by the network, then only the absolute minima with E = 0 are stable minima and all other (local) minima are unstable. The fact that all local minima are unstable can be exploited by the learning algorithm to escape from these minima. However, it is more attractive to have an architecture of the network such that no local minima occur at all, as is the case for the network studied in this paper for the XOR problem. As is shown by Lisboa and Perantonis (1991), the direct connections from the inputs to the output are important for getting a good architecture for learning the XOR problem. This can be extended to modular network architectures for more difficult problems (see, e.g., Boers et al. 1995). Finding the right architecture and learning algorithm for a problem will remain an important domain of neural network research.
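The escape experiment reported in this section can be sketched as follows. This is an illustrative simulation of ours, not the authors' original code: to keep the run short it uses a learning rate of 0.5 instead of the paper's 0.01, and it only checks that on-line learning leaves the 0.32 error level of the all-weights-zero saddlepoint.

```python
import math

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

PATTERNS = [((0, 0), 0.1), ((0, 1), 0.9), ((1, 0), 0.9), ((1, 1), 0.1)]

def forward(w, x1, x2):
    u0, u1, u2, v, w0, w1, w2 = w
    return f(u0 + u1 * x1 + u2 * x2 + v * f(w0 + w1 * x1 + w2 * x2))

def error(w):
    # quadratic error (2.2) with a_ij = 1
    return sum(0.5 * (forward(w, x1, x2) - t) ** 2 for (x1, x2), t in PATTERNS)

def online_step(w, x1, x2, t, lr):
    """One on-line update on a single pattern (plain backpropagation)."""
    u0, u1, u2, v, w0, w1, w2 = w
    h = f(w0 + w1 * x1 + w2 * x2)
    y = f(u0 + u1 * x1 + u2 * x2 + v * h)
    dy = (y - t) * y * (1 - y)
    dh = dy * v * h * (1 - h)
    g = [dy, dy * x1, dy * x2, dy * h, dh, dh * x1, dh * x2]
    return [wi - lr * gi for wi, gi in zip(w, g)]

# start exactly in the saddlepoint with all weights zero (error level 0.32)
w = [0.0] * 7
assert abs(error(w) - 0.32) < 1e-12
lr = 0.5            # larger than the paper's 0.01, only to shorten the run
for epoch in range(20000):
    for (x1, x2), t in PATTERNS:
        w = online_step(w, x1, x2, t, lr)

# the sequential per-pattern updates break the symmetry and leave the saddle
assert error(w) < 0.3
```

In our runs the error typically continues toward zero, in line with the authors' observation that on-line backpropagation escapes these saddlepoints; a batch version of the same loop stays exactly in the stationary point.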
Table 1: Solutions of Equation A.1.

a       b       c       −a−b    −a−c    a+b+c
0       0       0       0       0       0
P_i     −2P_i   −2P_i   P_i     P_i     −3P_i
P_i     −2P_i   2P_i    P_i     −3P_i   P_i
P_i     2P_i    −2P_i   −3P_i   P_i     P_i
−3P_i   2P_i    2P_i    P_i     P_i     P_i
In this paper we used the quadratic error function. In the literature (e.g., Lisboa and Perantonis 1991) the error function

E' = −Σ_μ [t^μ ln y^μ + (1 − t^μ) ln(1 − y^μ)]   (6.1)

is also used, where μ is the index of the pattern, y^μ the output of the network, and t^μ the desired output. The stationary points for the error function E' are a subset of those for the quadratic error function. All computations needed to prove that the stationary points with the quadratic error unequal to zero are saddlepoints also hold when using E'. The only difference in the computations is that in the coefficients R_ij the factor containing the derivative of the transfer function disappears. A consequence of this alteration is that the equation R00 = −R01 = −R10 = R11 for ∇E = 0 has exactly one solution and not nine as in the case considered here.
Appendix: Some Proofs and Theorems

Appendix A.1 A Result on Error Levels of the Saddlepoints. The coefficients R00, R01, R10, and R11 as defined in equation 4.2 have the form g(x) = [f(x) − 0.1] f'(x), with x some function of the weights. Carefully considering the cases where ∇E = 0 makes clear that in all these cases equation 5.7 results in

g(a) = g(−a − b) = g(−a − c) = g(a + b + c)   (A.1)

with a, b, and c functions of the weights. Investigating this equation a bit deeper, we derived the following theorem:

Theorem A.1. Let g(x) = [f(x) − 0.1] f'(x), and let P1 ≈ −1.16139 and P2 ≈ −1.96745 be the nonzero solutions of the equation h2(x) = g(x) − g(−3x) = 0; then the set of equations A.1 has nine solutions, which are given in Table 1 (P_i stands for P1 and P2, respectively). For all solutions g(a) ∈ {g(0), g(P1), g(P2)} = {0.1, 0.025132, 0.0024389} holds.
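The constants P1 and P2 of Theorem A.1 can be recomputed numerically. The sketch below is ours; the brackets for the bisection were chosen by inspecting the sign of h2 around the quoted roots.

```python
import math

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

def fprime(x):
    y = f(x)
    return y * (1.0 - y)

def g(x):
    # g(x) = [f(x) - 0.1] f'(x), as in Theorem A.1
    return (f(x) - 0.1) * fprime(x)

def h2(x):
    # h2(x) = g(x) - g(-3x); its nonzero roots are P1 and P2
    return g(x) - g(-3.0 * x)

def bisect(fun, lo, hi, tol=1e-12):
    """Plain bisection; assumes fun(lo) and fun(hi) have opposite signs."""
    flo = fun(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (fun(mid) > 0) == (flo > 0):
            lo, flo = mid, fun(mid)
        else:
            hi = mid
    return 0.5 * (lo + hi)

P1 = bisect(h2, -1.3, -1.0)
P2 = bisect(h2, -2.2, -1.7)

assert abs(P1 + 1.16139) < 1e-3      # Theorem A.1: P1 = -1.16139...
assert abs(P2 + 1.96745) < 1e-3      # Theorem A.1: P2 = -1.96745...
# the corresponding values of g quoted in Theorem A.1
assert abs(g(0.0) - 0.1) < 1e-12
assert abs(g(P1) - 0.025132) < 1e-4
assert abs(g(P2) - 0.0024389) < 1e-4
```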
Figure 7: The functions g(x) = [f(x) − 0.1] f'(x), h1(x) = g(x) − g(−x), and h2(x) = g(x) − g(−3x).
The error levels corresponding to points with values for a, b, and c given in terms of 0, P1, and P2 are 0.32, 0.407392, and 0.403321, respectively.

Proof. The values a = b = c = 0 certainly result in a solution. From the definition of g(x) and equation 2.4 it follows that g(−x) = [0.9 − f(x)] f'(x). Since f(x) is monotonously increasing from 0 to 1, it is clear that g(x) has exactly one zero point, where f(x) = 0.1, and one maximum and one minimum, and that lim_{x→±∞} g(x) = 0 (see Fig. 7). For each value of g(a) at most two different points P and Q exist such that g(P) = g(Q) = g(a). Since a, −a − b, −a − c, and a + b + c cannot all be negative, only the region with g(a) > 0, thus a > f⁻¹(0.1), has to be investigated. All possibilities are tested on the equality a + (a + b + c) = −[(−a − b) + (−a − c)], which results in conditions on P and Q. To obtain an extra solution it is necessary that either for some value of x ≠ 0 the relation g(x) = g(−x) or g(x) = g(−3x) holds. From the graph of h1(x) = g(x) − g(−x) (see Fig. 7) it is clear that h1(x) is not equal to zero if x ≠ 0. The function h2(x) = g(x) − g(−3x) (see Fig. 7) is equal to zero if and only if x is equal to one of the values in the set {0, P1, P2} = {0, −1.16139, −1.96745}. Checking these possibilities results in the conclusion that the nine solutions represented in Table 1 are the only solutions of g(a) = g(−a − b) = g(−a − c) = g(a + b + c). □
Appendix A.2 Some Theorems Proving That Certain Points Are Saddlepoints.

Theorem A.2. Consider the function q of two variables a and b in the neighborhood of a point where ∇q = 0, ∂²q/∂a² = 0, and ∂²q/∂a∂b ≠ 0; then the function q has a saddlepoint and no extreme in that point.

Proof. If for a function q(a, b) of two variables ∇q = 0 holds in a certain point, then the function is approximated by the second- and third-order terms of the Taylor series expansion. Assuming ∂²q/∂a² = 0 and ∂²q/∂a∂b ≠ 0, and taking Δa = αx and Δb = βx², the lowest-order term of Δq in x is the term with x³, with coefficient αβ ∂²q/∂a∂b + (α³/6) ∂³q/∂a³. If ∂²q/∂a∂b ≠ 0, then values of α ≠ 0 and β ≠ 0 can be found such that the coefficient of x³ is unequal to zero. Thus Δq will have values with opposite sign for x < 0 and x > 0. □

Theorem A.3. Let q be a function of three variables a, b, and c. If in a point with ∇q = 0, ∂^(i+j)q/∂a^i∂b^j = 0 for 0 < i + j < 6, and ∂³q/∂a∂b∂c ≠ 0 (or ∂³q/∂a²∂c ≠ 0 or ∂³q/∂b²∂c ≠ 0), then q has a saddlepoint and not an extreme in that point.

Proof. Consideration of the Taylor series expansion as a function of x, with Δa = αx, Δb = βx, and Δc = γx³, results in an expansion whose lowest-order terms are of order x⁴ and x⁵. If ∂²q/∂a∂c ≠ 0 or ∂²q/∂b∂c ≠ 0, then Theorem A.2 tells that the considered point is a saddlepoint. If both terms are equal to zero, then the coefficient of x⁵ is decisive if it is unequal to zero. If ∂³q/∂a²∂c ≠ 0, or ∂³q/∂a∂b∂c ≠ 0, or ∂³q/∂b²∂c ≠ 0, the coefficient of x⁵ is not identically zero, and so nonzero values of α, β, and γ can be found such that the coefficient of x⁵ is unequal to zero. Thus q can attain both higher and lower values for small values of x, and the point considered is a saddlepoint. □
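A minimal concrete instance of Theorem A.2 (our own toy example, not from the paper) is q(a, b) = ab: at the origin ∇q = 0, ∂²q/∂a² = 0, and ∂²q/∂a∂b = 1 ≠ 0, and along the curve (Δa, Δb) = (αx, βx²) used in the proof the increment αβx³ changes sign with x.

```python
# q(a, b) = a*b has grad q = 0, d2q/da2 = 0, and d2q/dadb = 1 != 0 at the
# origin. Along the proof's curve (da, db) = (alpha*x, beta*x**2) the
# increment is q = alpha*beta*x**3, which changes sign with x, so the
# origin is a saddlepoint, exactly as Theorem A.2 asserts.

def q(a, b):
    return a * b

alpha, beta = 1.0, 1.0
eps = 1e-2
assert q(alpha * eps, beta * eps ** 2) > 0.0    # x > 0: increment positive
assert q(-alpha * eps, beta * eps ** 2) < 0.0   # x < 0: increment negative
```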
Figure 8: The XOR network with two hidden units.

Theorem A.4. Let q be a function of three variables a, b, and c. If in a point with ∇q = 0, ∂^(i+j)q/∂a^i∂b^j = 0 for 0 < i + j < 8, and ∂⁴q/∂a²∂b∂c ≠ 0, then q has a saddlepoint and not an extreme in that point.

Proof. The proof is analogous to that of the previous theorem. We will take Δa = αx, Δb = βx, and Δc = γx⁴, leading to the expansion in powers of x. If the terms with x⁵ or x⁶ are unequal to zero, Theorems A.2 or A.3 can be applied. The coefficient of x⁷ is not identically zero if ∂⁴q/∂a²∂b∂c ≠ 0, and thus the theorem is proved. □

Appendix A.3 Some Results for the XOR Network with Two Hidden Units (see Fig. 8). We will prove that stationary points with v1 = 0 and/or v2 = 0 are saddlepoints. For the quadratic error function, the partial derivative with respect to w11 is equal to
Figure 9: The error surface in the neighborhood of the point with weights u = v1 = v2 = w11 = w22 = 0, w01 = 1.50931, w21 = 0.48349, w02 = −0.89611, w12 = −0.57221. This saddlepoint view is obtained by varying Δv1 from −0.0005 to 0.0005 and Δw01 = Δw11 = Δw21 from −0.5 to 0.5.

with t00 = t11 = 0.1 and t01 = t10 = 0.9, where S_X1X2 denotes the term of the derivative belonging to pattern P_X1X2 (analogous to R_X1X2 in equation 4.2).
Thus, if S11 ≠ 0, at least one of these terms is unequal to zero and the considered points are saddlepoints by Theorems A.2, A.3, and A.4. If S11 = 0 and S10 ≠ 0, consideration of the corresponding partial derivatives with respect to w11 and v1 leads to the same conclusion. If S11 = 0, S10 = 0, and S01 ≠ 0, the partial derivatives with respect to w21 instead of w11 can be considered to prove that these points are saddlepoints, and if S11 = S10 = S01 = 0 and S00 ≠ 0, the partial derivatives with respect to w01 lead to the same result. Finally, a point with finite weights and S11 = S10 = S01 = S00 = 0 would have error zero, which cannot occur if v1 = 0. Thus all stationary points with finite weights and v1 = 0 are saddlepoints. From symmetry it is clear that the same holds if v2 = 0. Lisboa and Perantonis (1991) mentioned that the point in weight space with u = v1 = v2 = w11 = w22 = 0.0, w01 = 1.50931, w21 = 0.48349, w02 = −0.89611, and w12 = −0.57221 is a local minimum. They used the error function E' given in equation 6.1, but the proof given above that this point is a saddlepoint can be transformed into a proof for this error function by skipping the factor with the derivative of f in S_X1X2. Figure 9 visualizes a neighborhood of this point in such a way that it is clear that this is indeed a saddlepoint. Blum (1989) found a manifold of local minima for the network of Figure 8 where the weights are taken to be symmetrical, so v1 = v2 = v, w01 = w02 = w0, w11 = w22 = w1, and w21 = w12 = w2. The desired output of the patterns P00 and P11 is equal to t1 and that of the patterns P01 and P10 is equal to t2. The manifold mentioned by Blum is given by w0 = w1 = w2 = 0 and f(u + v) = (t1 + t2)/2. For the point with v = 0 on this manifold, consideration of the partial derivatives with respect to w1, w2, and v results in the proof that this point is a saddlepoint, analogous to the proof given above.
A careful analysis of the Taylor series expansion up to order 3 in directions where the second-order part of this expansion is zero leads to the result that the other points on this manifold are also saddlepoints. A complete proof can be found in Sprinkhuizen-Kuyper and Boers (1994).
Acknowledgments We would like to thank the referees for their valuable suggestions to improve this paper.
References

Blum, E. K. 1989. Approximation of Boolean functions by sigmoidal networks: Part I: XOR and other two-variable functions. Neural Comp. 1, 532-540.
Boers, E. J. W., Borst, M. V., and Sprinkhuizen-Kuyper, I. G. 1995. Evolving artificial neural networks using the "Baldwin effect". In Artificial Neural Nets and Genetic Algorithms, D. W. Pearson, N. C. Steele, and R. F. Albrecht, eds., pp. 333-336. Springer-Verlag, Wien.
Gorse, D., Shepherd, A., and Taylor, J. G. 1993. Avoiding local minima by progressive range expansion. In Proceedings of the International Conference on Artificial Neural Networks, S. Gielen and B. Kappen, eds. Springer-Verlag, Berlin.
Lisboa, P. J. G., and Perantonis, S. J. 1991. Complete solution of the local minima in the XOR problem. Network 2, 119-124.
Minsky, M., and Papert, S. 1969. Perceptrons. MIT Press, Cambridge, MA.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, J. L. McClelland, D. E. Rumelhart, and the PDP research group, eds., Vol. 1, pp. 318-362. MIT Press, Cambridge, MA.
Sprinkhuizen-Kuyper, I. G., and Boers, E. J. W. 1994. A comment on a paper of Blum: Blum's "local minima" are saddle points. Tech. Rep. 94-34, Dept. of Computer Science, Leiden University, The Netherlands.

Ida G. Sprinkhuizen-Kuyper and Egbert J. W. Boers
Received January 11, 1995; accepted February 6, 1996.
Communicated by Paul Viola
Statistical Approach to Shape from Shading: Reconstruction of Three-Dimensional Face Surfaces from Single Two-Dimensional Images

Joseph J. Atick, Paul A. Griffin, and A. Norman Redlich
Computational Neuroscience Laboratory, The Rockefeller University, 1230 York Avenue, New York, NY 10021-6399 USA

The human visual system is proficient in perceiving three-dimensional shape from the shading patterns in a two-dimensional image. How it does this is not well understood and continues to be a question of fundamental and practical interest. In this paper we present a new quantitative approach to shape-from-shading that may provide some answers. We suggest that the brain, through evolution or prior experience, has discovered that objects can be classified into lower-dimensional object-classes as to their shape. Extraction of shape from shading is then equivalent to the much simpler problem of parameter estimation in a low-dimensional space. We carry out this proposal for an important class of three-dimensional (3D) objects: human heads. From an ensemble of several hundred laser-scanned 3D heads, we use principal component analysis to derive a low-dimensional parameterization of head shape space. An algorithm for solving shape-from-shading using this representation is presented. It works well even on real images, where it is able to recover the 3D surface for a given person, maintaining facial detail and identity, from a single 2D image of his face. This algorithm has applications in face recognition and animation.

1 Introduction

Our brain is remarkable in its ability to perceive a three-dimensional world from the two-dimensional images projected on the retina (Ramachandran 1988; Todd and Mingolla 1983; Mingolla and Todd 1986; Gulick and Lawson 1976). How it achieves this task is poorly understood and continues to be an active topic within the neuroscience and machine vision communities. What is well understood are the cues it uses to extract a three-dimensional (3D) interpretation.
Many of these cues have been exploited by artists over the years to add realism to their work. For example, shading patterns, first used in the fourteenth century by Renaissance painters, create a vivid impression of 3D shape in two-dimensional (2D) paintings. While there are other important cues that contribute to our perception of 3D space, such as binocular disparity and motion parallax, in this paper we are interested in the problem of recovering 3D shape from a single 2D image using shading information only: the so-called shape-from-shading problem (for an excellent explication of this problem, see Horn and Brooks 1989). Shading is the variation in brightness from one point to another in an image. It carries information about shape because the amount of light a surface patch reflects depends on its orientation (surface normal) relative to the incident light. So, in the absence of variability in surface reflectance properties (surface material), the variability in brightness can be due only to changes in local surface orientation and hence conveys strong information about shape. At the outset we should emphasize that shape-from-shading is fundamentally a very difficult mathematical problem. The brightness at a given point fixes only the projection of the surface normal onto the incident light vector (see next section), and hence one cannot associate a unique normal to each point. In fact, if normals are assigned independently at each point, then there is an infinite number of normal vector fields that are consistent with the image brightness data, so the problem appears to be ill-posed. Of course, normals cannot be independent, since the vector field of a true smooth surface satisfies constraints such as integrability (Guggenheimer 1977). These constraints imply that normals are strongly coupled across the surface and cannot be determined by purely local or point-by-point analysis. It is this coupling that is at the heart of the complexity of this problem.

Neural Computation 8, 1321-1340 (1996) © 1996 Massachusetts Institute of Technology
To make things worse, in typical situations the light source and reflectance properties of the surface are not known and have to be estimated simultaneously with shape. Most algorithms proposed thus far in the literature attempt to make shape-from-shading well-posed by imposing some smoothness constraints that cut the infinite number of solutions consistent with the image data down to the few that satisfy the constraints. This approach, while theoretically promising, in practice suffers from a number of problems. These include sensitivity to smoothness parameters, multiple false solutions, nonrobustness against noise, and generally poor reconstruction for real-world images. We should note that most previous shape-from-shading algorithms are intended to be applicable to images of smooth but otherwise arbitrary objects. Technically speaking, this means that one attempts to estimate shape in a space with an excessively large number of degrees of freedom1 from the limited information contained in the image.

1. For a typical image size, the depth function z(x, y) of the surface represents nearly one hundred thousand degrees of freedom. The generic constraints of smoothness are not likely to lower the dimensionality enough if one still insists on being able to represent arbitrary shapes.

The
difficulties common to such algorithms directly stem from working in this higher-dimensional space. While generality may be a noble aim from a mathematical point of view, it is neither clear that it is practically achievable nor obvious that the brain solves its shape-from-shading problems in such a way. It is well known that expectation and prior knowledge of the world can influence our interpretation of sensory data (Gregory 1970; Ramachandran 1990; Anstis 1991). What if it were true that the brain, through evolution or interaction with its environment, has discovered that objects can be classified into object-classes as to their shape? Shape space within each class may then be easy to parameterize and may even be very low-dimensional. If this is true, then shape-from-shading becomes equivalent to estimating a small number of parameters given an image and knowledge of what class the object belongs to, certainly a more tractable problem. One way to explore this idea is to start with an ensemble of shapes of related 3D objects, use standard statistical techniques such as principal component analysis (Karhunen 1946; Loève 1955; Joliffe 1986), and derive a dimensionally reduced representation for shapes in this class. From a database of several hundred laser-scanned heads2 we carry this out for the class of 3D human heads. We show that principal components provide an excellent low-dimensional parameterization of head shape that maintains facial detail and the identity of the person. We use this representation to solve the shape-from-shading problem for any human head; we are able to recover an accurate 3D surface of the head/face of any person from a single 2D image of his face. The organization of this paper is as follows: in Section 2 we define mathematically the problem of shape-from-shading. In Section 3 we derive a statistical parameterization of head-space that we use in Section 4 to solve shape-from-shading and apply it to real images of faces.
Some details about the database used to extract the statistical regularities of human heads are given in Appendix A. In Appendix B we describe an algorithm for determining the light source from the image. Technical details related to the derivation of principal components of human heads are relegated to Appendix C. For other approaches to shape-from-shading see Horn and Brooks (1989, and references therein), Horn (1970), Ikeuchi and Horn (1981), Brooks (1982), Pentland (1984, 1990), Lee and Rosenfeld (1989), Zheng and Chellappa (1991), Oliensis (1991), and Lehky and Sejnowski (1988). For other approaches to the construction of low-dimensional parameterizations of shape space see Cutzu and Edelman (1995), Edelman (1995), and more specifically for face shape see Vetter and Poggio (1995, and references therein).

2. This database was made available to us by the Human Engineering Division of the Wright-Patterson Air Force Base; see Appendix A for further information about the database.
2 The Shape-from-Shading Problem
Mathematically speaking, shape-from-shading is equivalent to an inverse rendering problem. As such we will formulate it within a given rendering model: the so-called Lambertian model of surface reflectance. We should keep in mind that the algorithm presented in Section 4 holds for any other rendering model. With the assumption of orthographic projection and Lambertian surfaces, the rendering equation for a single light source is given by3

I(x, y) = η(x, y) R(L, n^s) = η(x, y) L · n^s(x, y)    (2.1)

where L = (Lx, Ly, Lz) is a vector representing the incident light and n^s(x, y) is the normal to the surface. η(x, y) is called the albedo, and it represents the deviation in reflectance properties due to pigmentation or markings on the surface. Finally, R(L, n^s) = L · n^s is known as the reflectance map. Strictly speaking, this model of image formation includes a hard nonlinearity that sets I to zero at points where the reflectivity is negative; those are the points of self-shadowing. In the rest of this paper we will continue to suppress this nonlinearity from our displayed equations, but the reader should keep in mind that it is implicitly taken into account.

We will describe surfaces parametrically. For example, human heads can be described by the function r(θ, ℓ) in cylindrical coordinates, where r is the radius and ℓ and θ are the height and angular coordinates, respectively. The Euclidean (x, y, z) coordinates of each point on the surface are related to these through

V(θ, ℓ) ≡ [x(θ, ℓ), y(θ, ℓ), z(θ, ℓ)] = [x0 + r(θ, ℓ) sin θ, y0 + ℓ, z0 + r(θ, ℓ) cos θ]    (2.2)

for some shift x0, y0, z0 relating the position of the origin in the two coordinate systems. The local tangent plane to the surface is spanned by the vectors ∂V/∂θ and ∂V/∂ℓ. Thus the direction of the normal n^s is given by the vector cross-product of these vectors:

n^s(θ, ℓ) ∝ ∂V/∂θ × ∂V/∂ℓ    (2.3)

It is not difficult to show that the unit normal is

n^s(θ, ℓ) = [r² + r²(∂r/∂ℓ)² + (∂r/∂θ)²]^(-1/2) (r sin θ - (∂r/∂θ) cos θ, -r ∂r/∂ℓ, r cos θ + (∂r/∂θ) sin θ)    (2.4)

3. The coordinates x, y are the 2D projections of the Euclidean 3D coordinates (x, y, z) in which the surface z(x, y) is embedded; the z axis is along the optical axis.
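The geometry of equations 2.1-2.4 can be sketched numerically. The fragment below is an illustrative discretization (function and variable names are ours, not the authors'): given a radius function r(θ, ℓ) on a cylindrical grid, it computes the unit normals of equation 2.4 and the Lambertian image of equation 2.1, including the self-shadowing clip mentioned above.

```python
import numpy as np

def render_cylindrical(r, theta, ell, L, albedo=1.0):
    """Lambertian shading of a surface r(theta, ell) in cylindrical coordinates."""
    T = theta[:, None]                       # broadcast theta over the height axis
    dr_dt = np.gradient(r, theta, axis=0)    # dr/dtheta
    dr_dl = np.gradient(r, ell, axis=1)      # dr/dell
    # components of the normal from the cross product dV/dtheta x dV/dell (eq. 2.4)
    nx = r * np.sin(T) - dr_dt * np.cos(T)
    ny = -r * dr_dl
    nz = r * np.cos(T) + dr_dt * np.sin(T)
    norm = np.sqrt(nx**2 + ny**2 + nz**2)    # = sqrt(r^2 + r^2 (dr/dl)^2 + (dr/dt)^2)
    Lx, Ly, Lz = L
    I = albedo * (Lx * nx + Ly * ny + Lz * nz) / norm   # eq. 2.1: I = eta * L . n
    return np.maximum(I, 0.0)                # self-shadowing: clip negative reflectivity

# Sanity check: a perfect cylinder of radius 1, lit along the optical (z) axis,
# must have brightness cos(theta).
theta = np.linspace(-np.pi / 2, np.pi / 2, 64)
ell = np.linspace(0.0, 1.0, 50)
r = np.ones((64, 50))
I = render_cylindrical(r, theta, ell, L=(0.0, 0.0, 1.0))
print(np.allclose(I, np.cos(theta)[:, None]))  # True
```

The same routine applied to a head-like radius function r(θ, ℓ) would produce the synthetic Lambertian face images used later to test the algorithm.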
The procedure for rendering is to first compute I(θ, ℓ) using equations 2.4 and 2.1; then use the coordinate transform 2.2 to recover I(x, y). In general this will evaluate I(x, y) at nonuniform values of x, y; we recover the image on a uniform coordinate grid through standard interpolation techniques (see Wolberg 1992). Given the image I(x, y), the problem is to find the surface S, the albedo η, and the light L that satisfy equation 2.1. In general, these three are unknown and have to be determined simultaneously. In practice, however, one may be able to estimate each of the three quantities separately. For example, one could assume initially that the albedo is constant, estimate the light direction, and then use that estimate to find the surface. One may even use iterative schemes that alternate between estimating the light direction and the surface shape (Brooks and Horn 1985). Estimating the light source direction turns out not to be a major problem. There are now many successful algorithms for doing so (Pentland 1982; Brooks and Horn 1985; Lee and Rosenfeld 1989; Zheng and Chellappa 1991). In Appendix B we present our algorithm, which is able to determine light direction with an accuracy of better than 5 degrees for images of faces under nonextreme illumination directions. Henceforth, we will not consider this problem and will assume that the light direction has been determined. We will also set the albedo to a constant, and we will return to it in a future publication. Actually, as we will see in Section 4, ignoring albedo turns out to be a very good first approximation for faces. Equation 2.1 can be viewed as a nonlinear partial differential equation for the surface function r(θ, ℓ). Unfortunately, thinking of it this way is not very useful for real images: standard methods of numerical integration of differential equations (e.g., the characteristic strip method) in practice fail miserably.
These methods are inherently too sensitive to noise and they require knowledge of boundary conditions. An alternative formulation is to think of shape-from-shading as an optimization problem where one attempts to minimize the average error

E = ∫_S dx dy [I(x, y) - R(L, n^s)]²

with respect to the surface shape. As stated in the introduction, without imposing constraints, this problem is ill-defined: there is an infinite number of solutions. The constraints are typically chosen to ensure smoothness of the recovered surface and to allow one solution to be favored. There is an entire mathematical discipline known as regularization theory that attempts to do this (Tikhonov and Arsenin 1977). The problem is that, again, in practice standard regularization does not work satisfactorily with shape-from-shading. Among its problems are sensitivity to choice of constraints and a proliferation of local minima. To avoid these problems, we give up generality and focus on solving shape-from-shading within one object class at a time, where knowledge of properties of the class can be used to severely limit the shape degrees of freedom. In this paper we implement this approach for the class of human heads.
3 Low-Dimensional Representation of Human Head Shape
Human head shape is amazingly consistent across billions of people. The gross structure is invariably the same; all people have protrusions we call noses, depressions we call eye sockets, and flatter regions for foreheads and cheeks. Anthropometric surveys (Hursh 1976) have examined the extent of this similarity and have quantitatively confirmed that the variability from one head to another is, in fact, relatively small. Nevertheless, it is these small deviations (on average on the order of a centimeter) that give a face its unique identity. Actually, face shapes are much like imprints on coins; they are generated through small deviations from the large-scale structure. This suggests a hierarchical representation for human head shape: in cylindrical coordinates we can describe any given face r(θ, ℓ) as a perturbation about a function common to all faces, r0(θ, ℓ), the so-called "mean-head":

r(θ, ℓ) = r0(θ, ℓ) + ρ(θ, ℓ)    (3.1)

where ρ(θ, ℓ) are small fluctuations, ρ/r0 ≪ 1, that capture the identity of the person. In this paper we take r0(θ, ℓ) to be the average of the first 200 heads from the USAF database (see Appendix A), i.e.,

r0 = (1/200) Σ_{i=1}^{200} r^i(θ, ℓ)    (3.2)

In general, however, one may use several r0s corresponding to the averages of a few clusters. For example, adult males may be described by one cluster while females by another. Furthermore, we should keep in mind that we are free to perform global 3D transformations such as scaling on a given head if necessary to maintain the validity of the expansion 3.1 even for unusually small or large heads. To represent ρ(θ, ℓ) we use principal component analysis (Karhunen 1946; Loève 1955; Joliffe 1986) and expand the fluctuations in terms of the set of eigenfunctions Φ_i:

r(θ, ℓ) = r0(θ, ℓ) + Σ_i a_i Φ_i(θ, ℓ)    (3.3)

The eigenfunctions Φ_i are derived from the first 200 heads in the USAF database using the procedure described in Appendix C. They represent an empirically derived basis set that captures the statistical regularities in the database and creates a low-dimensional representation of the data. Among their desirable properties is the fact that the mean square error introduced by truncating the summation in 3.3 is minimal. In Figure 1 we show the mean-head and the first 15 eigenmodes rendered after adding each mode to r0(θ, ℓ). These pictures are very revealing; many modes seem to predominantly represent features that we can
identify as face components. The claim is that by taking appropriate linear combinations of these modes and adding them to the mean r0, we can generate with good accuracy the shape of any human head. In fact it is this generalization ability of these modes that is crucial to this approach. We have tested this using the remaining 147 surfaces in the database (the ones not used to compute the mean or the modes). Since the eigenmodes are orthogonal, given a surface r'(θ, ℓ), its eigenmode expansion coefficients are simply

a_i = (1/λ_i) ∫ dθ dℓ [r'(θ, ℓ) - r0(θ, ℓ)] Φ_i(θ, ℓ)    (3.4)

where λ_i are the eigenvalues. From these coefficients the eigenmode representation of this surface is given by equation 3.3. We have computed the error |r^actual - r^eigenmode| / r^actual for the 147 out-of-sample heads as a function of the number of modes used in the representation. The average over the 147 surfaces and over all points is shown in Figure 2, where we find that the error drops to about 1% by the time we use 40 modes.4 However, this should be taken only as a rough indication of the quality of the representation, since the error measure does not capture the dependence of the error on spatial position and also has no perceptual meaning. We can say, though, that when the error is less than 1%, the reconstructed surface and the original are perceptually almost identical. This should be compared with the error when no modes are used, i.e., the distance to the mean-head, which ranges from 10 to 27%; ten percent error is perceptually very significant. We should point out that principal components were used in Sirovich and Kirby (1987) and Kirby and Sirovich (1990) to derive a representation of images of human faces. There, it was shown that eigenmodes (so-called eigenfaces) provided an excellent low-dimensional characterization of face images. In this paper we rely on the same principle that makes eigenfaces successful, namely, the fact that human faces, whether imaged or as surfaces, have few degrees of freedom and thus can be represented with a relatively small number of parameters. Here, of course, we compute eigenmodes for surfaces and not images, so these functions have a different interpretation and utility from those obtained by Sirovich and Kirby. In analogy with eigenfaces one may use the term eigenheads to refer to these modes.

4. Notice that what we have examined here is the out-of-sample generalization error, which is to be contrasted with the in-sample truncation error. The dependence of the latter on the number of modes is known theoretically (rms error is given by the sum of the eigenvalues of the modes that are dropped) and is not of interest in this context. The fact that the out-of-sample error turns out to be very small is a strong indication that the representation has captured the true characteristics of head shape to the point that it can represent any head.
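The construction of equations 3.1-3.4 can be illustrated on synthetic data. In the sketch below (ours, not the authors' code; surfaces are flattened to vectors and the modes are kept orthonormal, so the division by λ_i in equation 3.4 reduces to a plain inner product), a training ensemble generated from a few hidden factors yields a mean "head," eigenmodes, and out-of-sample reconstructions whose error falls as modes are added, mirroring Figure 2.

```python
import numpy as np

# Illustrative eigenhead construction on synthetic "surfaces": each surface is a
# flattened vector; the hidden low-dimensional structure plays the role of the
# statistical regularity of real heads.
rng = np.random.default_rng(0)
n_train, n_test, dim, n_latent = 200, 147, 60, 5

factors = rng.normal(size=(n_latent, dim))         # hidden shape factors
r0_true = 10.0 + rng.normal(size=dim)
heads = r0_true + rng.normal(size=(n_train + n_test, n_latent)) @ factors

train, test = heads[:n_train], heads[n_train:]
r0 = train.mean(axis=0)                            # eq. 3.2: the mean head
rho = train - r0                                   # eq. 3.1: fluctuations
_, _, Vt = np.linalg.svd(rho, full_matrices=False)
modes = Vt                                         # orthonormal eigenmodes Phi_i

def reconstruct(r, k):
    a = (r - r0) @ modes[:k].T                     # eq. 3.4: projection coefficients
    return r0 + a @ modes[:k]                      # eq. 3.3: truncated expansion

# out-of-sample relative error, averaged over surfaces and points
err = lambda k: np.mean(np.abs(reconstruct(test, k) - test) / np.abs(test))

print(err(1) > err(n_latent))   # True: more modes, smaller error
print(err(n_latent) < 1e-6)     # True: 5 modes capture the 5 latent factors
```

In the real construction the modes are eigenfunctions over the (θ, ℓ) grid and the out-of-sample error decays gradually (Figure 2) rather than collapsing at an exact latent dimension, but the train/test split and the projection logic are the same.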
Figure 1: The mean-head surface, r0 (upper left-most corner), and the 15 most significant eigenheads Φ_i. To display them, the modes Φ_i have been added to r0 and then rendered straight-on (see Appendix A for details about the database used to extract these modes).
4 Solving Shape-from-Shading
With the space of head shapes parameterized, any individual's head is given by specifying the coefficients a_i, i = 1, ..., N. For N = 200, the number of degrees of freedom in this space is much smaller than those available in a general r(θ, ℓ) [at the resolution we are using, r(θ, ℓ) has
Figure 2: The dependence of the reconstruction error on the number of modes used in the representation. The error is defined as |r^actual - r^eigenmode| / r^actual, averaged over all points on the surface and over the 147 out-of-sample heads (those not used to construct the eigenmodes), and is displayed as a percentage.

256 × 200 = 51,200 real degrees of freedom representing the visible surface]. In this section we consider only the problem of determining the surface; we assume that the algorithm in Appendix B was used to determine the source. We will also set the albedo η = 1; we will return to the problem of albedo in a future publication. Thus the only unknowns in the shape-from-shading problem are the eigenhead coefficients a_i. To determine a_i we can try to minimize the error function
E(a) = ∫ dθ dℓ [I(x0 + r sin θ, ℓ) - R(L, a)]²    (4.1)

with respect to the coefficients a_i, where

r(θ, ℓ) = r0(θ, ℓ) + Σ_{i=1}^{200} a_i Φ_i(θ, ℓ)    (4.2)

and R(L, a) is the reflectance map computed using 2.4 and 4.2. I(x0 + r sin θ, ℓ) is the image I(x, y) sampled in cylindrical coordinates using the radius function r. Notice that in this formulation, as we vary a_i looking for a minimum, both R(L, a) and I(x0 + r sin θ, ℓ) vary. This has the undesirable effect that very often the algorithm can find a trivial solution that will
minimize 4.1, for example, r → 0 or r → ∞. To avoid this problem, we take advantage of the perturbative nature of human heads in cylindrical coordinates and series expand the image I as follows:

I(x0 + r sin θ, ℓ) = I(x0 + r0 sin θ, ℓ) + ∂xI(x0 + r0 sin θ, ℓ) sin θ [Σ_i a_i Φ_i(θ, ℓ)] + ···    (4.3)
Generally, we find the first-order term is sufficient, but slightly better results can be achieved by keeping terms quadratic in the fluctuation. This formulation is also computationally efficient, since one computes I(x0 + r0 sin θ, ℓ) and ∂xI(x0 + r0 sin θ, ℓ) from I(x, y) once and for all before attempting to determine the coefficients a_i. We compute these from I(x, y) and ∂xI(x, y) through linear interpolation. We have used two minimization techniques to determine a_i: conjugate gradient and gradient descent (Press et al. 1992), and they both converge to the same solution. The algorithm was tested on about a dozen images generated by Lambertian rendering of surfaces from the out-of-sample portion of the USAF database. We have also tested it on real images taken from a video camera. In Figure 3 three typical results for the purely Lambertian images are shown. In the first column we show the images I(x, y) used as input to the algorithm; in the second column we show the rendering of the reconstructed surface after the minimization algorithm converged. The third and fourth columns give a comparison between the image of the original surface and the image of the reconstructed surface at 90°. One can verify again and again that the algorithm is capable of reproducing the original face shape for synthetic Lambertian images.5 Of course, there is limited interest in such images beyond academic considerations, and the true test of this shape-from-shading algorithm is how well it works for real images, images taken by a camera. In Figure 4 we show one such test. The face in the center is a TIFF image taken by a video camera and cropped. The algorithm in Appendix B was used to determine the light direction and its strength (the light direction was found to a very good approximation to be straight-on, with overall strength 124). For this image the shape-from-shading algorithm converged in 24 iterations of the conjugate gradient minimization.

The surface that it extracts from the image is shown from four different points of view in Figure 4. Since the 3D laser scan of this person is not available, we leave it to the reader to decide on the quality of the reconstruction. However, we should point out that we found this reconstruction is excellent for designing total contact burn masks from a photograph,6 an application this algorithm is currently being developed for.

5. Although we have not done systematic quantification of the error in recovery, from the dozen examples that we considered the average error is on the order of 2%.

6. These are facial masks that are worn by burn victims during an extended rehabilitation period.
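The fit described in this section can be illustrated on a toy low-dimensional analog. The sketch below is ours (the authors used conjugate gradient and gradient descent on the full problem): it expands a 2D "head profile" in a mean plus two modes, renders it with the corresponding Lambertian model, and recovers the coefficients a_i by repeatedly solving the linearization in the spirit of equation 4.3 in the least-squares sense, i.e., a Gauss-Newton iteration on the error of equation 4.1.

```python
import numpy as np

theta = np.linspace(-1.2, 1.2, 200)
r0 = np.ones_like(theta)                        # mean profile: a circular arc
modes = np.stack([np.cos(2 * theta), np.sin(3 * theta)])   # two toy "eigenmodes"

def render(a):
    # 2D analog of the Lambertian model, light along the optical (z) axis
    r = r0 + a @ modes
    dr = np.gradient(r, theta)
    return np.clip((r * np.cos(theta) + dr * np.sin(theta))
                   / np.sqrt(r**2 + dr**2), 0.0, None)

a_true = np.array([0.05, -0.03])                # small fluctuations, as in eq. 3.1
I_obs = render(a_true)                          # "observed" image

def E(a):                                       # discretized analog of eq. 4.1
    return np.sum((I_obs - render(a)) ** 2)

a = np.zeros(2)
for _ in range(10):
    # Jacobian of the rendered image w.r.t. the coefficients (finite differences):
    # the linearization that eq. 4.3 performs analytically
    J = np.stack([(render(a + 1e-6 * e) - render(a - 1e-6 * e)) / 2e-6
                  for e in np.eye(2)], axis=1)
    delta, *_ = np.linalg.lstsq(J, I_obs - render(a), rcond=None)
    a = a + delta                               # Gauss-Newton update

print(np.allclose(a, a_true, atol=1e-4), E(a) < E(np.zeros(2)))
```

Because the fluctuations are small, the linearized steps converge in a few iterations, which is the same reason the first-order expansion of equation 4.3 suffices on real heads.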
Figure 3: Three comparisons between surfaces extracted by the shape-from-shading algorithm and the true surfaces. The first column from the left shows the 2D images used as input to the algorithm. The second column shows the recovered surfaces rendered with a straight-on light. The third and fourth columns give the same comparison but viewed after rotation by 90°.

5 Comments and Relevance to the Brain

There are other practical applications of being able to construct a 3D model of a given face from a 2D image. For example, in face recognition, it is often necessary to change the pose of a person in an image to bring it to a canonical frontal pose, just as it is necessary to compensate for the size of the head by scaling it to a canonical size (Atick et al. 1995). In Figure 5 we use the 3D model extracted by shape-from-shading in the previous section to generate the different poses from a single image. Of course there are other ways to synthesize different views from a single image. Among these, for example, is the technique of Vetter and Poggio
Figure 4: An example of a surface reconstructed from a real 2D image (center). Four views of the reconstructed surface are shown surrounding the original 2D image.
(1995) that exploits image transformations that are specific to the object class and learnable from example views of other objects of the same class. This technique, however, requires the derivation of 2D flow fields that map a person’s image into a reference image, which, in general, is very difficult. For other techniques that achieve this see Beymer and Poggio (1995) and Lando and Edelman (1995). We must emphasize that the current algorithm was not optimized for speed or output quality. It is preliminary in the sense that many of its engineering details can be better implemented. Also, within this approach, how well the algorithm works depends on the quality of the eigenhead surfaces, which in turn is a function of the database used. One can go back and rederive the eigenheads from larger and less noisy ensembles of heads (currently there are databases with 50,000 human heads available). Another problem of interest at a more fundamental
Figure 5: With knowledge of the 3D model of a face and a single image (same as the one in the center of Fig. 4) we can generate through texture mapping what this person will look like in any 3D pose.

level is that of determining albedo. We intend to come back to this issue in a later publication. Finally, one should explore whether the work presented here has any implications for the way the brain solves its shape perception problems. One may be tempted to argue that algorithms of shape-from-shading may not have any relevance to the brain, since it is unlikely that the goal of vision is to reconstruct the outside world. This is a somewhat naive interpretation of what shape-from-shading algorithms are supposed to do. Shape is an intrinsic property of objects, independent of the viewing and imaging conditions, and the brain could use some shape-from-shading algorithm to extract an invariant representation of objects, one that does not change as, say, the light changes. The eigenheads are precisely that; they provide an invariant representation whose elements are computable from image data. The output of the elements can then be used in cognitive tasks such as recognition and discrimination. In fact, preliminary work in our laboratory shows that shape information, even rudimentary, provides an additional strong signal over and above the albedo signal that improves discriminability in face recognition tasks. The eigenheads have advantages, in addition to their computability. For example, being low-dimensional they yield better generalization. They also have certain testable implications. For example, since they are specific to the class of heads, one should expect semantic categorization
to affect the perception of shape-from-shading. There is evidence that this is indeed the case in the brain (Churchland et al. 1994). In the famous experiments of Helmholtz, and in the more recent ones by Ramachandran, human masks viewed from the inside (concave view) invariably appeared convex, with the nose protruding toward the observer.7 The effect disappears when the masks are presented upside down, as one would expect if the brain's shape perception mechanism probed by the experiment is specific to the category of heads (upside-down heads are not normally encountered and are not expected to belong to the same category as heads). The eigenheads can be used as a tool for probing the neural response even if one does not accept them as the elements of head representation in the brain. For example, in the temporal lobe it is well known that neurons exhibit strong object specificity, and the so-called face cells respond selectively to human faces and heads. What is not clear yet is what properties of faces these neurons are partial to. One can use the eigenheads as stimuli to systematically measure the response of face cells. One can even apply albedo maps to the shapes generated, or render them under different lighting, to determine the relative effect of 3D shape, color, and lighting on the neural response.
Footnote: Another experiment that could reveal the specificity of the brain's shape perception mechanisms is a variant of the experiments pioneered by Koenderink et al. (1992). In these experiments subjects view images of rendered surfaces and adjust a gauge to reflect the perceived local surface normal at many different points on the surface. It would be interesting to see if human performance is quantitatively different when the surfaces are actually 3D human heads.

Appendix A: The Database

To derive the eigenheads we used a database of laser-scanned human heads. The database was made available to us by the Human Engineering Division of the Wright-Patterson USAF base and is known as the "mini-survey." It contains 347 scanned heads of adult males. Each surface in the database is given as r(θ, ℓ), with 512 units of resolution for θ and 256 units of resolution for ℓ. The data were generated using a CyberWare laser scanner. The data provided are unprocessed, which means they contain significant noise and gaps in the surfaces. We performed some simple preprocessing to fill in the gaps and to smooth some of the noise. We also cropped the data down to 256 x 200 (angular x height) around the nose, since we are interested in the reconstruction of only the facial portion of heads. The back of the head was not used in the analysis presented in this paper. Finally, we aligned the data using 3D rigid transformations. It is important to emphasize that we split the database into two parts: the first, containing 200 surfaces, was used to compute the mean
r₀ and the eigenmodes Ψᵢ. The second, containing the remaining 147 surfaces, was used for out-of-sample testing (see Section 3).
Appendix B: Determining the Source

In this appendix we present an algorithm for determining the light source, L, from an image of the face I(x, y). The algorithm is very simple; it uses the mean head surface to model how light and shadow vary over a face as a function of incident light direction L, and determines L by minimizing the error function:
E = \int d\theta \, d\ell \, \left\{ I[x_0 + r_0(\theta, \ell)\sin\theta, \, \ell] - \mathbf{L} \cdot \mathbf{n}_0 \right\}^2 \qquad (B.1)

where n₀ is the normal vector to the mean surface r₀(θ, ℓ) calculated from equation 2.4. The image I[x₀ + r₀(θ, ℓ) sin θ, ℓ] is simply the input image I(x, y) mapped to cylindrical coordinates using the mean surface r₀. We use linear interpolation to evaluate the image at nonintegral positions. Since the energy is quadratic in L, the minimum can be evaluated explicitly; δE/δLᵢ = 0 leads to
L_i = \sum_{j=1}^{3} \langle I \, n_j^0 \rangle \, \big( \langle n_i^0 \, n_j^0 \rangle \big)^{-1} \qquad (B.3)

with brackets indicating integration over (θ, ℓ). To avoid the rectification nonlinearity in the rendering equations (see discussion below equation 2.1), this integration is restricted to points where I is greater than zero. We can test how well this works by rendering out-of-sample faces at different lights, using them as input to the algorithm, and estimating L^est by solving the above 3 x 3 linear equation. Figure 6 shows a typical graph of the estimation error, defined as the angle between the estimated vector L^est and the true vector L used to generate the image, i.e.,

\Delta\theta = \cos^{-1} \left( \frac{\mathbf{L}^{est} \cdot \mathbf{L}}{\|\mathbf{L}^{est}\| \, \|\mathbf{L}\|} \right)

The figure is a plot of Δθ as a function of the standard illumination angles α and γ. These angles are related to the light direction by

\frac{L_x}{\|\mathbf{L}\|} = \sin\gamma \cos\alpha, \qquad \frac{L_y}{\|\mathbf{L}\|} = \sin\gamma \sin\alpha
Figure 6: The angular error in estimating the light source, Δθ, using the algorithm of Appendix B, as a function of the incident light direction angles, α and γ. For |α| and |γ| < 75°, Δθ is actually less than 5°.
We find, in the regime |α| < 75°, |γ| < 75°, that Δθ < 5°, with the typical value of Δθ less than 2°. The estimation error goes up to 20° for the extreme lighting conditions of α = 90° and γ = 90°. This is not surprising considering that under this lighting very little of the image is visible. Thus, for all lighting conditions of interest, this algorithm works exceedingly well. The reason the algorithm works very well is that incident light is the same for all points on the face and thus can be determined from the large-scale shading patterns. Large-scale shading for the most part is independent of the identity of the person, and its dependence on the light direction is well captured by the renderings of the mean head, r₀. The algorithm is also robust to noise, since L is estimated from moments ⟨I nᵢ⟩ and ⟨nᵢ nⱼ⟩ that are averaged over a very large number of points. (This number varies since the average is carried out only over the points where I is nonvanishing, but for nonextreme lighting conditions it is on the order of 40-50 thousand points.) Other algorithms for estimating the light source in the literature include Pentland (1982), Brooks and Horn (1985), Lee and Rosenfeld (1989), and Zheng and Chellappa (1991).
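For concreteness, the 3 x 3 normal-equation solve described above can be sketched as follows. This is a rough illustration under our own variable names, not the authors' code; it assumes the image has already been mapped to the mean surface's cylindrical grid, with one intensity sample and one mean-surface normal per grid point.

```python
import numpy as np

def estimate_light(intensity, normals):
    """Estimate the light direction L by minimizing
    sum_p (I_p - L . n0_p)^2 over points p with I_p > 0,
    in the spirit of equations B.1 and B.3.

    intensity: (P,) array of image values sampled on the cylindrical grid.
    normals:   (P, 3) array of mean-surface normals n0 at those samples.
    (Both arguments are hypothetical stand-ins for the paper's I and n0.)
    """
    mask = intensity > 0          # skip rectified (shadowed) points
    n = normals[mask]
    i = intensity[mask]
    # Normal equations: (sum_p n n^T) L = sum_p I n, a 3 x 3 linear solve.
    A = n.T @ n
    b = n.T @ i
    return np.linalg.solve(A, b)

def angular_error(l_est, l_true):
    """Estimation error: angle (in degrees) between estimated and true L."""
    c = np.dot(l_est, l_true) / (np.linalg.norm(l_est) * np.linalg.norm(l_true))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
```

On synthetic Lambertian data (intensity exactly max(L . n, 0)), the solve recovers L exactly, since the excluded shadowed points are the only ones where the linear model breaks down.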
Appendix C: Principal Components of Human Heads

In this appendix we give some mathematical details relevant to the computation of the principal components of human heads. For an excellent exposition on principal components as applied to the analysis of large datasets see Sirovich and Everson (1992). To start, let r^t(c) = [r^t(θ, ℓ) - r₀(θ, ℓ)] denote the surface deviation of the t-th head surface in the database from the mean, t = 1, ..., N. The principal components, or eigenmodes, are derived by solving the following set of linear equations:
\int dc' \, R(c, c') \, \Psi_i(c') = \lambda_i \, \Psi_i(c) \qquad (C.1)
where

R(c, c') = \frac{1}{N} \sum_{t=1}^{N} r^t(c) \, r^t(c') \qquad (C.2)

is the covariance matrix and λᵢ is the ith eigenvalue. Since c = (θ, ℓ) ranges over 256 x 200 = 51,200 points (see Appendix A), solving C.1 is equivalent to diagonalizing a 51,200 x 51,200 matrix R(c, c'), which is practically impossible even by today's computing standards. Luckily this diagonalization turns out to be unnecessary, because of a technique called the snapshot method invented by Sirovich (1987). The method asserts that the eigenmodes are linear combinations of the original surfaces and can be written as
\Psi_i(c) = \frac{1}{N} \sum_{\rho=1}^{N} a_i^{\rho} \, r^{\rho}(c) \qquad (C.3)
Substituting C.3 into equation C.1 and exchanging the order of summation and integration, equation C.1 can be rewritten as

\frac{1}{N} \sum_{\mu=1}^{N} \Big( \sum_{\rho=1}^{N} R_{\mu\rho} \, a_i^{\rho} \Big) r^{\mu}(c) = \lambda_i \, \frac{1}{N} \sum_{\mu=1}^{N} a_i^{\mu} \, r^{\mu}(c) \qquad (C.4)

where

R_{\mu\rho} = \frac{1}{N} \int dc \, r^{\mu}(c) \, r^{\rho}(c) \qquad (C.5)

is an N x N matrix. This implies that the coefficients aᵢ^ρ satisfy

\sum_{\rho=1}^{N} R_{\mu\rho} \, a_i^{\rho} = \lambda_i \, a_i^{\mu} \qquad (C.6)
Solving this equation is equivalent to diagonalizing an N x N matrix. For N = 200 this can be done very easily. The snapshot method has turned the intractable problem of diagonalizing a 51,200 x 51,200 matrix into an equivalent but much more tractable problem involving the diagonalization of a 200 x 200 matrix. To recover the original eigenmodes Ψᵢ, one first solves C.6 for the coefficients aᵢ^ρ. These define the linear combinations of the original surfaces r^t(c) that must be combined in C.3 to recover Ψᵢ.
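The snapshot method above can be sketched in a few lines (our own illustration, not the authors' code; `deviations` plays the role of the mean-subtracted surfaces r^t(c), flattened to vectors):

```python
import numpy as np

def snapshot_eigenmodes(deviations):
    """Snapshot method (Sirovich 1987) sketch: given N mean-subtracted
    surfaces flattened to length-c vectors (the rows of `deviations`),
    obtain the eigenmodes of the c x c covariance by diagonalizing only
    the N x N matrix of surface inner products (equations C.5 and C.6).

    Returns (eigenvalues, eigenmodes), largest eigenvalue first, with
    eigenmodes as unit-norm rows of an (N, c) array.
    """
    N = deviations.shape[0]
    R = deviations @ deviations.T / N      # N x N, not 51,200 x 51,200
    lam, a = np.linalg.eigh(R)             # solve C.6 for the coefficients a
    order = np.argsort(lam)[::-1]          # eigh returns ascending order
    lam, a = lam[order], a[:, order]
    modes = a.T @ deviations               # combine the surfaces as in C.3
    modes /= np.linalg.norm(modes, axis=1, keepdims=True)
    return lam, modes
```

The nonzero eigenvalues of the small N x N problem coincide with those of the full covariance, and each recovered mode is an eigenvector of the full covariance, which is what makes the reduction exact rather than approximate.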
Acknowledgments

We wish to thank Kathleen Robinette, Jennifer Whitestone, Barbara McQuiston, and Glen Geisen from the Human Engineering Division of the Wright-Patterson Air Force Base for their cooperation in making the "USAF mini-survey database" available to us. We also wish to thank Marc Rioux for supplying us with an initial set of laser-scanned heads from the Canadian National Research Council, Mitch Feigenbaum for very helpful discussions, Penio Penev for his assistance in aligning the database, and Christina Nargolwala for useful comments on the manuscript. This work is supported in part by a grant from the Office of Naval Research, contract number N00014-95-1-0381.
References
Anstis, S. 1991. Hidden assumptions in seeing shape from shading and apparent motion. In Representation and Vision, A. Gorea, ed. Cambridge University Press, Cambridge, UK.
Atick, J. J., Griffin, P. A., and Redlich, A. N. 1995. Face recognition from live video for real-world applications. Adv. Imaging May, 58-62.
Beymer, D., and Poggio, T. 1995. Face recognition from one model view. In ICCV Proceedings.
Brooks, M. J. 1982. Shape from shading discretely. Ph.D. thesis, Department of Computer Science, Essex University, Colchester, England.
Brooks, M. J., and Horn, B. K. P. 1985. Shape and source from shading. In Proc. Int. Joint Conf. Artificial Intell., Los Angeles, 932-936.
Churchland, P. S., Ramachandran, V. S., and Sejnowski, T. J. 1994. A critique of pure vision. In Large-Scale Neuronal Theories of the Brain, C. Koch and J. L. Davis, eds. MIT Press, Cambridge, MA.
Cutzu, F., and Edelman, S. 1995. Exploration of shape space. Weizmann Institute Tech. Rep. CS-TR 95-01.
Edelman, S. 1995. Representation of similarity in 3D object discrimination. Neural Comp. 7, 407-422.
Foley, J., van Dam, A., Feiner, S., and Hughes, J. 1990. Computer Graphics: Principles and Practice. Addison-Wesley, Reading, MA.
Gregory, R. L. 1970. The Intelligent Eye. McGraw-Hill, New York.
Guggenheimer, H. W. 1977. Differential Geometry. Dover Publications, New York.
Gulick, W. L., and Lawson, R. B. 1976. Human Stereopsis: A Psychophysical Analysis. Oxford University Press, New York.
Horn, B. K. P. 1970. Shape from shading: A method for obtaining the shape of a smooth opaque object from one view. Ph.D. thesis, Department of Electrical Engineering, MIT.
Horn, B. K. P., and Brooks, M. J. 1989. Shape from Shading. MIT Press, Cambridge, MA.
Hursh, T. M. 1976. The study of cranial form: Measurement techniques and analytical methods. In The Measures of Man, E. Giles and J. Friedlaender, eds. Peabody Museum Press, Cambridge, MA.
Ikeuchi, K., and Horn, B. K. P. 1981. Numerical shape from shading and occluding boundaries. Artificial Intelligence 17, 141-184.
Jolliffe, I. T. 1986. Principal Component Analysis. Springer-Verlag, New York.
Karhunen, K. 1946. Zur Spektraltheorie stochastischer Prozesse. Ann. Acad. Sci. Fennicae 37.
Kirby, M., and Sirovich, L. 1990. Application of the Karhunen-Loeve procedure for the characterization of human faces. IEEE Transact. Pattern Analysis and Machine Intelligence 12, 103-108.
Koenderink, J. J., van Doorn, A. J., and Kappers, A. M. L. 1992. Surface perception in pictures. Percept. Psychophys. 52, 18-36.
Lambert, J. H. 1760. Photometria sive de Mensura et Gradibus Luminis, Colorum et Umbrae. Eberhard Klett, Augsburg. Translation in W. Engelmann (1892), Lambert's Photometrie, Leipzig.
Lando, M., and Edelman, S. 1995. Generalization from a single view in face recognition. Network 6, 551-576.
Lee, C. H., and Rosenfeld, A. 1989. Improved methods of estimating shape from shading using the light source coordinate system. In Shape from Shading, B. K. P. Horn and M. J. Brooks, eds., pp. 323-569. MIT Press, Cambridge, MA.
Lehky, S. R., and Sejnowski, T. J. 1988. Network model of shape-from-shading: Neural function arises from both receptive and projective fields. Nature (London) 333, 452-454.
Loeve, M. M. 1955. Probability Theory. Van Nostrand, Princeton, NJ.
Mingolla, E., and Todd, J. 1986. Perception of solid shape from shading. Biol. Cybern. 53, 137-151.
Oliensis, J. 1991. Shape from shading as a partially well-constrained problem. Computer Vision, Graphics, Image Process.: Image Understand. 54, 163-183.
Pentland, A. P. 1982. Finding the illuminant direction. J. Opt. Soc. Am. 72, 448-455.
Pentland, A. P. 1984. Local shading analysis. IEEE Trans. Pattern Anal. Machine Intell. PAMI-6, 170-187.
Pentland, A. P. 1990. Linear shape from shading. Int. J. Computer Vision 4, 153-162.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. 1992. Numerical Recipes: The Art of Scientific Computing, 2nd ed. Cambridge University Press, Cambridge.
Ramachandran, V. S. 1988. Perception of shape from shading. Nature (London) 331, 163-166.
Ramachandran, V. S. 1990. Visual perception in people and machines. In AI and the Eye, A. Blake and T. Troscianko, eds., pp. 21-77. Wiley, New York.
Simmons, L. 1975. Diffuse reflectance spectroscopy: A comparison of the theories. Appl. Optics 15, 603-604.
Sirovich, L. 1987. Turbulence and the dynamics of coherent structures. Quart. Appl. Math. XLV, 561-590.
Sirovich, L., and Everson, R. 1992. Management and analysis of large scientific datasets. Int. J. Supercomputer Appl. 6, 50-68.
Sirovich, L., and Kirby, M. 1987. Low-dimensional procedure for the characterization of human faces. J. Opt. Soc. Am. A 4, 519-524.
Tikhonov, A. N., and Arsenin, V. Y. 1977. Solutions of Ill-posed Problems. W. H. Winston, Washington, D.C.
Todd, J. T., and Mingolla, E. 1983. Perception of surface curvature and direction of illumination from patterns of shading. J. Exp. Psychol. Human Percept. Perform. 9, 583-595.
Vetter, T., and Poggio, T. 1995. Linear object classes and image synthesis from a single example image. A.I. Memo No. 1531.
Wolberg, G. 1992. Digital Image Warping. IEEE Computer Society Press, Los Alamitos, CA.
Woodham, R. J. 1981. Analyzing images of curved surfaces. Artificial Intelligence 17, 117-140.
Zheng, Q., and Chellappa, R. 1991. Estimation of illuminant direction, albedo, and shape from shading. IEEE Trans. Pattern Anal. Machine Intelligence 13, 680-702.
Received August 8, 1995; accepted February 2, 1996.
ARTICLE
Communicated by Steven Nowlan
The Lack of A Priori Distinctions Between Learning Algorithms

David H. Wolpert
The Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA

This is the first of two papers that use off-training set (OTS) error to investigate the assumption-free relationship between learning algorithms. This first paper discusses the senses in which there are no a priori distinctions between learning algorithms. (The second paper discusses the senses in which there are such distinctions.) In this first paper it is shown, loosely speaking, that for any two algorithms A and B, there are "as many" targets (or priors over targets) for which A has lower expected OTS error than B as vice versa, for loss functions like zero-one loss. In particular, this is true if A is cross-validation and B is "anti-cross-validation" (choose the learning algorithm with largest cross-validation error). This paper ends with a discussion of the implications of these results for computational learning theory. It is shown that one cannot say: if empirical misclassification rate is low, the Vapnik-Chervonenkis dimension of your generalizer is small, and the training set is large, then with high probability your OTS error is small. Other implications for "membership queries" algorithms and "punting" algorithms are also discussed.

"Even after the observation of the frequent conjunction of objects, we have no reason to draw any inference concerning any object beyond those of which we have had experience." David Hume, in A Treatise of Human Nature, Book I, Part 3, Section 12.

1 Introduction
Neural Computation 8, 1341-1390 (1996) © 1996 Massachusetts Institute of Technology

Much of modern supervised learning theory gives the impression that one can deduce something about the efficacy of a particular learning algorithm (generalizer) without the need for any assumptions about the target input-output relationship one is trying to learn with that algorithm. At most, it would appear, to make such a deduction one has to know something about the training set as well as about the learning algorithm. Consider for example the following quotes from some well-known papers: "Theoretical studies link the generalization error of a learning
algorithm to the error on the training examples and the capacity of the learning algorithm (independent of concerns about the target)"; "We have given bounds (independent of the target) on the training set size vs. neural net size needed such that valid generalization can be expected"; "If our network can be trained to classify correctly . . . 1 - (1 - γ)ε of the k training examples, then the probability its [generalization] error is less than ε is at least [a function, independent of the target, of ε, γ, k, and the learning algorithm]"; "There are algorithms that with high probability produce good approximators regardless of the target function . . . . We do not need to make any assumption about prior probabilities (of targets)"; "To do Bayesian analysis, it is not necessary to work out the prior (over targets)"; "This shows that (the probability distribution of generalization accuracy) gets concentrated at higher and higher accuracy values as more examples are learned (independent of the target)." Similar statements can be found in the "proofs" that various supervised learning communities have offered for Occam's razor (Blumer et al. 1987; Berger and Jeffreys 1992; see also Wolpert 1994a, 1995). There even exists a field ("agnostic learning," Kearns et al. 1992) whose expressed purpose is to create learning algorithms that are assuredly effective even in the absence of assumptions about the target. Frequently the authors of these kinds of quotes understand that there are subtleties and caveats behind them. But the quotes taken at face value raise an intriguing question: can one actually get something for nothing in supervised learning? Can one get useful, caveat-free theoretical results that link the training set and the learning algorithm to generalization error, without making assumptions concerning the target? More generally, are there useful practical techniques that require no such assumptions?
As a potential example of such a technique, note that people usually use cross-validation without making any assumptions about the underlying target, as though the technique were universally applicable. This is the first of two papers that present an initial investigation of this issue. These papers can be viewed as an analysis of the mathematical "skeleton" of supervised learning, before the "flesh" of particular priors over targets and similar problem-specific distributions is introduced. It should be emphasized that the work in these papers is very preliminary; even the "skeleton" of supervised learning is extremely rich and detailed. Much remains to be done. The primary mathematical tool used in these papers is off-training set (OTS) generalization error, i.e., generalization error for test sets that contain no overlap with the training set. (In the conventional measure of generalization error such overlap is allowed.) Section 2 of this first paper explains why such a measure of error is of interest, and in particular emphasizes that it is equivalent to (more conventional) IID error in many scenarios of interest. Those who already accept that OTS error is of interest can skip this section. Section 3 presents the mathematical formalism used in this paper.
Section 4 presents the "no free lunch" (NFL) theorems (phrase due to D. Haussler). Some of those theorems show, loosely speaking, that for any two algorithms A and B, there are "as many" targets for which algorithm A has lower expected OTS error than algorithm B as vice versa (whether one averages over training sets or not). In particular, such equivalence holds even if one of the algorithms is random guessing; there are "as many" targets for which any particular learning algorithm gets confused by the data and performs worse than random as for which it performs better. As another example of the NFL theorems, it is shown explicitly that A is equivalent to B when B is an algorithm that chooses between two hypotheses based on which disagrees more with the training set, and A is an algorithm that chooses based on which agrees more with the training set. Other NFL theorems are also derived, showing, for example, that there are as many priors over targets in which A beats B (i.e., has lower expected error than B) as vice versa. In all this, the quotes presented at the beginning of this section are misleading at best. Next a set of simple examples is presented illustrating the theorems in scenarios in which their applicability is somewhat counterintuitive. In particular, a brief discussion is presented of the fact that there are as many targets for which it is preferable to choose between two learning algorithms based on which has larger cross-validation error ("anti-cross-validation") as based on which has smaller cross-validation error. This section also contains the subsection "Extensions for nonuniform averaging" that extends the NFL results beyond uniform averages; as that subsection shows, one can, for example, consider only priors over targets that are highly structured, and it is still often true that all algorithms are equal.
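The flavor of these averaging arguments can be checked directly on a toy problem. The following sketch (the setup and names are ours, not the paper's) enumerates every boolean target on a four-point input space and confirms that two quite different learning algorithms have exactly the same uniform-average OTS zero-one error of 1/2:

```python
from itertools import product

# Hypothetical toy setup: four inputs, boolean outputs, a fixed training
# set on two inputs, zero-one loss on the remaining off-training-set inputs.
INPUTS = range(4)
TRAIN_X = [0, 1]
OTS_X = [x for x in INPUTS if x not in TRAIN_X]

def avg_ots_error(algorithm):
    """Zero-one OTS error of `algorithm`, averaged uniformly over all 2^4
    boolean targets. `algorithm` maps a training set (a tuple of (x, y)
    pairs) to a hypothesis function x -> {0, 1}."""
    errors = []
    for target in product((0, 1), repeat=len(INPUTS)):
        data = tuple((x, target[x]) for x in TRAIN_X)
        h = algorithm(data)
        errors.append(sum(h(x) != target[x] for x in OTS_X) / len(OTS_X))
    return sum(errors) / len(errors)

# Two very different learning algorithms...
memorize_then_0 = lambda data: (lambda x: dict(data).get(x, 0))
always_1 = lambda data: (lambda x: 1)

# ...have identical average OTS error of exactly 1/2.
assert avg_ots_error(memorize_then_0) == 0.5
assert avg_ots_error(always_1) == 0.5
```

The reason is visible in the loop: the hypothesis depends only on the training data, while the target's values off the training set are, under the uniform average, coin flips independent of that data.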
Also in this section is the subsection "On uniform averaging," which provides the intellectual context for the analyses that result in the NFL theorems. Section 5 discusses the NFL theorems' implications for and relationship with computational learning theory. It starts with a discussion of empirical error and OTS error. This discussion makes clear that one must be very careful in trying to interpret uniform convergence (VC) results. In particular, it makes clear that one cannot say: if the observed empirical misclassification rate is low, the VC dimension of your generalizer is small, and the training set is large, then with high probability your OTS error is small. After this, the implications of the NFL results for active learning, and for "membership queries" algorithms and "punting" algorithms (those that may refuse to make a guess), are discussed. Small and simple proofs of claims made in the text of this first paper are collected in Appendix C. Paper one concentrates on relative sizes of sets of targets and the associated senses in which all algorithms are a priori equivalent. In contrast, paper two concentrates on other ways to compare algorithms. Some of these alternative comparisons reveal no distinctions between algorithms, just like the comparisons in paper one. However, some of the other alternative comparisons result in a priori distinctions between algorithms. In particular, it is pointed out in paper two that the equivalence of average OTS error between cross-validation and anti-cross-validation does not mean they have equivalent "head-to-head minimax" properties, and that algorithms can differ in those properties. Indeed, it may be that cross-validation has better head-to-head minimax properties than anti-cross-validation, and therefore can be a priori justified in that sense. Of course, the analysis of paper one does not rule out the possibility that there are targets for which a particular learning algorithm works well compared to some other one. To address the nontrivial aspects of this issue, paper two discusses the case where one averages over hypotheses rather than targets. The results of such analyses hold for all possible priors, since they hold for all (fixed) targets. This allows them to be used to prove, as a particular example, that cross-validation cannot be justified as a Bayesian procedure, i.e., there is no prior over targets for which, without regard for the learning algorithms in question, one can conclude that one should choose between those algorithms based on minimal rather than (for example) maximal cross-validation error. In addition, it is noted that for a very natural restriction of the class of learning algorithms, one can distinguish between using minimal rather than maximal cross-validation error, and the result is that one should use maximal error(!). All of the analysis up to this point assumes the loss function is in the same class as the zero-one loss function (which is assumed in almost all of computational learning theory). Paper two goes on to discuss other loss functions. In particular, the quadratic loss function modifies the preceding results considerably; for that loss function, there are algorithms that are a priori superior to other algorithms.
However, it is shown in paper two that no algorithm is superior to its "randomized" version, in which the set of potential fits to training sets is held fixed, but which fit is associated with which training set changes. In this sense one cannot a priori justify any particular learning algorithm, even for a quadratic loss function. Finally, paper two ends with a brief overview of some open issues and discusses future work. It cannot be emphasized enough that no claim is being made in this first paper that all algorithms are equivalent in practice, in the real world. In particular, no claim is being made that one should not use cross-validation in the real world. (I have done so myself many times in the past and intend to do so again in the future.) The sole concern of this paper is what can(not) be formally inferred about the utility of various learning algorithms if one makes no assumptions concerning targets. The work in these papers builds upon the analysis in Wolpert (1992, 1993). Some aspects of that early analysis are nicely synopsized in Schaffer (1993, 1994). Schaffer (1994) also contains an interesting discussion of the implications of the NFL theorems for real-world learning, as does
Murphy and Pazzani (1994). See also Wolpert and Macready (1995) for related work in the field of combinatorial optimization. The major extensions beyond this previous work that are contained in these two papers are (1) many more issues are analyzed (e.g., essentially all of paper two was not touched upon in the earlier work); and (2) many fewer restrictions are made (e.g., losses other than zero-one are considered, arbitrary kinds of noise are allowed, both hypotheses and targets are arbitrary probability distributions rather than single-valued functions from inputs to outputs, etc.).

2 Off-Training-Set Error
Many introductory supervised learning texts take the view that "the overall objective . . . is to learn from samples and to generalize to new, as yet unseen cases" (italics mine; see Weiss and Kulikowski 1991, for example). Similarly, in supervised learning it is common practice to try to avoid fitting the training set exactly, to try to avoid "overtraining." One of the major rationales given for this is that if one overtrains, "the resulting (system) is unlikely to classify additional points (in the input space) correctly" (italics mine; see Dietterich 1990). As another example, in Blumer et al. (1987), we read that "the real value of a scientific explanation lies not in its ability to explain (what one has already seen), but in predicting events that have yet to (be seen)." As a final example, in Mitchell and Blum (1994) we read that "(in Machine Learning we wish to know whether) any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples." This language makes clear that OTS behavior is a central concern of supervised learning, even though little theoretical work has been devoted to it to date. Some of the reasons for such concern are as follows.

1. In the low-noise (for outputs) regime, optimal behavior on the training set is trivially determined by lookup table memorization. Of course, this has nothing to do with behavior off of the training set; so in this regime, it is only such OTS behavior that is of interest.

2. In particular, in that low-noise regime, if one uses a memorizing learning algorithm, then for test sets overlapping with training sets the upper limit of possible test set error values shrinks as the training set grows. If one does not correct for this when comparing behavior for different sizes of the training set (as when investigating learning curves), one is comparing apples and oranges.
In that low-noise regime, correcting for this effect by renormalizing the range of possible error values is equivalent to requiring that test sets and training sets be distinct, i.e., is equivalent to using OTS error (see Wolpert 1994a).
3. In artificial intelligence, one of the primary fields concerned with supervised learning, the emphasis is often exclusively on generalizing to as yet unseen examples.

4. In the real world, very often the process generating the training set is not the same as that governing testing. In such scenarios, the usual justification for testing with the same process that generated the training set (and with it the possibility that test sets overlap with training sets) does not apply. One example of such a difference between testing and training is "active" or "query-based" or "membership-based" learning. In that kind of learning the learner chooses, perhaps dynamically, where in the input space the training set elements are to be. However, conventionally, there is no such control over the test set. So testing and training are governed by different processes. As another example, say we wish to learn tertiary protein structure from primary structure and then use that to aid drug design. We already know what tertiary structure corresponds to the primary structures in the training set. So we will never have those structures in the "test set" (i.e., in the set of nucleotide sequences whose tertiary structure we wish to infer to aid the drug design process). We will only be interested in OTS error.
5. Distinguishing the regime where test examples coincide with the training set from the one where there is no overlap amounts to splitting supervised learning along its natural "cleavage plane." Since behavior can be radically different in the two regimes, it is hard to see why one wouldn't want to distinguish them.

6. When the training set is much smaller than the full input space, the probability that a randomly chosen test set input value coincides with the training set is vanishingly small. So in such situations one expects the value of the OTS error to be well-approximated by the value of the conventional IID (independent identically distributed) error, an error that allows overlap between test sets and training sets. One might suppose that in such a small training set regime there is no aspect of OTS error not addressable by instead calculating IID error. This is wrong though, as the following several points illustrate.

7. First, even if OTS error is well approximated by IID error, it does not follow that quantities like the "derivatives" of the errors are close to one another. In particular, it does not follow that the sign of the slope of the learning curve, often an object of major interest, is the same for both errors over some region of interest. As an example, in Wolpert et al. (1995), it is shown that the expected OTS misclassification rate can increase with training set size, even if one
Lack of Distinctions between Learning Algorithms
1347
averages both over training sets and targets, and even if one uses the Bayes-optimal learning algorithm. In contrast, it is also shown there that under those same conditions, the expected IID misclassification rate is strictly nonincreasing as a function of training set size for the Bayes-optimal learning algorithm (see also the discussion in Wolpert 1994a concerning the statistical physics supervised learning formalism).
8. Second, although it is usually true that a probability distribution over IID error will well-approximate the corresponding distribution over OTS error, distributions conditioned on IID error can differ drastically from distributions conditioned on OTS error. This can be very important in understanding the results of computational learning theory. As an example of such a difference, let s be the empirical misclassification rate between a hypothesis and the target over the training set (i.e., the average number of disagreements over the training set), m the size of the training set, c_IID the misclassification rate over all of the input space (the IID zero-one loss generalization error), and c_OTS the misclassification rate over that part of the input space lying outside of the training set. (These terms are formally defined in the next section and at the beginning of Section 5.) Assume a uniform sampling distribution over the input space, a uniform prior over target input-output relationships, and a noise-free IID likelihood governing the training set generation. Then P(s | c_IID, m), the probability of getting empirical misclassification rate s given global misclassification rate c_IID, averaged over all training sets of size m, is just the binomial distribution (c_IID)^{sm} (1 − c_IID)^{(m − sm)} C^m_{sm}, where C^a_b = a!/[b!(a − b)!] (s can be viewed as the percentage of heads in m flips of a coin with bias c_IID toward heads). On the other hand, P(s | c_OTS, m), the probability of getting empirical misclassification rate s given off-training-set misclassification rate c_OTS, averaged over all training sets of size m, is independent of c_OTS. (This is proven in Section 5 below.) So the dependence of the empirical misclassification rate on the global misclassification rate depends crucially on whether it is the OTS or IID "global misclassification rate."

9. Third, often it is more straightforward to calculate a certain quantity for OTS rather than IID error.
In such cases, even if one's ultimate interest is IID error, it makes sense to instead calculate OTS error (assuming one is in a regime where OTS error well-approximates IID error). As an example, OTS error results presented in Section 5 mean that when the training set is much smaller than the full input space, P(c_IID | s, m) is (arbitrarily close to) independent of s, if the prior over target input-output relationships is uniform. This holds despite VC results saying that independent of the prior, it is highly unlikely for c_IID and s to differ significantly. (This may seem paradoxical at first. See the discussion in Section 5 for the "resolution.")
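The binomial form of P(s | c_IID, m) given in point 8 can be sketched numerically. This is only an illustration; the particular values of m and c_IID below are hypothetical, chosen for the demonstration rather than taken from the text.

```python
from math import comb

def p_s_given_c_iid(k, m, c_iid):
    """P(s = k/m | c_IID, m): with a uniform sampling distribution and a
    noise-free IID likelihood, each training point independently disagrees
    with the hypothesis with probability c_IID, giving a binomial."""
    return comb(m, k) * c_iid**k * (1 - c_iid)**(m - k)

m, c_iid = 10, 0.3  # hypothetical values for illustration
dist = [p_s_given_c_iid(k, m, c_iid) for k in range(m + 1)]
mean_s = sum(k * p for k, p in enumerate(dist)) / m

print(round(sum(dist), 10))  # 1.0: the probabilities sum to one
print(round(mean_s, 10))     # 0.3: on average the empirical rate tracks c_IID
```

The mean matching c_IID is exactly the coin-flip picture in the text: s is the fraction of heads in m flips of a coin with bias c_IID.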
The formal identity (in the appropriate limit) between a probability distribution over an OTS error and one over an IID error is established at the end of Appendix B. None of the foregoing means that the conventional IID error measure is "wrong." No claim is being made that one "should not" test with the same process that generated the training set. Rather the claim is simply that OTS testing is an issue of major importance. In that it gives no credit for memorization, it is also the natural way to investigate whether one can make assumption-free statements concerning an algorithm's generalization (!) ability.

3 The Extended Bayesian Formalism
This paper uses the extended Bayesian formalism (EBF; Wolpert 1992, 1994a; Wolpert et al. 1995). In the current context, the EBF is just conventional probability theory, applied to the case where one has a different random variable for the hypothesis output by the learning algorithm and for the target relationship. It is this crucial distinction that separates the EBF from conventional Bayesian analysis, and that allows the EBF (unlike conventional Bayesian analysis) to subsume all other major mathematical treatments of supervised learning, like computational learning theory, sampling theory statistics, etc. (see Wolpert 1994a). This section presents a synopsis of the EBF. Points (2), (8), (14), and (15) below can be skipped in a first reading. A quick reference for this section's synopsis can be found in Table 1. Readers unsure of any aspects of this synopsis, and in particular unsure of any of the formal basis of the EBF or justifications for any of its (sometimes implicit) assumptions, are directed to the detailed exposition of the EBF in Appendix A.

3.1 Overview.
1. The input and output spaces are X and Y, respectively. They contain n and r elements, respectively. A generic element of X is indicated by x, and a generic element of Y is indicated by y.

2. Random variables are indicated using capital letters. Associated instantiations of a random variable are indicated using lower case letters. Note though that some quantities (e.g., the space X) are neither random variables nor instantiations of random variables, and therefore their written case carries no significance. Only rarely will it be necessary to refer to a random variable rather than an instantiation of it. In accord with standard statistics notation, "E(A | b)" will be used to mean the expectation value of A given B = b, i.e., to mean ∫ da a P(a | b). (Sums replace integrals if appropriate.)
Table 1: Summary of the Terms in the EBF

The sets X and Y, of sizes n and r — the input and output spaces, respectively.
The set d, of m X-Y pairs — the training set.
The X-conditioned distribution over Y, f — the target, used to generate test sets.
The X-conditioned distribution over Y, h — the hypothesis, used to guess for test sets.
The real number c — the cost.
The X-value q — the test set point.
The Y-value y_F — the sample of the target f at point q.
The Y-value y_H — the sample of the hypothesis h at point q.
P(h | d) — the learning algorithm.
P(f | d) — the posterior.
P(d | f) — the likelihood.
P(f) — the prior.
If c = L(y_F, y_H), L(·, ·) is the "loss function."
L is "homogeneous" if Σ_{y_F} δ[c, L(y_H, y_F)] is independent of y_H.
If we restrict attention to fs given by a fixed noise process superimposed on an underlying single-valued function φ from X to Y, and if Σ_φ P(y_F | q, φ) is independent of y_F, we have "homogeneous" noise.
3. The primary random variables are the hypothesis X-Y relationship output by the learning algorithm (indicated by H), the target (i.e., "true") X-Y relationship (F), the training set (D), and the real-world cost (C). These variables are related to one another through other random variables representing the (test set) input space value (Q), and the associated target and hypothesis Y-values, Y_F and Y_H, respectively (with instantiations y_F and y_H, respectively). This completes the list of random variables. As an example of the relationship between these random variables and supervised learning, f, a particular instantiation of a target, could refer to a "teacher" neural net together with superimposed noise. This noise-corrupted neural net generates the training set d. The hypothesis h on the other hand could be the neural net made by one's "student" algorithm after training on d. Then q would be an input element of the test set, y_F and y_H associated samples of the outputs of the two neural
nets for that element (the sampling of y_F including the effects of the superimposed noise), and c the resultant "cost" [e.g., c could be (y_F − y_H)²].

3.2 Training Sets and Targets.
4. m is the number of elements in the (ordered) training set d. {d_X(i), d_Y(i)} is the set of m input and output values in d. m′ is the number of distinct values in d_X.

5. Targets f are always assumed to be of the form of X-conditioned distributions over Y, indicated by the real-valued function f(x ∈ X, y ∈ Y) [i.e., P(y_F | f, q) = f(q, y_F)]. Equivalently, where S_r is defined as the r-dimensional unit simplex, targets can be viewed as mappings f : X → S_r. Any restrictions on f are imposed by P(f, h, d, c), and in particular by its marginalization, P(f). Note that any output noise process is automatically reflected in P(y_F | f, q). Note also that the definition P(y_F | f, q) = f(q, y_F) only directly refers to the generation of test set elements; in general, training set elements can be generated from targets in a different manner.

6. The "likelihood" is P(d | f). It says how d was generated from f. It is "vertical" if P(d | f) is independent of the values f(x, y_F) for those x ∉ d_X. As an example, the conventional IID likelihood is

P(d | f) = Π_{i=1}^{m} π(d_X(i)) f(d_X(i), d_Y(i))

[where π(x) is the "sampling distribution"]. In other words, under this likelihood d is created by repeatedly and independently choosing an input value d_X(i) by sampling π(·), and then choosing an associated output value by sampling f[d_X(i), ·], the same distribution used to generate test set outputs. This likelihood is vertical. As another example, if there is noise in generating training set X values but none for test set X values, then we usually do not have a vertical P(d | f). (This is because, formally speaking, f directly governs the generation of test sets, not training sets; see Appendix A.)

7. The "posterior" usually means P(f | d), and the "prior" usually means P(f).

8. It will be convenient at times to restrict attention to fs that are constructed by adding noise to a single-valued function from X to Y, φ. For a fixed noise process, such fs are indexed by the underlying φ. The noise process is "homogeneous" if the sum over all φ of P(y_F | q, φ) is independent of y_F. An example of a homogeneous noise process is classification noise that with probability p replaces φ(q) with some other value in Y, where that "other value in Y" is chosen uniformly and randomly.
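The classification-noise example of point 8 can be checked directly by brute force. The space sizes and noise level below are hypothetical toy values, used only to make the homogeneity condition concrete.

```python
from itertools import product
from fractions import Fraction

r, n = 3, 2         # hypothetical toy sizes: |Y| = 3, |X| = 2
p = Fraction(1, 5)  # hypothetical noise level

def p_yf_given_q_phi(y_f, q, phi):
    """P(y_F | q, phi) for classification noise: keep phi(q) with probability
    1 - p, otherwise replace it by a uniformly chosen *other* value in Y."""
    return 1 - p if y_f == phi[q] else p / (r - 1)

q = 0
sums = [sum(p_yf_given_q_phi(y_f, q, phi)
            for phi in product(range(r), repeat=n))
        for y_f in range(r)]
print(sums)  # the same value for every y_F: the noise is homogeneous
```

Each sum works out to r^{n−1}(1 − p) + (r^n − r^{n−1}) p/(r − 1) = r^{n−1}, with no y_F dependence, which is exactly the homogeneity condition.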
3.3 The Learning Algorithm.
9. Hypotheses h are always assumed to be of the form of X-conditioned distributions over Y, indicated by the real-valued function h(x ∈ X, y ∈ Y) [i.e., P(y_H | h, q) = h(q, y_H)]. Equivalently, where S_r is defined as the r-dimensional unit simplex, hypotheses can be viewed as mappings h : X → S_r.

Any restrictions on h are imposed by P(f, h, d, c). Here and throughout, a "single-valued" distribution is one that, for a given x, is a delta function about some y. Such a distribution is equivalent to a single-valued function from X to Y. As an example, if one is using a neural net as one's regression through the training set, usually the (neural net) h is single-valued. On the other hand, when one is performing probabilistic classification (as in softmax), h is not single-valued.
10. Any (!) learning algorithm (aka "generalizer") is given by P(h | d), although writing down a learning algorithm's P(h | d) explicitly is often quite difficult. A learning algorithm is "deterministic" if the same d always gives the same h. Backprop with a random initial weight is not deterministic. Nearest neighbor is. Note that since d is ordered, "on-line" learning algorithms are subsumed as a special case.

11. The learning algorithm only sees the training set d, and in particular does not directly see the target. So P(h | f, d) = P(h | d), which means that P(h, f | d) = P(h | d) × P(f | d), and therefore P(f | h, d) = P(h, f | d)/P(h | d) = P(f | d).

3.4 The Cost and "Generalization Error".
12. For the purposes of this paper, the cost c is associated with a particular y_H and y_F, and is given by a loss function L(y_H, y_F). As an example, in regression, often we have "quadratic loss": L(y_H, y_F) = (y_H − y_F)². L(·, ·) is "homogeneous" if the sum over y_F of δ[c, L(y_H, y_F)] is some function Λ(c), independent of y_H (δ here being the Kronecker delta function). As an example, the "zero-one" loss traditional in computational learning theory [L(a, b) = 1 if a ≠ b, 0 otherwise] is homogeneous.

13. In the case of "IID error" (the conventional error measure), P(q | d) = π(q) (so test set inputs are chosen according to the same distribution that determines training set inputs). In the case of OTS error, P(q | d) = [δ(q ∉ d_X) π(q)] / [Σ_q δ(q ∉ d_X) π(q)], where δ(z) = 1 if z is true, 0 otherwise. Subscripts OTS or IID on c correspond to using those respective kinds of error.
14. The "generalization error function" used in much of supervised learning is given by c′ = E(C | f, h, d). (Subscripts OTS or IID on c′ correspond to using those respective ways to generate q.) It is the average over all q of the cost c, for a given target f, hypothesis h, and training set d. In general, probability distributions over c′ do not by themselves determine those over c or vice versa, i.e., there is not an injection between such distributions. However, the results in this paper in general hold for both c and c′, although they will be presented only for c. In addition, especially when relating results in this paper to theorems in the literature, sometimes results for c′ will implicitly be meant even when the text still refers to c. (The context will make this clear.)
15. When the size of X, n, is much greater than the size of the training set, m, probability distributions over c_OTS and distributions over c_IID become identical. (Although, as mentioned in the previous section, distributions conditioned on c_IID can be drastically different from those conditioned on c_OTS.) This is established formally in Appendix B.

4 The No-Free-Lunch Theorems
In Wolpert (1992) it is shown that P(c | d) = ∫ df dh P(h | d) P(f | d) M_{c,d}(f, h), where so long as the loss function is symmetric in its arguments, M_{c,d}(·, ·) is symmetric in its arguments. (See point (11) of the previous section.) In other words, for the most common kinds of loss functions (zero-one, quadratic, etc.), the probability of a particular cost is determined by an inner product between your learning algorithm and the posterior probability [f and h being the component labels of the d-indexed infinite-dimensional vectors P(f | d) and P(h | d), respectively]. Metaphorically speaking, how "aligned" you (the learning algorithm) are with the universe (the posterior) determines how well you will generalize.

The question arises though of how much can be said concerning a particular learning algorithm's generalization behavior without specifying the posterior (which usually means without specifying the prior). More precisely, the goal is to address the issue of how F1, the set of targets f for which algorithm A outperforms algorithm B, compares to F2, the set of targets f for which the reverse is true. To analyze this issue, the simple trick is used of comparing the average over f of f-conditioned probability distributions for algorithm A to the same average for algorithm B. The relationship between those averages is then used to compare F1 to F2. Evaluating such f-averages results in a set of NFL theorems. In this section, first I derive the NFL theorems for the case where the target f need not be single-valued. In this case, the theorems say that uniformly averaged over all f, all learning algorithms are identical. The implications
of this for how F1 compares to F2 are discussed after the derivation of the theorems. When the target f is not single-valued, it is a (countable) set of real numbers (one for each possible x-y pair). Accordingly, any P(f) is a probability density function in a multidimensional space. That makes integrating over all P(f)s a subtle mathematical exercise. However in the function+noise scenario, for a fixed noise process, f is indexed by a single-valued function φ. Since there are a countable number of φs, any P(φ) is a countable set of real numbers, and it is straightforward to integrate over all P(φ). Doing so gives some more NFL theorems, where one uniformly averages over all priors rather than just over all targets. These additional theorems are presented after those involving averages over all targets f. After deriving these theorems, I present some examples of them, designed to highlight their counterintuitive aspects. I also present a general discussion of the significance of the theorems, and in particular of the uniform averaging that goes into deriving them. Here and throughout this paper, when discussing non-single-valued fs, "A(f) uniformly averaged over all targets f" means ∫ df A(f) / ∫ df 1. Note that these integrals are implicitly restricted to those f that constitute X-conditioned distributions over Y, i.e., to the appropriate product space of unit simplices. (The details will not matter, because integrals will almost never need to be evaluated. But formally, integrals over targets f are over a full rn-dimensional Euclidean space, with a product of Dirac delta functions and Heaviside functions inside the integrand enforcing the restriction to the Cartesian product of simplices.) Similar meanings for "uniformly averaged" are assumed if we are talking about averaging over other quantities, like P(φ).
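The uniform average over the simplex that these integrals denote can be seen in the smallest possible case. The discretization below is a hypothetical illustrative sketch (n = 1, r = 2), not part of the text's formal machinery.

```python
from fractions import Fraction

# Smallest case of the uniform f-average: n = 1, r = 2. Each target f is then a
# single point t on the unit 1-simplex, with f(q, y=0) = t and f(q, y=1) = 1 - t.
# The uniform average of f(q, y=0), i.e. (integral of t dt)/(integral of dt),
# is computed below with an exact midpoint rule over N cells of [0, 1].
N = 1000
avg = sum(Fraction(2 * i + 1, 2 * N) for i in range(N)) / N
print(avg)  # 1/2, i.e., 1/r
```

Getting 1/r here is the same symmetry that drives the f-averages in the theorems below: averaging any component of f over the whole simplex lands at the centroid.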
4.1 Averaging over All Target Distributions f. We start with the following simple lemma, which recurs frequently in the subsequent analysis. Its proof is in Appendix C.
Consider now the "(uniform) random learning algorithm": for any test set element not in the training set, guess the output randomly (independently of the training set d), according to a uniform distribution. (With certain extra stipulations concerning behavior for test set questions q ∈ d_X, this is a version of the Gibbs learning algorithm.) An immediate corollary of Lemma (1), proven in Appendix C, is that for this algorithm, for a symmetric homogeneous loss function, P(c | d) = Λ(c)/r for all training sets d. Similarly, for all priors over targets f, indicated by α, both P(c | m, α) and P(c | d, α) equal Λ(c)/r for this random learning algorithm.
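For zero-one loss, the claim P(c | d) = Λ(c)/r for the random learning algorithm can be enumerated exactly. The output-space size below is a hypothetical toy value.

```python
from fractions import Fraction
from itertools import product

r = 4  # hypothetical |Y|

def L(a, b):
    """Zero-one loss, which is homogeneous: Lambda(0) = 1, Lambda(1) = r - 1."""
    return int(a != b)

# The uniform random learner guesses y_H uniformly on Y. For each fixed y_F the
# induced cost distribution is therefore Lambda(c)/r; averaging over y_F (done
# below for convenience) leaves that distribution unchanged.
p_c = {0: Fraction(0), 1: Fraction(0)}
for y_f, y_h in product(range(r), repeat=2):
    p_c[L(y_h, y_f)] += Fraction(1, r * r)

print(p_c[0], p_c[1])  # 1/4 3/4, i.e., Lambda(0)/r and Lambda(1)/r
```

With r = 4 this gives P(c = 0) = 1/4 and P(c = 1) = 3/4, matching Λ(0)/r = 1/r and Λ(1)/r = (r − 1)/r.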
This simple kind of reasoning suffices to get "NFL" results for the random algorithm, even without invoking a vertical likelihood. However, more is needed for scenarios concerning other algorithms, scenarios in which there is "randomness," but it concerns targets rather than hypotheses. This is because we are interested in probability distributions conditioned on target-based quantities (f, α, etc.), so results for when there is randomness in hypothesis-based quantities do not immediately carry over to results for randomness in target-based quantities. To analyze these alternative scenarios, we start with the following simple implication of Lemma (1) (see Appendix C): the uniform average over all targets f of P(c | f, d) equals

(1/r) Σ_{y_H, y_F, q} δ[c, L(y_H, y_F)] P(y_H | q, d) P(q | d)

Recalling the definition of homogeneous loss L, we have now proven the following:

Theorem 1. For homogeneous loss L, the uniform average over all f of P(c | f, d) equals Λ(c)/r.
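The algorithm-independence in Theorem (1) can be checked by brute force for zero-one loss. The tiny space sizes below are hypothetical, and (by the centroid symmetry of the simplex average) restricting the enumeration to single-valued targets yields the same uniform average here.

```python
from itertools import product

n, r = 3, 3          # hypothetical tiny spaces: |X| = 3, |Y| = 3
Y = range(r)
q = 2                # any fixed test point

# Whatever value y_H a learning algorithm guesses at q, the average zero-one
# cost over all single-valued targets phi is the same: (r - 1)/r.
avgs = []
for y_h in Y:
    avg = sum(phi[q] != y_h for phi in product(Y, repeat=n)) / r**n
    avgs.append(avg)

print(avgs)  # 2/3 for every possible guess: independent of the algorithm
```

Every guess gets average cost (r − 1)/r = 2/3, which is the zero-one instance Λ(1)/r of Theorem (1).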
Note that this f-average is independent of the learning algorithm. So Theorem (1) constitutes an NFL theorem for distributions conditioned on targets f and training sets d; it says that uniformly averaged over all f, such distributions are independent of the learning algorithm. Note that this result holds independent of the sampling distribution, the training set, or the likelihood. As an example of Theorem (1), for the Λ(c) of zero-one loss, we get the f-average of E(C | f, d) = (r − 1)/r. More generally, for an even broader set of loss functions L than homogeneous Ls, the sum over target outputs of L(y_H, y_F) is independent of the hypothesis output, y_H. For such Ls we get generalizer-independence for the uniform average over targets f of E(C | f, d), even if we do not have such independence for the uniform average of P(c | f, d).

Note that Theorem (1) does not rely on having q lie outside of d_X; it holds even for IID error. In addition, since both f and d are fixed in the conditional probability in Theorem (1), any statistical coupling between f and d is ignored in that theorem. For these kinds of reasons, Theorem (1) is not too interesting by itself. The main use of it is to derive other results, results that rely on using OTS error and that are affected by the coupling of targets f and training sets d. As the first of these, I will show how to use Theorem (1) to evaluate the uniform f-average of P(c | f, m) for OTS error. In evaluating the uniform f-average of P(c | f, m), not all fs contribute the same amount to the answer. That is because

P(c | f, m) = Σ_d P(c | f, d) P(d | f)
and so long as the likelihood P(d | f) is not uniform over f, we cannot just pull the outside f-average through to use Theorem (1) to reduce the P(c | f, d) to Λ(c)/r. This might lead one to suspect that if the learning algorithm is "biased" toward the targets f contributing the most to the uniform f-average of P(c | f, m), then the average would be weighted toward (or away from) low values of cost, c. However this is wrong; it turns out that the uniform f-average of P(c | f, m) is independent of the learning algorithm, if one restricts oneself to OTS error. In fact, assume that we have any P(q | d) such that P(q ∈ d_X | d) = 0 [in particular, P(q | d) need not be the OTS P(q | d) discussed above]. For such a scenario, for a vertical likelihood [i.e., a P(d | f) that is independent of the values of f(x ∉ d_X, ·)], we get the following result (see Appendix C):
Theorem 2. For OTS error, a vertical P(d | f), and a homogeneous loss L, the uniform average over all targets f of P(c | f, m) equals Λ(c)/r.
Again, this holds for any learning algorithm, and any sampling distribution. Note that this result means in particular that the "weight" of fs on which one's algorithm performs worse than the random algorithm equals the weight for which it performs better. In other words, one can just as readily have a target for which one's algorithm performs worse than random guessing as one for which it performs better. The pitfall we wish to avoid in supervised learning is not simply that our algorithm performs as poorly as random guessing, but rather that it performs worse than random! Using similar reasoning to that used to prove Theorem (2), we can derive the following theorem concerning the distribution of interest in conventional Bayesian analysis, P(c | d):
Theorem 3. For OTS error, a vertical P(d | f), uniform P(f), and a homogeneous loss L, P(c | d) = Λ(c)/r.

The reader should be wary of equating the underlying logic behind a target-averaging NFL theorem [e.g., Theorem (2)] with that behind a uniform-prior NFL theorem [e.g., Theorem (3)]. In particular, there are scenarios [i.e., conditioning events in the conditional distribution "P(c | ...)"] in which one of these kinds of NFL theorem holds but not the other. See the discussion surrounding Theorem (9) below for an example. As an immediate corollary of Theorem (3), we have the following.
Corollary 1. For OTS error, a vertical P(d | f), uniform P(f), and a homogeneous loss L, P(c | m) = Λ(c)/r.

As an aside, so long as L(a, b) = L(b, a) for all pairs a and b, the mathematics of the EBF is symmetric under interchange of h and f. [In particular, for any loss L, it is both true that P(f | h, d) = P(f | d), and
that P(h | f, d) = P(h | d).] Accordingly, all of the NFL theorems have analogues where the hypothesis h rather than the target f is fixed and then uniformly averaged over. So for example, for OTS error, homogeneous L(·, ·), and a generalizer such that P(d | h) is independent of h(x ∉ d_X), the uniform average over h of P(c | h, m) = Λ(c)/r. [For such a nondeterministic generalizer, assuming h1(x) = h2(x) for all x ∈ d_X, the probability that the training set used to produce the hypothesis was d is the same, whether that produced hypothesis is h1 or h2.] Such results say that averaged over all hs the algorithm might produce, all posteriors over targets (and therefore all priors) lead to the same probability of cost, under the specified conditions.

4.2 Averaging over All Functions φ. Now consider the scenario where only those targets f are allowed that can be viewed as single-valued functions φ from X to Y with noise superimposed (see Section 3). To analyze such a scenario, I will no longer consider uniform averages involving f directly, but rather uniform averages involving φ. Accordingly, such averages are now sums rather than integrals. (For reasons of space, only here in this subsection will I explicitly consider the case of fs that are single-valued φs with noise superimposed.) In this new scenario, Lemma (1) still holds, with f replaced by φ. However now we cannot simply set the uniform φ-average of P(y_F | q, φ) to 1/r, in analogy to the reasoning implicitly used above [see the proof of the "implication of Lemma (1)" in Appendix C]. To give an extreme example, if the test set noise process is highly skewed and sends all φ(q) to some fixed value y1, then the φ-average is 1 for y_F = y1, 0 otherwise. Intuitively, if the noise process always results in the test value y1, then we can make a priori distinctions between learning algorithms; an algorithm that always guesses y1 outside of d_X will beat one that does not.
So for simplicity restrict attention to those noise processes for which the uniform φ-average of P(y_F | q, φ) is independent of the target output value y_F. Recall that such a (test set) noise process is called "homogeneous." So following along with our previous argument (recounted in Appendix C), if we sum our φ-average of P(y_F | q, φ) over all y_F, then by pulling the sum over y_F inside the average over φ, we see that the sum must equal 1. [Again, see the proof of the "implication of Lemma (1)."] Accordingly, the φ-average equals 1/r. So we have the following analog of Theorem (1):
Theorem 4. For homogeneous loss L and a homogeneous test-set noise process, the uniform average over all single-valued target functions φ of P(c | φ, d) equals Λ(c)/r.

Note that the noise process involved in generating the training set is irrelevant to this result. (Recall that "homogeneous noise" refers to y_F, and that y_F and y_H are Y values for the test process, not the
training process.) This is also true for the results presented below. So in particular, all these results hold for any noise in the generation of the training set, so long as our error measure is concerned with whether or not h equals the (homogeneous-noise-corrupted) sample of the underlying φ at the test point q. (Note, in particular, that such a measure is always appropriate for noise-free, and therefore trivially homogeneous, test set generation.) We can proceed from Theorem (4) to get a result for P(c | φ, m) in the exact same manner as Theorem (1) gave Theorem (2).
Theorem 5. For OTS error, a vertical P(d | φ), homogeneous loss L, and a homogeneous test-set noise process, the uniform average over all single-valued target functions φ of P(c | φ, m) equals Λ(c)/r.

Just as the logic behind Theorem (2) also resulted in Theorem (3), so we can use the logic behind Theorem (5) to derive the following.
Theorem 6. For OTS error, a vertical P(d | φ), homogeneous loss L, uniform P(φ), and a homogeneous test-set noise process, P(c | d) equals Λ(c)/r.

Just as Theorem (3) resulted in Corollary (1), so Theorem (6) establishes the following.
Corollary 2. For OTS error, a vertical P(d | φ), homogeneous loss L, a homogeneous test-set noise process, and uniform P(φ), P(c | m) equals Λ(c)/r.

We are now in a position to extend the NFL theorems to the case where neither the prior nor the target is specified in the conditioning event of our distribution of interest, and the prior need not be uniform. For such a case, the NFL results concern uniformly averaging over priors P(φ) rather than over target functions φ. Since there are r^n possible single-valued φ, P(φ) is an r^n-dimensional real-valued vector lying on the unit simplex. Indicate that vector as α, and one of its components [i.e., P(φ) for one φ] as α_φ. [More formally, α is a hyperparameter: P(φ | α) = α_φ.] So the uniform average over all α of P(c | m, α) is (proportional to) ∫ dα P(c | m, α) = ∫ dα [Σ_φ P(φ | α) P(c | m, α, φ)], where the integral is restricted to the r^n-dimensional simplex. [α is restricted to lie on that simplex, since Σ_φ P(φ | α) = Σ_φ α_φ = 1.] It is now straightforward to use Theorem (5) to establish the following result (see Appendix C):
Theorem 7. Assume OTS error, a vertical P(d | φ), homogeneous loss L, and a homogeneous test-set noise process. Let α index the priors P(φ). Then the uniform average over all α of P(c | m, α) equals Λ(c)/r.

It is somewhat more involved to calculate the uniform average over all priors (indexed by) α of P(c | d, α). The result is derived in Appendix D:
Theorem 8. Assume OTS error, a vertical P(d | φ), homogeneous loss L, and a homogeneous test-set noise process. Let α index the priors P(φ). Then the uniform average over all α of P(c | d, α) equals Λ(c)/r.

By Corollary (2), Theorem (7) means that the average over all priors α of P(c | m, α) equals P(c | m, uniform prior). Similarly, by Theorem (6), Theorem (8) means that the average over all priors α of P(c | d, α) equals
P(c | d, uniform prior). In this sense, whatever one's learning algorithm, one can just as readily have a prior that gives worse performance than that associated with the uniform prior as one that gives better performance. To put this even more strongly, consider again the uniform random learning algorithm discussed at the beginning of this section. By Theorems (7) and (8), for any learning algorithm, one can just as readily have a prior for which that algorithm performs worse than the random learning algorithm, i.e., worse than random guessing, as a prior for which one's algorithm performs better than the random learning algorithm.

It may be that for some particular (homogeneous) noise process, for some training sets d and target functions φ, P(c | φ, d) is not defined. This is the situation, for example, when there is no noise [d must lie on φ, so for any other d and φ, P(c | φ, d) is meaningless]. In such a situation, averaging over all φs with d fixed [as in Theorem (4)] is not well-defined. Such situations can, at the expense of extra work, be dealt with explicitly. [The result is essentially that all of the results of this section except Theorem (4) are obtained.] Alternatively, one can usually approximate the analysis for such noise processes arbitrarily well by using other, infinitesimally different noise processes, processes for which P(c | φ, d) is always defined.

4.3 Examples. Example 1: Say we have no noise in either training set or test set generation, and the zero-one loss L(·, ·). Fix two possible (single-valued) hypotheses, h1 and h2. Let learning algorithm A take in the training set d, and guess whichever of h1 and h2 agrees with d more often (the "majority" algorithm). Let algorithm B guess whichever of h1 and h2 agrees less often with d (the "antimajority" algorithm). If h1 and h2 agree equally often with d, both algorithms choose randomly between them. Then averaged over all target functions φ, E(C | φ, m) is the same for A and B.
As an example, take n = 5 and r = 2 (i.e., X = {0, 1, 2, 3, 4} and Y = {0, 1}) and a uniform sampling distribution π(x). Take m', the number of distinct elements in the training set, to equal 4. For expository purposes, I will explicitly show that the average over all φ of E(C | φ, m') is the same for A and B. [To calculate the average over all φ of E(C | φ, m), one sums the average of E(C | φ, m') P(m' | m) over all m'.] I will take h1 = the all 1s h, and h2 = the all 0s h.
1. There is one φ that is all 0s (i.e., for which for all X values, Y = 0).
Lack of Distinctions between Learning Algorithms
1359
For that φ, algorithm A always picks h2, and therefore E(C | φ, m' = 4) = 0; algorithm A performs perfectly. For algorithm B, expected C = 1.
2. There are five φs with one 1. For each such φ, the probability that the training set has all four zeroes is 0.2. The value of C for such training sets is 1 for algorithm A, 0 for B. For all other training sets, C = 0 for algorithm A, and 1 for algorithm B. So for each of these φs, the expected value of C is 0.2(1) + 0.8(0) = 0.2 for A, and 0.2(0) + 0.8(1) = 0.8 for B.
3. There are 10 φs with two 1s. For each such φ, there is a 0.4 probability that the training set has one 1, and a 0.6 probability that it has both 1s. (It can't have no 1s.) If the training set has a single 1, so does the OTS region, and C = 1 for A, 0 for B. If the training set has two 1s, then our algorithms guess randomly, so (expected) C = 0.5 for both algorithms. Therefore for each of these φs, expected C = 0.4(1) + 0.6(0.5) = 0.7 for algorithm A, and 0.4(0) + 0.6(0.5) = 0.3 for B. Note that here B outperforms A.
4. The case of φs with three 1s is the same as the case of φs with two 1s (just with "1" replaced by "0" throughout). Similarly, the case of four 1s matches that of one 1, and five 1s matches zero 1s. So it suffices to consider only the cases already investigated, where the number of 1s is zero, one, or two.

5. Adding them up, for algorithm A we have one φ with (expected) C = 0, five with C = 0.2, and 10 with C = 0.7. So averaged over all those φs, we get [1(0) + 5(0.2) + 10(0.7)] / [1 + 5 + 10] = 0.5. This is exactly the same expected error as algorithm B has: expected error for B is [1(1) + 5(0.8) + 10(0.3)] / 16 = 0.5. QED.
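The bookkeeping above can be checked by exhaustive enumeration. The following sketch (the representation of targets and training sets is my own, not from the text) averages the expected OTS zero-one error of the majority and antimajority algorithms over all 32 targets, and recovers 1/2 for both:

```python
from itertools import product, combinations
from fractions import Fraction

X = range(5)  # X = {0,1,2,3,4}, Y = {0,1}; h1 = the all-1s h, h2 = the all-0s h

def expected_ots_error(phi, algo):
    """E(C | phi, m'=4): average over the 5 equally likely 4-point training sets."""
    total = Fraction(0)
    for S in combinations(X, 4):
        agree_h1 = sum(phi[x] == 1 for x in S)   # agreements of d with the all-1s h
        agree_h2 = sum(phi[x] == 0 for x in S)   # agreements of d with the all-0s h
        ots = next(x for x in X if x not in S)   # the single off-training-set point
        err_h1, err_h2 = int(phi[ots] != 1), int(phi[ots] != 0)
        if agree_h1 == agree_h2:                 # tie: both algorithms guess randomly
            err = Fraction(err_h1 + err_h2, 2)
        elif (agree_h1 > agree_h2) == (algo == "majority"):
            err = err_h1
        else:
            err = err_h2
        total += Fraction(err, 5)
    return total

avgs = {a: sum(expected_ots_error(phi, a) for phi in product((0, 1), repeat=5)) / 32
        for a in ("majority", "antimajority")}
print(avgs)  # {'majority': Fraction(1, 2), 'antimajority': Fraction(1, 2)}
```

The per-target values match the text: the all-0s target gives expected C = 0 for A, the one-1 targets give 0.2, and the two-1s targets give 0.7.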
See Example 5 in paper two for a related example. Example 2: An algorithm that uses cross-validation to choose among a prefixed set of learning algorithms does no better on average than one that does not, so long as the loss function is homogeneous. In addition, cross-validation does no better than anti-cross-validation (choosing the learning algorithm with the worst cross-validation error) on average. In particular, the error on the validation set can be measured using a nonhomogeneous loss (e.g., quadratic loss), and this result will still hold; all that is required is that we use a homogeneous loss to measure error on the test set. Alternatively, construct the following algorithm: "If cross-validation says one of the algorithms under consideration has particularly low error in comparison to the others, use that algorithm. Otherwise, choose randomly among the algorithms." Averaged over all targets, this algorithm will do exactly as well as the algorithm that always guesses randomly among the algorithms. In this particular sense, you cannot rely on cross-validation's error estimate (unless you impose a prior over targets or some such).
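In the same five-point setting as Example 1, the claim about anti-cross-validation can also be checked exhaustively. The sketch below is my own illustrative construction (the leave-one-out splitting scheme and all names are assumptions, not from the text): it uses leave-one-out error on the training set to choose between the majority and antimajority algorithms, then averages OTS zero-one error over all targets.

```python
from itertools import product, combinations
from fractions import Fraction

X = range(5)

def fit(dY, algo):
    """Majority/antimajority choice between the all-1s and all-0s hypotheses."""
    ones = sum(dY)
    if 2 * ones == len(dY):
        return None                    # tie: random guess, expected error 1/2
    pick_h1 = (2 * ones > len(dY)) == (algo == "majority")
    return 1 if pick_h1 else 0

def loo_error(dY, algo):
    """Leave-one-out cross-validation error of `algo` on the training labels."""
    errs = Fraction(0)
    for i in range(len(dY)):
        g = fit(dY[:i] + dY[i + 1:], algo)
        errs += Fraction(1, 2) if g is None else int(g != dY[i])
    return errs / len(dY)

def avg_ots_error(chooser):
    """Average OTS error over all 32 targets and all 5 training sets."""
    total = Fraction(0)
    for phi in product((0, 1), repeat=5):
        for S in combinations(X, 4):
            dY = [phi[x] for x in S]
            algo = chooser(loo_error(dY, "majority"), loo_error(dY, "antimajority"))
            g = fit(dY, algo)
            ots = next(x for x in X if x not in S)
            err = Fraction(1, 2) if g is None else int(g != phi[ots])
            total += Fraction(err, 5 * 32)
    return total

cv      = lambda eA, eB: "majority" if eA <= eB else "antimajority"  # pick lower CV error
anti_cv = lambda eA, eB: "antimajority" if eA < eB else "majority"   # pick higher CV error
print(avg_ots_error(cv), avg_ots_error(anti_cv))  # 1/2 1/2
```

Cross-validation and anti-cross-validation tie exactly, as the example asserts: for a fixed training set the guess is a function of d alone, and the off-training-set target value is equidistributed across the uniform ensemble of targets.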
1360
David H. Wolpert
Note that these results don't directly address the issue of how accurate cross-validation is as an estimator of generalization accuracy; the object of concern here is instead the error that accompanies use of cross-validation. For a recent discussion of the accuracy question (though in a non-OTS context), see Plutowski et al. (1994). For a more general discussion of how error and accuracy-as-an-estimator are statistically related (especially when that accuracy is expressed as a confidence interval), see Wolpert (1994a). The issue of how accurate cross-validation is as an estimator of generalization accuracy is also addressed in the discussion just below Theorem (9), and in the fixed-φ results in paper two.

Example 3: Assume you are a Bayesian, and calculate the Bayes-optimal guess assuming a particular P(f) [i.e., you use the P(h | d) that would minimize the data-conditioned risk E(C | d), if your assumed P(f) were the correct P(f)]. You now compare your guess to that made by someone who uses a non-Bayesian method. Then the NFL theorems mean (loosely speaking) that there are as many actual priors (your assumed prior being fixed) in which the other person has a lower data-conditioned risk as there are for which your risk is lower.

Example 4: Consider any of the heuristics that people have come up with for supervised learning: avoid "over-fitting," prefer "simpler" to more "complex" models, "boost" your algorithm, "bag" it, etc. The NFL theorems say that all such heuristics fail as often (appropriately weighted) as they succeed. This is true despite formal arguments some have offered trying to prove the validity of some of these heuristics.

4.4 General Implications of the NFL Theorems. The primary importance of the NFL theorems is their implication that, for any two learning algorithms A and B, according to any of the distributions P(c | d), P(c | m), P(c | f, d), or P(c | f, m), there are just as many situations (appropriately weighted) in which algorithm A is superior to algorithm B as vice versa. So in particular, if we know that learning algorithm A is superior to B averaged over some set of targets F, then the NFL theorems tell us that B must be superior to A if one averages over all targets not in F. This is true even if algorithm B is the algorithm of purely random guessing.

Note that much of computational learning theory, much of sampling theory statistics (e.g., bias-plus-variance results), etc., is based on quantities like P(c | f, m), or on other quantities determined by P(c | f, m) (see Wolpert 1994a). Similarly, conventional Bayesian analysis is concerned with P(c | d). All of these quantities are addressed in the NFL theorems.

As a special case of the theorems, when there are only two possible values of L(·, ·), any two algorithms are even more tightly matched in behavior than Theorems (1) through (8) indicate. [An example of such an L(·, ·) is zero-one loss, for which there are only two possible values of L(·, ·), regardless of r.] Let C1 and C2 be the costs associated with two learning algorithms. Now P(c1 | stuff) = Σ_{c2} P(c1, c2 | stuff), and similarly for P(c2 | stuff). (Examples of "stuff" are {d, f}, {m}, f-averages of these,
etc.) If L(·, ·) can take on two values, this provides us four equations (one each for the two possible values of c1 and the two possible values of c2) in four unknowns [P(c1, c2 | stuff) for the four possible values of c1 and c2]. Normalization provides a fifth equation. Accordingly, if we know both P(c1 | stuff) and P(c2 | stuff) for both possible values of c1 and c2, we can solve for P(c1, c2 | stuff) (sometimes up to some overall unspecified parameters, since our five equations are not independent). In particular, if we know that P_C1(c | stuff) = P_C2(c | stuff), then P(c1, c2 | stuff) must be a symmetric function of c1 and c2. So for all of the "stuff"s in the NFL theorems, when L(·, ·) can take on two possible values, for any two learning algorithms, P(c1, c2 | stuff) is a symmetric function of c1 and c2 (under the appropriate uniform average).¹

All of the foregoing applies to more than just OTS error. In general, IID error can be expressed as a linear combination of OTS error plus on-training set error, where the combination coefficients depend only on d_X and π(x ∈ d_X). So generically, if two algorithms have the same on-training set behavior (e.g., they reproduce d exactly), the NFL theorems apply to their IID errors as well as their OTS errors. (See also Appendix B.)
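To spell out the counting in the two-valued case: writing the loss values as ℓ1 and ℓ2 and p_ij for P(c1 = ℓi, c2 = ℓj | stuff) (notation mine), the known marginals and normalization give

```latex
\begin{aligned}
P(c_1=\ell_1\mid\text{stuff}) &= p_{11}+p_{12}, &\quad P(c_1=\ell_2\mid\text{stuff}) &= p_{21}+p_{22},\\
P(c_2=\ell_1\mid\text{stuff}) &= p_{11}+p_{21}, &\quad P(c_2=\ell_2\mid\text{stuff}) &= p_{12}+p_{22},\\
1 &= p_{11}+p_{12}+p_{21}+p_{22}.
\end{aligned}
```

If the two marginal distributions coincide, subtracting the first equation of the second row from the first equation of the first row gives p_{12} − p_{21} = 0, so the joint distribution is symmetric in c_1 and c_2.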
Notwithstanding the NFL theorems though, learning algorithms can differ in that (1) for particular f, or particular (nonuniform) P(f), different algorithms can have different probabilities of error (this is why some algorithms tend to perform better than others in the real world); (2) for some algorithms there is a distribution-conditioning quantity (e.g., an f) for which that algorithm is optimal (i.e., for which that algorithm beats all other algorithms), but some algorithms are not optimal for any value of such a quantity; and more generally (3) for some pairs of algorithms the NFL theorems may be met by having comparatively many targets in which algorithm A is just slightly worse than algorithm B, and comparatively few targets in which algorithm A beats algorithm B by a lot. These points are returned to in paper two.

4.5 Extensions for Nonuniform Averaging. The uniform sums over f [or φ, or P(φ)] in the NFL theorems are not necessary conditions for those theorems to hold. As an example, consider the version of the theorems for which targets are single-valued functions φ from X to Y, perhaps with output-space noise superimposed, and where one averages over priors α. It turns out that we recover the NFL result for that scenario if we average according to any distribution over the α that is invariant under relabeling of the φ. We do not need to average according to the uniform distribution, and in fact can disallow all priors that are too close to the uniform prior. More formally, we have the following variant of Theorem (7), proven in Appendix C:

¹For more than two possible values of L(·, ·), it is not clear what happens. Nor is it clear how much of this carries over to costs C' (see Section 3) rather than C.
Corollary 3. Assume OTS error, a vertical P(d | φ), homogeneous loss L, and a homogeneous test-set noise process. Let α index the priors P(φ), and let G(α) be a distribution over α. Assume G(α) is invariant under the transformation of the priors α induced by relabeling the targets φ. Then the average according to G(α) of P(c | m, α) equals Λ(c)/r.

As a particular example of this result, define α* to be the uniform prior, that is, the vector all of whose components are equal. Then one G(α) that meets the assumption in Corollary (3) is the one that is constant over α except that it excludes all vectors α lying within some L2 distance of α* [i.e., one G(α) that meets the assumption is the one that excludes all priors α that are too close to being uniform]. This is because rearranging the components of a vector does not change the distance between that vector and α*, so any G(α) that depends only on that distance obeys the assumption in Corollary (3).

Combined with Corollary (3), this means that G(α) can have structure (indeed, a huge amount of structure) and we still get NFL. Alternatively, the set of allowed priors can be tiny, and restricted to priors α with a lot of structure (i.e., to priors lying far from the uniform prior), and we still get NFL. Loosely speaking, there are just as many priors that have lots of structure for which your favorite algorithm performs worse than randomly as there are for which it performs better than randomly. An open question is whether the condition on G(α) in Corollary (3) is a necessary condition to have the average according to G(α) of P(c | m, α) equal Λ(c)/r.

Interestingly, we do not have the same kind of result when considering averages over targets f of P(c | f, m) rather than averages over α of P(c | m, α).
This is because there is no such thing as a "uniform f" that we can restrict the average away from with the same kind of implications as restricting an average away from a uniform prior. However, by Theorem (2), for any pair of algorithms, there are targets that "favor" the first of the two algorithms, and there are targets that favor the second. So by choosing from both sets of targets, we can construct many distributions P(f) that have a small support and such that the average of P(c | f, m) according to P(f) is the same for both algorithms. Indeed, an interesting open question is characterizing the set of such P(f) for any particular pair of algorithms.

4.6 On Uniform Averaging. The results of the preceding subsection notwithstanding, it is natural to pay a lot of attention to the original uniform-average forms of the NFL theorems. When considering those forms, it should be kept in mind that the uniform averages over f [or φ, or P(φ)] were not chosen because there is strong reason to believe that all f are equally likely to arise in practice. Indeed, in many respects it is absurd to ascribe such a uniformity over possible targets to the real
world. Rather, the uniform sums were chosen because such sums are a useful theoretical tool with which to analyze supervised learning. For example, the implication of the NFL theorems that there is no such thing as a general-purpose learning algorithm that works optimally for all f / P(φ) is not too surprising. However, even if one already believed this implication, one might still have presumed that there are algorithms that usually do well and those that usually do poorly, and that one could perhaps choose two algorithms so that the first algorithm is usually superior to the second. The NFL theorems show that this is not the case.

If all fs are weighted by their associated probability of error, then for any two algorithms A and B there are exactly as many fs for which algorithm A beats algorithm B as vice versa. Now if one changes the weighting over fs to not be according to the algorithm's probability of error, then this result would change, and one would have a priori distinctions between algorithms. However, a priori, the change in the result could just as easily favor either A or B.

Accordingly, claims that "in the real world P(f) is not uniform, so the NFL results do not apply to my favorite learning algorithm" are misguided at best. Unless you can prove that the nonuniformity in P(f) is well-matched to your favorite learning algorithm (rather than being "anti-matched" to it), the fact that P(f) may be nonuniform, by itself, provides no justification whatsoever for your use of that learning algorithm [see the inner product formula, Theorem (1), in Wolpert 1994a]. In fact, the NFL theorems for averages over priors P(φ) say (loosely speaking) that there are exactly as many priors for which any learning algorithm A beats any algorithm B as vice versa. So uniform distributions over targets are not an atypical, pathological case, out at the edge of the space. Rather, they and their associated results are the average case(!).
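The "average case" status of the uniform prior has a simple linear-algebra core: P(c | m, α) is linear in the prior vector α, and averaging any α over all relabelings of the targets yields the uniform prior. A minimal numeric sketch (the particular prior values are hypothetical):

```python
from fractions import Fraction
from itertools import permutations

# An arbitrary "structured" prior over 4 targets (components sum to 1)
alpha = [Fraction(1, 2), Fraction(1, 5), Fraction(1, 5), Fraction(1, 10)]

# Average alpha over all relabelings (permutations) of the targets
perms = list(permutations(range(4)))
avg = [sum(alpha[p[i]] for p in perms) / len(perms) for i in range(4)]
print(avg)  # [Fraction(1, 4), Fraction(1, 4), Fraction(1, 4), Fraction(1, 4)]
```

Since each target index lands in each position equally often across the permutations, the permutation average of any prior is the uniform prior; by linearity, averaging P(c | m, α) under any relabeling-invariant G(α) therefore reproduces the uniform-prior value.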
There are just as many priors for which your favorite algorithm performs worse than pure randomness as for which it performs better. [Recall the discussion just below Theorem (8).] So for the learning scenarios considered in this section (zero-one loss, etc.), the burden is on the user of a particular learning algorithm. Unless they can somehow show that P(φ) is one of the ones for which their algorithm does better than random, rather than one of the ones for which it does worse, they cannot claim to have any formal justification for their learning algorithm. In fact, if you press them, you find that in practice very often people's assumptions do not concern P(φ) at all, but rather boil down to the statement "okay, my algorithm corresponds to an assumption about the prior over targets; I make that assumption." This is unsatisfying enough as a formal justification as it stands. Unfortunately though, for many algorithms, no one has even tried to write down that set of P(φ) for which their algorithm works well. This puts the purveyors of such statements in the awkward position of invoking an unknown assumption. (Moreover, for some algorithms one can show that there is no assumption solely
concerning targets that justifies that algorithm in all contexts. This is true of cross-validation, for example; see paper two.)

Given this breadth of the implications of the uniform-average cases, it is not surprising that uniform distributions have been used before to see what one can say a priori about a particular learning scenario. For example, the "Ugly Duckling Theorem" (Watanabe 1985) can be viewed as (implicitly) based on a uniform distribution. Another use of a uniform distribution, more closely related to the uniform distributions occurring in this paper, appears in the "problem-averaging" work of Hughes (1968). [See Waller and Jain (1978) as well for a modern view of the work of Hughes.] The words of Duda and Hart (1973) describing that work are just as appropriate here: "Of course, choosing the a priori distribution is a delicate matter. We would like to choose a distribution corresponding to the class of problems we typically encounter, but there is no obvious way to do that. A bold approach is merely to assume that problems are 'uniformly distributed'. Let us consider some of the implications (of such an assumption)."

In this regard, note that you really would need a proof based completely on first principles to formally justify some particular (nonuniform) P(f). In particular, you cannot use your "prior knowledge" (e.g., that targets tend to be smooth, that Occam's razor usually works, etc.) to set P(f) without making additional assumptions about the applicability of that "knowledge" to future supervised learning problems. This is because that "prior knowledge" is ultimately an encapsulation of two things: the data set of your experiences since birth, and the data set of your genome's experiences in the several billion years it has been evolving. So if you are confronted with a situation differing at all (!) from the previous experiences of you and/or your genome, then you are in an OTS scenario.
Therefore the NFL theorems apply, and you have no formal justification for presuming that your "prior knowledge" will apply off-training set (i.e., in the future). An important example of this is the fact that even if your prior knowledge allowed you to generalize well in the past, this provides no assurances whatsoever that you can successfully apply that knowledge to some current inference problem. The fact that a learning algorithm has been used many times with great success provides no formal (!) assurances about its behavior in the future.² After all, assuming that how well you generalized in the past carries over to the present is formally equivalent to (a variant of) cross-validation; in both cases, one tries to extrapolate from generalization accuracy on input points for which we now know what the correct answer was, to generalization behavior in general.

Finally, it is important to emphasize that results based on averaging uniformly over f/φ/P(φ) should not be viewed as normative. The uniform averaging enables us to reach conclusions that assumptions are needed to distinguish between algorithms, not that algorithms can be (profitably) distinguished without any assumptions. That is, if such an average ends up favoring algorithm A over B (as it might for a nonhomogeneous loss function, for example), that only means one "should" use A if one has reason to believe that all f are equally likely a priori.

²All of this is a formal statement of a rather profound (if somewhat philosophical) paradox: How is it that we perform inference so well in practice, given the NFL theorems and the limited scope of our prior knowledge? A discussion of some "head-to-head minimax" ideas that touch on this paradox is presented in paper two.
4.7 Other Peculiar Properties Associated with OTS Error. There are many other aspects of OTS error that, although not actually NFL theorems, can nonetheless be surprising. An example is that in certain situations the expected (over training sets) OTS error grows as the size of the training set increases, even if one uses the best possible learning algorithm, the Bayes-optimal learning algorithm [i.e., the learning algorithm that minimizes E(C | d); see Wolpert (1994a)]. In other words, sometimes the more data you have, the less you know about the OTS behavior of φ, on average.

In addition, the NFL theorems have strong implications for the common use of a "test set" or "validation set" T to compare the efficacy of different learning algorithms. The conventional view is that the error measured on such a set is a sample of the full generalization error. As such, the only problem with using error on T to estimate "full error" is that error on T is subject to statistical fluctuations, fluctuations that are small if T is large enough. However, if we are interested in the error for x ∉ {d ∪ T}, the NFL theorems tell us that (in the absence of prior assumptions) error on T is meaningless, no matter how many elements there are in T.

Moreover, as pointed out in Section (4) of the second of this pair of papers, use of test sets cannot correspond to an assumption only about targets [i.e., there is no P(f) that, by itself, justifies the use of test sets]. Rather, use of test sets corresponds to an assumption about both targets and the algorithms the test set is being used to choose between. Use of test sets will give incorrect results unless one has a particular relationship between the target and the learning algorithms being chosen between. In all this, even the ubiquitous use of test sets is unjustified (unless one makes assumptions). For a discussion of this point and of intuitive arguments for why the NFL theorems hold, see Wolpert (1994a).
5 The NFL Theorems and Computational Learning Theory

This section discusses the NFL theorems' implications for and relationship with computational learning theory. Define the empirical error

s ≡ Σ_{i=1}^{m} π[d_X(i)] L(d_Y(i), h[d_X(i)]) / Σ_{i=1}^{m} π[d_X(i)].
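As a sketch, this definition reads as follows for zero-one loss and a single-valued h (function and variable names here are my own, not from the text):

```python
def empirical_error(dX, dY, h, pi):
    """Weighted empirical error s of hypothesis h on training set (dX, dY)."""
    num = sum(pi[x] * int(h(x) != y) for x, y in zip(dX, dY))  # zero-one loss
    return num / sum(pi[x] for x in dX)

pi = {0: 0.5, 1: 0.25, 2: 0.25}      # hypothetical sampling distribution pi(x)
h = lambda x: 1                       # the all-1s hypothesis
print(empirical_error([0, 1, 2], [1, 0, 1], h, pi))  # 0.25
```

With π replaced by a constant, the normalization cancels and s reduces to the average misclassification rate over the training set.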
Sometimes the values π[d_X(i)] in this definition are replaced by a constant; doing so has no effect on the analysis below. As an example, for zero-one loss and single-valued h, s is the average misclassification rate of h over the training set. Note that the empirical error is implicitly a function of d and h but of nothing else (π being fixed). (For deterministic learning algorithms, this reduces to being a function only of d.) So, for example, P(s | d, f) = ∫ dh P(s | d, f, h) P(h | d) = ∫ dh P(s | d, h) P(h | d) = P(s | d).

This section first analyzes distributions over C that involve the value of s, as most of computational learning theory does. Then it analyzes OTS behavior of "membership queries" algorithms and also of "punting" algorithms (those that may refuse to make a guess), algorithms that are also analyzed in computational learning theory.

5.1 NFL Theorems Involving Empirical Error. Some of the NFL theorems carry over essentially unchanged if one conditions on s in the distribution of interest. This should not be too surprising. For example, consider the most common kind of learning algorithms, deterministic ones that produce single-valued hs. For such learning algorithms, the training set d determines the hypothesis h and therefore determines s. So specifying s in addition to d in the conditioning statement of the probability distribution provides no information not already contained in d. This simple fact establishes the NFL theorem for P(c | f, d, s), for these kinds of learning algorithms. More generally, first follow along with the derivation of Lemma (1), to get
where use was made of the identities P(y_F | q, f, s) = P(y_F | q, f) and P(q | d, s) = P(q | d). (Both identities follow from the fact that P_{A|B,S,D,H}[a | b, s(d, h), d, h] = P(a | b, d, h) for any variables A and B.) Continuing along with the logic that resulted in Theorem (1), we arrive at the following analogue of Theorem (1) (which holds even for nondeterministic learning algorithms, capable of guessing non-single-valued hs):

For homogeneous loss L, the uniform average over all f of P(c | f, d, s) equals Λ(c)/r.
Unfortunately, one cannot continue paralleling the analysis in Section (3) past this point, to evaluate quantities like the uniform average over all f of P(c | f, s, m). The problem is that whereas P(d | f, m) is independent of f(x ∉ d_X) (for a vertical likelihood), the same need not be true of P(d | f, s, m). Indeed, often there are fs for which P(c_OTS | f, s, m) is not defined; for no d sampled from that f will an h be produced that has error s with that d. In such scenarios the uniform f-average of P(c_OTS | f, s, m) is not defined. Moreover, the set of f for which P(c_OTS | f, s, m) is defined may vary with s. The repercussions of this carry through for any attempt to create s-conditioned analogs of the NFL theorems. (A counterintuitive example of how the NFL theorems need not hold for s-conditioned distributions is presented in Appendix C.)

In fact, it is hard to say anything general about P(c | f, s, m). In particular, it is not always the (peculiar) case that higher s results in lower c_OTS if f is fixed, as in the example in Appendix C. To see this, consider the scenario given there with a simple change in the learning algorithm. For the new learning algorithm, if all input elements of the training set, d_X, are in some region Z, then a hypothesis h is produced that happens to equal the target f, whereas for any other d_X's, there are errors both on and off d_X. So if s = 0, we know that c_OTS = 0. But if s > 0, we know that c_OTS > 0; raising s from 0 has raised expected c_OTS.

Now consider P(c | s, d) for uniform P(f), where it is implicitly assumed that for at least one h for which P(h | d) ≠ 0, the empirical error is s, so P(s, d) ≠ 0. For this quantity we do have an NFL result that holds for any learning algorithm (see Appendix C):

Theorem 9. For homogeneous L, OTS error, a vertical likelihood, and uniform P(f), P(c | s, d) = Λ(c)/r.

The immediate corollary is that for homogeneous L, OTS error, a vertical likelihood, and uniform P(f), P(c | s, m) = Λ(c)/r, independent of the learning algorithm. It is interesting to note that a uniform P(f) can give NFL for P(c | s, m) even though a uniform average over f of P(c | f, s, m) does not. This illustrates that one should exercise care in equating the basis of NFL for f-conditioned distributions [Theorem (2)] with having a uniform prior.

An immediate question is how Theorem (9) can hold despite the example above where as s shrinks, E(C_OTS | f, s, m) grows, for any target f. The answer is that P(c | s,
m) = ∫ df P(c | f, s, m) P(f | s, m). Even if for any fixed target f the quantity P(c | f, s, m) gets biased toward lower cost c as the empirical error s is raised, this does not mean that the integral exhibits the same behavior.

As an aside, it should be noted that the only property of s needed by Theorem (9) or its corollary is that P(s | d, f) = P(s | d). In addition to holding for the random variable S, this property will hold for any random variable σ that is a function only of d for the algorithms under consideration. So in particular, we can consider using cross-validation to choose among a set of one or more deterministic algorithms. Define σ as the cross-validation errors of the algorithms involved. Since for a fixed set of deterministic algorithms σ is a function solely of d, we see that for a uniform P(f), σ is statistically independent of C; there is no information contained in the set of cross-validation errors that has bearing on generalization error. In this sense, unless one makes an explicit
assumption for P(f), cross-validation error has no use as an estimate of generalization error.

5.2 Compatibility with Vapnik-Chervonenkis Results. The fact that P(c | s, m) = Λ(c)/r (under the appropriate conditions) means that P(c | s, m) = P(c | m) under those conditions [see Corollary (1)]. This implies that P(s | c, m) is independent of cost c. So C_OTS and empirical error S are statistically independent, for uniform P(f), homogeneous L, and a vertical likelihood. Indeed, in Appendix B in Wolpert (1992) there is an analysis of the case where we have a uniform sampling distribution π(·), zero-one loss, binary Y, and n ≫ m (so C_OTS ≈ C_IID; see Appendix B of this paper). It is there proven that E(C_IID | s, m) = 1/2, independent of s. In accord with this, one expects that C_OTS and S are independent for uniform P(f).

On the other hand, Vapnik-Chervonenkis (uniform convergence) theory tells us that P(c_IID − s | m) is biased toward small values of c_IID − s for low-VC-dimension generalizers and large m. This is true for any prior P(f), and therefore in particular for a uniform prior. It is also true even when n ≫ m, so that C_OTS and C_IID closely approximate each other. It should be emphasized that there is no contradiction between these VC results and the NFL theorems. Independence of s and c_OTS does not imply that s and c_OTS are likely to differ significantly. For example, both the VC results and the NFL theorems would hold if for many f, P(c_OTS | f, m) and P(s | f, m) were independent but were both tightly clumped around the same value, ξ.

Now let us say we have an instance of such a "clumping" phenomenon, but do not know ξ (ξ being determined by (the unknown) f, among other things). We might be tempted to take the observed value of s as an indicator of the likely value of ξ. In turn, we might wish to view this likely value of ξ as an indicator of the likely value of c_OTS.
In this way, having observed a particular value of s, we could infer something about c_OTS (e.g., that it is unlikely to differ from that observed value of s). However, Theorem (9) says that this reasoning is illegal [at least for uniform P(f)]. Statistical independence is statistical independence; knowing the value of s tells you nothing whatsoever about the value of c_OTS (see Wolpert 1994a for further discussion of how independence of s and c_OTS is compatible with the VC theorems).

Intuitively, many of the computational learning theory results relating empirical error s and generalization error c_IID are driven by the fact that s is formed by sampling c_IID (see Wolpert 1994a). However, for OTS error the empirical error s cannot be viewed as a sample of c_OTS. Rather, s and c_OTS are on an equal footing. Indeed, for single-valued targets and hypotheses, and no noise, s and c_OTS are both simply the value c_IID has when restricted to a particular region in X. (The region is d_X for s, and X − d_X for c_OTS.) In this sense, there is symmetry between s and c_OTS (symmetry absent for s
and chD). Given this, it should not be surprising that for uniform P(f), the value of s tells us nothing about the value of cbTs and vice versa. 5.3 Implications for Vapnik-Chervonenkis Results. The s-independence of the results presented above has strong implications for the uniform convergence formalism for investigating supervised learning (Vapnik 1982; Vapnik and Bottou 1993; Anthony and Biggs 1992; Natarajan 1991; Wolpert 1994a). Consider zero-one loss, where the empirical error s is very low and the training set size m is very large. Assume that our learning algorithm has a very low VC dimension. Since s is low and m large, we might hope that that low VC dimension confers some assurance that our generalization error will be low, independent of assumptions concerning the target. (This is one common way people try to interpret the VC theorems.) However according to the results presented above, low s, large m, and low VC dimension, by themselves, provide no such assurances concerning OTS error (unless one can somehow a priori rule out a uniform P(f)-not to mention rule out any other prior having even more dire implications for generalization performance). This is emphasized by the example given above where a tight confidence interval on the probability of cbTSdiffering from s arises solely from P(cbTSI m ) and P(s I m ) being peaked about the same value; s and cbTSare statistically independent, so knowing s tells you nothing concerning cbTS. Indeed, presuming cbTs is small due only to the fact that s, m, and the learning algorithm’s VC dimension are small can have disastrous real-world consequences (see the example concerning “We-Learn-It Inc.” in Wolpert 1994a). Of course, there are many other conditioning events one could consider besides the ones considered in this paper. And, in particular, there are many such events that involve empirical errors. For example, one might investigate the behavior of the uniformf-average of P(c I sA. 
s_B, m, f), where s_A and s_B are the empirical errors for the two algorithms A and B considered in Example (1) in Section (3). It may well be that for some of these alternative conditioning events involving empirical errors, one can find a priori distinctions between learning algorithms, dependences on s values, or the like.

Although such results would certainly be interesting, one should be careful not to ascribe too much practical significance to them. In the real world, it is almost always the case that we know d and h in full, not simply functions of them like the empirical error. In such a scenario, it is hard to see why one would be concerned with a distribution of the form P[c | function(d), h], as opposed to distributions of the form P(c | d) [or perhaps P(c | d, h), or the f-average of P(c | d, f), or some such]. So since the NFL theorems say there is no a priori distinction between algorithms as far as P(c | d) is concerned, it is hard to see why one should choose between algorithms based on distributions of the form P[c | function(d), h], if one does indeed know d in full.
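The statistical independence of s and c_OTS under a uniform P(f) can be checked directly by simulation. The following is a minimal sketch, not from the paper: a toy majority-vote learner, |X| = 10, binary Y, zero-one loss, with all parameters invented for illustration. Whatever empirical error s the learner achieves, the conditional mean of the OTS error stays at the random-guessing value of 1/2.

```python
import random

random.seed(0)
n, m = 10, 5                       # |X| = n; m distinct training inputs
train_x = list(range(m))           # fixed dX for simplicity
ots_x = list(range(m, n))          # off-training-set inputs

def learn(d_y):
    # toy learner: guess the majority training label everywhere
    return 1 if sum(d_y) * 2 > len(d_y) else 0

by_s = {}                          # empirical error s -> observed OTS errors
for _ in range(40000):
    f = [random.randint(0, 1) for _ in range(n)]         # uniform P(f)
    d_y = [f[x] for x in train_x]
    h = learn(d_y)
    s = sum(h != y for y in d_y) / m                     # empirical error
    c_ots = sum(h != f[x] for x in ots_x) / len(ots_x)   # OTS error
    by_s.setdefault(s, []).append(c_ots)

for s in sorted(by_s):
    vals = by_s[s]
    print(f"s = {s:.1f}: mean c_OTS = {sum(vals) / len(vals):.3f} over {len(vals)} trials")
```

Every line of output shows a mean OTS error near 0.5: conditioning on s, even s = 0, does not move the OTS error away from random guessing.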
5.4 Implications of the NFL Theorems for Active Learning Algorithms.

Active learning (aka "query-based learning" or "membership queries") is where the learner decides what the points dX will be. Usually this is done dynamically; as one gets more and more training examples, one uses those examples to determine the "optimal" next choices of dX(i). As far as the EBF is concerned, the only difference between active learning and traditional supervised learning is in the likelihood. Rather than IID likelihoods like that in equation (3.1), in active learning each successive dX(i) is a function of the (i − 1) pairs {dX(j = 1 ... i − 1), dY(j = 1 ... i − 1)}, with the precise functional dependence determined by the precise active learning algorithm being used. So long as it is true that P[dY(m) | dX(m), f] is independent of f[x ∉ dX(m)], active learning has a vertical likelihood (see Appendix C). So all of the negative implications of the NFL theorems apply just as well to active learning as to IID-likelihood learning, and in particular apply to the kinds of active learning discussed in the computational learning community.

5.5 Implications of the NFL Theorems for "Punting" Learning Algorithms.

Some have advocated using algorithms that have an extra option besides making a guess. This option is to "punt," i.e., refuse to make a guess. As an example, an algorithm might choose to punt because it has low confidence in its guess (say, for VC theory reasons). It might appear that, properly constructed, such algorithms could avoid making bad guesses. If this were the case, it would be an assumption-free way of ensuring that when one guesses, the guesses are good. (One would have traded in the ability to always make a guess to ensure that the guesses one does make are good ones.) In particular, some have advocated using algorithms that add elements to d adaptively until (and if) they can make what they consider to be a safe guess.
However, the simple fact that a particular punting algorithm has a small probability of making a poor guess, by itself, is no reason to use that algorithm. After all, the completely useless algorithm that always punts has zero probability of making a poor guess. Rather, what is of interest is how well the algorithm performs when it does guess, and/or how accurate its punt-signal warning is as an indicator that making a guess would result in large error.

To analyze this, I will slightly modify the definition of punting algorithms so that they always guess, but also always output a punt / no punt signal (and perhaps ask for more training set elements), based deterministically only on the d at hand. The issue at hand then is how the punt / no punt signal is statistically correlated with C.

Examine any training set d for which some particular algorithm outputs a no punt signal. By the NFL theorems, for such a d, for uniform P(f), a vertical P(d | f), and a homogeneous OTS error, P(c | d) is the same as that of a random generalizer, i.e., under those conditions, P(c | d, no punt) = Λ(c)/r. As a direct corollary, P(c | m, no punt) = Λ(c)/r. It
follows that P(c | no punt) = Λ(c)/r (assuming the no punt signal arises while OTS error is still meaningful, so m' < n). Using the same kind of reasoning, though, we also get P(c | punt) = Λ(c)/r, etc. So there is no statistical correlation between the value of the punt signal and OTS error. Unless we assume a nonuniform P(f), even if our algorithm "grows" d until there is a no punt signal, the value of the punt / no punt signal tells us nothing about C.

Similar conclusions follow from comparing a punting algorithm to its "scrambled" version, as in the analysis of nonhomogeneous error (see paper two). In addition, let A and B be two punting algorithms that are identical in when they decide to output a punt signal, but B guesses randomly for all test inputs q ∉ dX. Then, for the usual reasons, A's distribution over OTS error is, on average, the same as that of B, i.e., no better than random. This is true even if we condition on having a no punt signal.

One nice characteristic of some punting algorithms (the characteristic exploited by those who advocate such algorithms) is that there can be some prior-free assurances associated with them. As an example, for all targets f, the probability of such an algorithm guessing and making an error in doing so is very small [see classes (1) and (2) below]: for all f, for sufficiently large m and nonnegligible ε, P(c_OTS > ε, no punt | f, m) is tiny. However, P(c_OTS > ε, no punt | f, m) in fact equals 0 for the always-punt algorithm. So one might want to also consider other distributions like P(c_OTS > ε | no punt, f, m) or P(c_OTS < 1 − ε, no punt | f, m) to get a more definitive assessment of the algorithm's utility. Unfortunately, though, both of these distributions are highly f-dependent. (This illustrates that the f-independent aspects of the punting algorithm mentioned in the previous paragraph do not give a full picture of the algorithm's utility.)
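The claim that the punt / no punt signal is uncorrelated with OTS error under uniform P(f) can also be checked numerically. Below is a hedged sketch (the punting rule, the all-0s guesser, and all sizes are invented for illustration): the algorithm guesses the all-0s hypothesis and punts whenever a training label is nonzero, yet E(c_OTS) is about 1/2 whether or not it punts.

```python
import random

random.seed(1)
n, m = 12, 4
train_x = list(range(m))
ots_x = list(range(m, n))

stats = {"punt": [], "no punt": []}
for _ in range(50000):
    f = [random.randint(0, 1) for _ in range(n)]   # uniform P(f)
    d_y = [f[x] for x in train_x]
    signal = "punt" if any(d_y) else "no punt"     # punt if any training label is 1
    # the guess itself is always the all-0s hypothesis
    c_ots = sum(f[x] != 0 for x in ots_x) / len(ots_x)
    stats[signal].append(c_ots)

for sig in ("no punt", "punt"):
    vals = stats[sig]
    print(f"{sig:8s}: E(c_OTS) = {sum(vals) / len(vals):.3f} over {len(vals)} trials")
```

Both conditional means come out near 0.5: under a uniform prior, the signal carries no information about OTS error.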
In addition, other f-independent results hardly inspire confidence in the idea of making a guess only when there is a no punt signal. As an illustration, restrict things so that both hypotheses h and targets are single-valued (and therefore targets are written as functions φ), and there is no noise. Y is binary, and we have zero-one loss. Let the learning algorithm always guess the all-0s function, h*. The punt signal is given if dY contains at least one nonzero value. Then for the likelihood of (3.1), uniform π(x), and n >> m, we have the following result, proven in Appendix E:
Theorem 10. For the h* learning algorithm, for all targets φ such that φ(x) = 0 for more than m distinct x, E(C_OTS | φ, punt, m) ≤ E(C_OTS | φ, no punt, m).

For n >> m, essentially all φ meet the requirement given in Theorem (10); for such n and m, we do better to follow the algorithm's guessing advice when we are told not to than when we are told the guessing is good!

In many respects, the proper way to analyze punting algorithms is given by decision theory. First, assign a cost to punting. (Formally, this
just amounts to modifying the form of P(c | f, h, d) for the case where h and d lead to a punt signal.) This cost should not be less than the minimal no-punting cost, or the optimal algorithm is to never guess. Similarly, it should not be more than the maximal no-punting cost, or the optimal algorithm never punts. Given such a punting cost, the analysis of a particular punting algorithm consists of finding those P(f) such that E(C_OTS | m) is "good" (however defined). In lieu of such an analysis, one can find those P(f) such that E(C_OTS | no punt, m) < E(C_OTS | punt, m) (e.g., one can analyze whether priors that are uniform in some sphere centered on h* and zero outside of it result in this inequality). Such analyses, apparently never carried out by proponents of punting algorithms, are beyond the scope of this paper, however. (In addition, they vary from punting algorithm to punting algorithm.)

5.6 Intuitive Arguments Concerning the NFL Theorems and Punting Algorithms.

Consider again the algorithm addressed in Theorem (10). For this algorithm, there are two separate kinds of φ:
1. φ such that the algorithm will almost always punt for a d of sufficient size sampled from φ, or

2. φ such that the algorithm has tiny expected error when it chooses not to punt.
(Targets φ with almost no xs such that φ(x) = 1 are in the second class, and other targets are in the first class.) It might seem that this breakdown justifies use of the algorithm. After all, for large enough m', if the target is such that there is a nonnegligible probability that the algorithm does not punt, it is not in class 1, so if it does not punt the error will be tiny. Thus it would seem that whatever the target (or prior over targets), if the algorithm has not punted, we can be confident in its guess. [Similar arguments can be made when the two classes distinguish sets of P(φ)s rather than φs.]

However, if we restate this, the claim is that E(C | no punt, m) is tiny for sufficiently large m, for any prior over targets P(φ). (Note that for n >> m and non-pathological π(·), m' is unlikely to be much less than m.) This would imply, in particular, that it is tiny for uniform P(φ). However, from the preceding subsection we know that this is not true. So we appear to have a paradox.

To resolve this paradox, consider using our algorithm and observing the no punt signal. Now restate (1) and (2) carefully: In general, either

1. The target is such that for sufficiently large m' the algorithm will almost always punt, but when it does not punt, it usually makes significant errors, or

2. The target is such that the algorithm has tiny expected error when it chooses not to punt.
Now there are many more φs in class 1 than in class 2. So even though the probability of our no-punt signal is small for each of the φs in class 1 individually, when you multiply by the number of such φ, you see that the probability of being in class 1, given that you have a no-punt signal, is not worse than the probability of being in class 2, given the same signal. In this sense, the signal gains you nothing in determining which class you are in, and therefore in determining likely error.³ So at a minimum, one must assume that P(φ) is not uniform to have justification for believing the punt / no punt signal.

Now one could argue that a uniform P(φ) is highly unlikely when there is a no-punt signal, i.e., that P[no punt | α = "uniform P(φ)", m] is very small, and that this allows one to dismiss this value of α if we see a no punt signal. Formally though, α is a hyperparameter, and should be marginalized out: it is axiomatically true that P(φ) = ∫ dα P(φ | α) P(α), and is fixed beforehand, independent of the data. So the presence/absence of a punt signal cannot be used to "infer" something about P(φ), formally speaking [see the discussions of hierarchical Bayesian analysis and empirical Bayes in Berger (1985) and Bernardo and Smith (1994)].

More generally, the NFL theorems allow us to "jump a level," so that classes 1 and 2 refer to αs rather than φs. And at this new level, we again run into the fact that there are many more elements in class 1 than in class 2. To take another perspective, although the likelihood P(no punt | class, m) strongly favors class 2, the posterior need not. Lack of appreciation for this distinction is an example of how computational learning theory relies almost exclusively on likelihood-driven calculations, ignoring posterior calculations.
It may be useful to directly contrast the intuition behind the class 1-2 reasoning and that behind the NFL theorems. The class 1-2 logic says that given a φ with a nonnegligible percentage of 1s, it is hugely unlikely to get all 0s in a large random data set. Hence, so this intuitive reasoning goes, if you get all 0s, you can conclude that φ does not have a nonnegligible percentage of 1s, and therefore you are safe in guessing 0s outside the training set. The contrasting intuition: say you are given some particular training set, say of the first K points in X, together with associated Y values. Say the Y values happen to be all 0s. Obviously, without some assumption concerning the coupling of φ's behavior over the first K points in X with its behavior outside of those points, φ could have any conceivable behavior outside of those points. So the fact that it is all 0s has no significance, and cannot help you in guessing.

³All that is being argued in this discussion of classes (1) and (2) is that the absence of a punt signal does not provide a reason to believe error is good. This argument does not directly address whether the presence of a punt signal gives you reason to believe you are in class (1), and therefore is correlated with bad error. The explanation of why there is no such correlation is more subtle than simply counting the number of φs in each class. It involves the fact that there are actually a continuum of classes, and that for fixed φ, raising s (so as to get a punt signal) lowers OTS (!) error.
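The counterintuitive direction of Theorem (10) can also be seen in a small simulation. This sketch is illustrative only (the particular n, m, and φ are invented): φ has three 1s among twenty inputs, the algorithm guesses all 0s, and it punts when a 1 appears in dY. Conditioned on this fixed φ, the OTS error of the all-0s guess is lower after a punt signal than after a no-punt signal, because a no-punt signal means the training set missed all the 1s, which therefore all remain off the training set.

```python
import random

random.seed(2)
n, m = 20, 5
phi = [1, 1, 1] + [0] * (n - 3)    # target phi: three 1s, zeros elsewhere

err = {"punt": [], "no punt": []}
for _ in range(20000):
    d_x = random.sample(range(n), m)               # m distinct inputs, uniform pi
    d_y = [phi[x] for x in d_x]
    signal = "punt" if any(d_y) else "no punt"
    ots = [x for x in range(n) if x not in d_x]
    # the all-0s guesser's OTS error is the fraction of
    # off-training-set inputs where phi is 1
    err[signal].append(sum(phi[x] for x in ots) / len(ots))

for sig in ("no punt", "punt"):
    vals = err[sig]
    print(f"{sig:8s}: E(c_OTS | phi, {sig}) = {sum(vals) / len(vals):.3f}")
```

Here E(c_OTS | φ, no punt) is exactly 3/15 = 0.2, while E(c_OTS | φ, punt) is strictly smaller, matching the inequality in Theorem (10).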
It should be emphasized that none of the reasoning of this subsection directly addresses the issue of whether the punting algorithm has good "head-to-head minimax" OTS behavior in some sense (see paper two). That is an issue that has yet to be thoroughly investigated. In addition, recall that no claims are being made in this paper about what is (not) reasonable in practice; punting algorithms might very well work well in the real world. Rather, the issue is what can be formally established about how well they work in the real world without making any assumptions concerning targets.

5.7 Differences between the NFL Theorems and Computational Learning Theory.

Despite the foregoing, there are some similarities between the NFL theorems and computational learning theory. In particular, when all targets are allowed, as in the NFL theorems, PAC bounds on the error associated with s = 0 are extremely poor (Blumer et al. 1987, 1989; Dietterich 1990; Wolpert 1994a). However, there are important differences between the NFL theorems and this weak-PAC-bounds phenomenon.
1. For the most part, PAC is designed to give positive results. In particular, this is the case with the PAC bounds mentioned above. (More formally, the bounds in question give an upper bound on the probability that error exceeds some value, not a lower bound.) However, lack of a positive result is not the same as a negative result, and the NFL theorems are full-blown negative results.

2. PAC (and indeed all of computational learning theory) has nothing to say about "these data" (i.e., Bayesian) scenarios. They only concern data-averaged quantities. PAC also is primarily concerned with polynomial versus exponential convergence issues, i.e., asymptotics of various sorts. The NFL theorems hold even if one does not go to the limit, and hold even for "these data" scenarios. [See also Wolpert (1994a) for a discussion of how PAC's being exclusively concerned with convergence issues renders its real-world meaningfulness debatable, at best.]

3. The PAC bounds in question can be viewed as saying there is no universally good learning algorithm. They say nothing about the possibility of whether some algorithm 1 may be better than some other algorithm 2 in most scenarios. As a particular example, nothing in the PAC literature suggests that there are as many (appropriately weighted) fs for which a boosted learning algorithm (Drucker et al. 1993; Schapire 1990) performs worse than its unboosted version as there are for which the reverse is true.

4. The PAC bounds in question do not emphasize the importance of a vertical likelihood; they do not emphasize the importance of homogeneous noise when the target is a single-valued function; they do
not emphasize the importance of whether the loss function is homogeneous; they do not invoke "scrambling" (see paper two) for nonhomogeneous loss functions (indeed, they rarely consider such loss functions); they do not concern averaging over pairs of hs [in the sense of Section (4) of paper two], etc. In all this, they are too general. Note that this overgenerality extends beyond the obvious problem that they are "(sampling) distribution free." Rather, they are too general in that they are independent of many of the features of a supervised learning problem that are crucially important.

5. Computational learning theory does not address OTS error. Especially when m is not infinitesimal in comparison to n and/or π(x) is highly nonuniform, computational learning theory results are changed significantly if one uses OTS error (see Wolpert 1994a). And even for infinitesimal m and fairly uniform π(x), many distributions behave very differently for OTS rather than IID error (see Section 5.2).

Appendix A. Detailed Exposition of the EBF

This Appendix discusses the EBF in some detail. Since it is the goal of this paper to present as broadly applicable results as possible, care is taken in this Appendix to discuss how a number of different learning scenarios can be cast in terms of the EBF.

Notation
• In general, unless indicated otherwise, random variables are written using upper case letters. A particular instantiation value of such a random variable is indicated using the corresponding lower case letter. Note though that some quantities (e.g., parameters like the size of the spaces) are neither random variables nor instantiations of random variables, so their written case carries no significance.

• When clarity is needed, the argument of a P(·) will not be used to indicate what the distribution is; rather, a subscript will denote the distribution. For example, P_F(h) means the prior over the random variable F (targets), evaluated at the value h (a particular hypothesis). This is common statistics notation. (Note that with conditioning bars, this notation leads to expressions like "P_{A|B}(c | d)," meaning the probability of random variable A conditioned on variable B, evaluated at values c and d, respectively.)

• Also in accord with common statistics notation, "E(A | b)" will be used to mean the expectation value of A given B = b, i.e., to mean ∫ da a P(a | b). (Sums replace integrals if appropriate.) This means in particular that anything not specified is averaged over. So for example, E(A | b) = ∫ dc da a P(a | b, c) P(c | b) = ∫ dc E(A | b, c) P(c | b). When it is obvious that their value is assumed fixed and what it is fixed to, sometimes I will not specify variables in conditioning arguments.

• I will use n and r to indicate the (countable though perhaps infinite) number of elements in the set X (the input space) and the set Y (the output space), respectively. (X and Y are the only case in this paper where capital letters do not indicate random variables.) Such cases of countable X and Y are the simplest to present, and always obtain in the real world, where data are measured with finite precision instruments and are manipulated on finite size digital computers.
• A generic X value is indicated by x, and a generic Y value by y. Sometimes I will implicitly take Y and/or X to be sets of real numbers, sometimes finely spaced. (This is the case when talking about the "expected value" of a Y-valued random variable, for example.)
The Primary Random Variables

• In this paper, the "true" or "target" relationship between (test set) inputs and (test set) outputs is taken to be an X-conditioned distribution over Y [i.e., intuitively speaking, a P(y | x)]. In other words, where S_r is defined as the r-dimensional unit simplex, the "target distribution" is a random variable mapping X → S_r. Since X and Y are simply sets and not themselves random variables, this is formalized as follows:

Let F be a random variable taking values in the n-fold Cartesian product space of simplices S_r. Let f be a particular instantiation of that variable, i.e., an element in the n-fold Cartesian product space of simplices S_r. Then f can be viewed as a Euclidean vector, with indices given by a value x ∈ X and y ∈ Y. Accordingly, we can indicate a component of f by writing f(x, y). So for all x, y, f(x, y) ≥ 0, and for any fixed x, Σ_y f(x, y) = 1. This defines the random variable F. The formal sense in which this F can be viewed as an "X-conditioned distribution over Y" arises in how it is statistically related to certain other random variables (specified below) taking values in X and in Y.
• In a similar fashion, the generalizer's hypothesis is an "X-conditioned distribution over Y," i.e., the hypothesis random variable H takes values in the n-fold Cartesian product space of simplices S_r, and components of any instantiation h of H can be indicated by h(x, y). If for all x, h(x, y) is a Kronecker delta function (over y), h is called "single-valued," and similarly for f. In such a case, the distribution in question reduces to a single-valued function from X to Y.

• The value d of the training set random variable is an ordered set of m input-output pairs, or "examples." Those pairs are indicated by
dX(i), dY(i), {i = 1 ... m}. The set of all input values in d is dX, and similarly for dY. m' is the number of distinct values in dX.

• The cost C is a real-valued random variable.

The primary random variables are such target distributions F, such hypothesis distributions H, training sets D, and real-world "cost" or "error" values C measuring how well one's learning algorithm performs. They are "coupled" to supervised learning by imposing certain conditions on the relationship between them, conditions that are discussed next.
The Relationship between C, F, and H, Mediated by Q, Y_F, and Y_H

It will be useful to relate C to F and H using three other random variables. "Testing" (involved in determining the value of C) is done at the X value given by the X-valued random variable Q. Y values associated with the hypothesis and Q are given by the Y-valued random variable Y_H (with instantiations y_H), and Y values associated with the target and Q are given by the Y-valued random variable Y_F (with instantiations y_F). All of this is formalized as follows.

• The F random variable parameterizes the Q-conditioned distribution over Y_F: P(y_F | f, q) = f(q, y_F). In other words, f determines how test set elements y_F are generated for a test set point q. So Y_F and Q are the random variables whose relationship to F allows F to be intuitively viewed as an "X-conditioned distribution over Y"; see above.

• The variable Y_H meets similar requirements: P(y_H | h, q) = h(q, y_H), and this relationship between Y_H, Q, and H is what allows one to view H as intuitively equivalent to an "X-conditioned distribution over Y."

• For the purposes of this paper, the random variable cost C is defined by C = L(Y_H, Y_F), where L(·, ·) is called a "loss function." As examples, zero-one loss has L(a, b) = 1 − δ(a, b), where δ(a, b) is the Kronecker delta function, and quadratic loss has L(a, b) = (a − b)². (Zero-one loss is assumed in almost all of computational learning theory.) It is important to note, though, that in general C need not correspond to such a loss function. For example, "logarithmic scoring" has c = −Σ_y f(q, y) ln[h(q, y)], and does not correspond to any L(y_F, y_H).
• For many Ls the sum over y_F of δ[c, L(y_H, y_F)] is some function Λ(c), independent of y_H. I will call such Ls "homogeneous." Intuitively, such Ls have no a priori preference for one Y value over another. As an example, the zero-one loss is homogeneous. So is the squared difference between y_F and y_H if they are viewed as angles, L(y_F, y_H) = [(y_F − y_H) mod π]².
Note that one can talk of an L's being homogeneous for certain values of c. For example, the quadratic loss is not homogeneous over all c, but it is for c = 0. The results presented in this paper that rely on homogeneity of L usually hold for a particular c so long as L is homogeneous for that c, even if L is not homogeneous for all c.
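A direct way to see which loss is homogeneous, and for which c, is to tabulate Σ_{y_F} δ[c, L(y_H, y_F)] for each y_H. This is a small sketch with a three-element Y invented for illustration: the zero-one loss rows are constant in y_H, while the quadratic loss rows are constant only at c = 0.

```python
# Y = {0, 1, 2}; for each cost c, count how many y_F give L(y_H, y_F) = c,
# listed per y_H.  L is homogeneous at c iff the row for c is constant.
Y = [0, 1, 2]

def counts(L):
    costs = sorted({L(a, b) for a in Y for b in Y})
    return {c: [sum(L(y_h, y_f) == c for y_f in Y) for y_h in Y] for c in costs}

zero_one = lambda a, b: 0 if a == b else 1
quadratic = lambda a, b: (a - b) ** 2

print("zero-one :", counts(zero_one))    # every row constant: homogeneous
print("quadratic:", counts(quadratic))   # rows vary with y_H, except at c = 0
```

For the zero-one loss the rows are [1, 1, 1] and [2, 2, 2] (so Λ(0) = 1, Λ(1) = 2), while the quadratic loss gives [1, 2, 1] at c = 1 and [1, 0, 1] at c = 4: homogeneous only at c = 0, exactly as stated above.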
The Relationship between F, D, and Q

Note that f is a distribution governing test set data (it governs the outputs associated with q), and in general it need not be the same as the distribution governing training set data. Unless explicitly stated otherwise, though, I will assume that both training sets and test sets are generated via f.

Often when training and testing sets are generated by the same P(y | x), the training set is formed by iterating the following "independent identically distributed" (IID) procedure: Choose X values according to a "sampling distribution" π(x), and then sample f at those points to get associated Y values.⁴ More formally, this very common scheme is equivalent to the following "likelihood," presented previously as equation 3.1:
P(d | f) = P(dY | f, dX) P(dX | f)
         = P(dY | f, dX) P(dX)      (by assumption)
         = Π_{i=1}^{m} { π[dX(i)] f[dX(i), dY(i)] }
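As an illustrative sketch of this likelihood (the particular f, π, and helper names are invented, not from the paper), one can sample a training set by the IID procedure and evaluate P(d | f) by the product above:

```python
import random

random.seed(3)
r, m = 2, 4
pi = [1 / 6] * 6                   # uniform sampling distribution pi(x), n = 6
# f as an X-conditioned distribution over Y: f[x][y] = f(x, y)
f = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [1.0, 0.0], [0.3, 0.7], [0.6, 0.4]]

def sample_training_set(f, pi, m):
    """Draw d = (dX, dY) by the IID procedure of equation 3.1."""
    d_x = random.choices(range(len(pi)), weights=pi, k=m)                 # x ~ pi
    d_y = [random.choices(range(r), weights=f[x], k=1)[0] for x in d_x]   # y ~ f(x, .)
    return d_x, d_y

def likelihood(d_x, d_y, f, pi):
    """P(d | f) = prod_i pi[dX(i)] f[dX(i), dY(i)]."""
    p = 1.0
    for x, y in zip(d_x, d_y):
        p *= pi[x] * f[x][y]
    return p

d_x, d_y = sample_training_set(f, pi, m)
print("d =", list(zip(d_x, d_y)), " P(d | f) =", likelihood(d_x, d_y, f, pi))
```

Note the repeated-draw structure: dX can contain duplicates (so m' ≤ m), and the likelihood depends on f only through its values on dX, which is exactly the "vertical" property discussed next.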
There is no a priori reason for P(d | f) to have this form, however. For example, in "active learning" or "query-based" learning, successive values (as i increases) of dX(i) are determined by the preceding values of dX(i) and dY(i). As another example, typically P(d | f) will not obey equation (3.1) if testing and training are not governed by the same P(y | x). (Recall that f governs the generation of test sets.) To see this, let t be the random variable P(y | x) governing the generation of training sets. Then P(d | f) = ∫ dt P(d | t) P(t | f). Even if P(d | t) = Π_{i=1}^{m} { π[dX(i)] t[dX(i), dY(i)] }, unless P(t | f) is a delta function about t = f, P(d | f) need not have the form specified in equation 3.1.
• I will say that P(d | f) is "vertical" if it is independent of the values of f(x ∉ dX). Any likelihood of the form given in equation (3.1) is vertical, by inspection. In addition, as discussed in Section 5, active

⁴In general, π itself could be a random variable that can be estimated from the data, that is perhaps coupled to other random variables (e.g., f), etc. However, here I make the usual assumption in the neural net and computational learning literature that π is fixed. This is technically known as a "filter likelihood," and has powerful implications (see Wolpert 1994b).
learning usually has a vertical likelihood. However, some scenarios in which t ≠ f do not have vertical likelihoods.⁵

• In the case of "IID error" (the conventional error measure), P(q | d) = π(q). In the case of OTS error, P(q | d) = [δ(q ∉ dX) π(q)] / [Σ_q δ(q ∉ dX) π(q)], where δ(z) = 1 if z is true, 0 otherwise. Strictly speaking, OTS error is not defined when m' = n. Where appropriate, subscripts OTS or IID on c will indicate which kind of P(q | d) is being used.
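The two choices of P(q | d) can be written out explicitly. This is a small sketch with invented numbers (the function name and the particular π are illustrative only): the OTS version simply zeroes π on dX and renormalizes, which is undefined when m' = n.

```python
def p_q_given_d(pi, d_x, ots):
    """P(q | d): pi(q) for IID error; pi renormalized off dX for OTS error."""
    if not ots:
        return list(pi)                     # IID: P(q | d) = pi(q)
    mask = [0.0 if q in set(d_x) else pi[q] for q in range(len(pi))]
    z = sum(mask)                           # zero (undefined) when m' = n
    return [w / z for w in mask]

pi = [0.4, 0.3, 0.2, 0.1]
d_x = [0, 2]
print("IID:", p_q_given_d(pi, d_x, ots=False))
print("OTS:", p_q_given_d(pi, d_x, ots=True))   # mass only on q not in dX
```

With these numbers the OTS distribution puts weight 0.75 and 0.25 on the two off-training-set points, and zero weight on dX.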
Function + Noise Targets

• In this paper I will consider in some detail those cases where we only allow those f that can be viewed as some single-valued function φ taking X to Y, with a fixed noise process in Y superimposed.⁶ To do this, I will (perhaps only implicitly) fix a noise function N that is a probability distribution over Y, conditioned on X × Y; N is a probability distribution over y_F, conditioned on the values of q and φ(q). [Note that there are r^n such functions φ(·).]
Given N(·), each φ specifies a unique f_φ, via P(y_F | f_φ, q) = f_φ(q, y_F) = N[y_F | q, φ(q)] = P(y_F | q, φ). Accordingly, all the usual rules concerning f apply as well to φ. [For example, P(h | d, φ) = P(h | d).] When I wish to make clear what φ sets f, I will write f_φ, as above; φ simply serves as an index on f. [In general, depending on N(·), it might be that more than one φ labels the same f, but this will not be important for the current analysis.] So when I say something like "vertical P(d | φ)," it is implicitly understood that I mean vertical P(d | f_φ).

• When I say that I am "only allowing" these kinds of f, I will mean that whenever "f" is written, it is assumed to be related to a φ in this manner; all other f implicitly have an infinitesimal prior probability.

• Note that the N(·) introduced here is the noise process operating in the generation of the test set, and need not be the same as the noise
⁵As an example, assume that f is some single-valued function from X to Y, φ, so that P(y_F | f, q) = δ[y_F, φ(q)]. However, assume that d is created by corrupting φ with both noise in X and noise in Y. This can be viewed as a "function + noise" scenario where the noise present in generating the training set is absent in testing. (This case is discussed in some detail below.) As an example of such a scenario, viewing any particular pair of X and Y values from the training set as random variables X_t and Y_t, one might have Y_t = Σ_{X'} γ(X_t, X') φ(X') + ε, where X' is a dummy X variable, γ(·, ·) is a convolutional process giving noise in X, and ε is a noise process in Y. (Strictly speaking, this particular kind of Y-noise requires that r = ∞, as otherwise Σ_{X'} γ(x, X') φ(X') + ε might not lie in Y.) For this scenario, t ≠ f. In addition, P(d | f) does not have the form given in equation 3.1. In particular, due to the convolution term, P(d | f) will depend on the values of f = φ for x ∉ dX; the likelihood for this scenario is not vertical.

⁶Noise in X of the form mentioned in footnote 5 will not be considered in this paper. The extension to analyze such noise processes is fairly straightforward, however.
process in the generation of the training set. As an example, it is common in the neural net literature to generate the training set by adding noise to a single-valued function from X to Y, φ(·), but to measure error by how well the resulting h matches that underlying φ(·), not by how well y_H values sampled from h match y_F values formed by adding noise to φ(·). In the φ-N terminology, this would mean that although P(d | f) may be formed by corrupting some function φ(·) with noise (in either X and/or Y), P(y_F | f, q), which is used to measure test set error, is determined by a noise-free N(·), N[y_F | q, φ(q)] = δ[y_F, φ(q)].

• Of special importance will be those noise processes for which, for each q, the uniform φ-average of P(y_F | q, φ) is independent of y_F. (Note this does not exclude q-dependent noise processes.) I will call such a (test-set) noise process "homogeneous." Intuitively, such noise processes have no a priori preference for one Y value over another. As examples, the noise-free testing mentioned just above is homogeneous, as is a noise process that, when it takes in a value of φ(q), produces the same value with probability z and all other values with (identical) probabilities (1 − z)/(r − 1).
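The homogeneity of the z-noise process just described is easy to verify: for each y_F, the uniform average over the possible values of φ(q) comes out to 1/r. A minimal sketch, with r and z invented for illustration:

```python
r, z = 4, 0.7                      # |Y| = r; noise keeps phi(q) with probability z

def N(y_f, v):
    # N[y_F | q, phi(q) = v]: same value with prob z, others uniform
    return z if y_f == v else (1 - z) / (r - 1)

# uniform phi-average of P(y_F | q, phi): only phi(q) matters,
# so average N over the r possible values v of phi(q)
avg = [sum(N(y_f, v) for v in range(r)) / r for y_f in range(r)]
print(avg)   # constant in y_F, so this noise process is homogeneous
```

Each entry of the average is (z + (r − 1)(1 − z)/(r − 1))/r = 1/r, independent of y_F, which is exactly the homogeneity condition.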
Coupling All This to Supervised Learning

• Any (!) learning algorithm (or "generalizer") is simply a distribution P(h | d). It is "deterministic" if the same d always gives the same h [i.e., if for fixed d, P(h | d) is a delta function about one particular h].

• There are many equalities that are assumed in supervised learning but that do not merit explicit delineation. For example, it is implicitly assumed that P(h | q, d) = P(h | d), and therefore that P(q | d, h) = P(q | d).

• One assumption that does merit explicit delineation is that P(h | f, d) = P(h | d) (i.e., the learning algorithm only sees d in making its guess, not f). This means that P(h, f | d) = P(h | d) P(f | d), and therefore P(f | h, d) = P(f | d).
As an example of the importance of this assumption, note that it implies that P(y_F | y_H, d, q) = P(y_F | d, q). Proof: Expand P(y_F | y_H, d, q) ∝ ∫ df f(q, y_F) P(f | d, q) ∫ dh h(q, y_H) P(h | f, d, q). Since P(h | f, d, q) = P(h | d), this integral is proportional to ∫ df f(q, y_F) P(f | d, q), where the proportionality constant depends on d, y_H, and q. However, ∫ df f(q, y_F) P(f | d, q) = P(y_F | d, q). Due to normalization, this means that the proportionality constant equals 1, and we have established the proposition. QED.

Our assumption does not imply that P(y_F | y_H, d) = P(y_F | d), however. Intuitively, for a fixed learning algorithm, knowing y_H and d tells you
Lack of Distinctions between Learning Algorithms
1381
something about q, and therefore (in conjunction with knowledge of d) something about y_F, that d alone does not.
The "posterior" is the Bayesian inverse of the likelihood, P(f | d). The phrase "the prior" usually refers to P(f). Some schemes can be cast into this framework in more than one way. As an example, consider softmax (Bridle 1989), where each output neuron indicates a different possible event, and the real values the neurons take in response to an input are interpreted as input-conditioned probabilities of the associated events. For this scheme one could either (1) take Y to be the set of "possible events," so that the h produced by the algorithm is not single-valued, or (2) take Y to be (the computer's discretization of) the real-valued vectors that the set of output neurons can take on, in which case h is single-valued, and Y itself is interpreted as a space of probability distributions. Ultimately, which interpretation one adopts is determined by the relationship between C and h. (Such relationships are discussed below.)
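For concreteness, here is a minimal illustration of interpretation (2): output-neuron activations mapped by softmax (Bridle 1989) to a probability vector over the possible events. The three activation values are hypothetical, chosen only for the example.

```python
import math

# Softmax: maps real-valued output-neuron activations to a probability vector.
def softmax(zs):
    exps = [math.exp(z) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

p = softmax([2.0, 1.0, 0.1])   # hypothetical activations of three output neurons
print(p, sum(p))               # nonnegative entries summing to 1
```

Under this reading each input q yields a distribution over Y, so the single-valued vector of activations doubles as an input-conditioned probability.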
"Generalization Error"
Note that

E(C | f, h, d) = Σ_{y_H, y_F, q} E(C | f, h, d, y_H, y_F, q) P(y_H, y_F, q | f, h, d).

Due to our definition of C, the first term in the sum equals L(y_H, y_F). The second term equals P(y_H | h, q, f, d, y_F) P(y_F | q, f, d, h) P(q | d, f, h). This in turn equals h(q, y_H) f(q, y_F) P(q | f, h, d). In addition, P(q | f, h, d) = P(q | d) always in this paper. Therefore

E(C | f, h, d) = Σ_{y_H, y_F, q} L(y_H, y_F) h(q, y_H) f(q, y_F) P(q | d).

In much of supervised learning, an expression like that on the right-hand side of this equation is called the "generalization error." In other words, instead of the error C used here, in much of supervised learning one uses an alternative error C′, defined by C′(f, h, d) ≡ E(C | f, h, d), i.e., P(c′ | f, h, d) = δ[c′, E(C | f, h, d)]. Note that in general, the set of allowed values of C is not the same as the set of allowed values of C′. In addition, distributions over C do not set those over C′. For example, knowing P(c | d) need not give P(c′ | d) or vice versa.⁷ However many branches of supervised

⁷If there are only two possible L(·, ·) values (for example), P(c′ | d) does give P(c | d). This is because P(c′ | d) gives E(C′ | d) = E(C | d) (see below in Appendix A), and since there are two possible costs, E(C | d) gives P(c | d). It is for more than two possible cost values that the distributions P(c′ | d) and P(c | d) do not determine one another. In fact, even if there are only two possible values of L(·, ·), so that P(c′ | d) sets P(c | d), it does not follow that P(c | d) sets P(c′ | d). As an example, consider the case where n = m′ + 2, and we have zero-one loss. Assume that given some d, P(f | d) and P(h | d) are such that either h agrees exactly with f for OTS q or the two never agree, with equal probability. This means that for zero-one OTS error, P(c | d) = δ(c, 0)/2 + δ(c, 1)/2. However we would get the same distribution if all four possible agreement relationships between h and f for the off-training-set q were possible, with equal probabilities. And in that
1382
David H. Wolpert
learning theory (e.g., much of computational learning theory) are concerned with quantities of the form "P(error > ε | ...)."⁸ For such quantities, whether one takes "error" to mean C or (as is conventional) C′ may change the results, and in general one cannot directly deduce the result for C from that for C′ (or vice versa). Where appropriate, subscripts OTS or IID on c′ will indicate which kind of P(q | d) is being used.
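The expression E(C | f, h, d) = Σ L(y_H, y_F) h(q, y_H) f(q, y_F) P(q | d) can be evaluated directly on a toy problem. In this sketch the spaces, the stochastic f and h, and P(q | d) are all hypothetical numbers invented for illustration; only the summation itself comes from the text.

```python
import itertools

# Hypothetical tiny instance: X = {0,1,2}, Y = {0,1}, zero-one loss.
X, Y = range(3), range(2)
def L(yh, yf):
    return 0 if yh == yf else 1

# f(q, y) and h(q, y): input-conditioned distributions over Y (rows sum to 1).
f = {0: [0.9, 0.1], 1: [0.2, 0.8], 2: [0.5, 0.5]}
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0]}
P_q = {0: 0.0, 1: 0.5, 2: 0.5}   # P(q | d), e.g., OTS sampling with d_X = {0}

# E(C | f, h, d) = sum over y_H, y_F, q of L(y_H,y_F) h(q,y_H) f(q,y_F) P(q|d)
err = sum(L(yh, yf) * h[q][yh] * f[q][yf] * P_q[q]
          for yh, yf, q in itertools.product(Y, Y, X))
print(err)
```

With these numbers the sum works out to 0.2 × 0.5 + 0.5 × 0.5 = 0.35, the expected off-training-set mismatch rate.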
Fortunately, most of the results derived in this paper apply equally well to both probabilities of C and probabilities of C′. For reasons of space though, I will work out the results explicitly only for C. However, note that we can immediately equate expectations of C that are not conditioned on q, y_H, or y_F with the same expectations of C′. For example,
E(C | d) = ∫ dh df E(C | f, h, d) P(f, h | d)
         = ∫ dh df C′(f, h, d) P(f, h | d)
         = ∫ dh df E(C′ | f, h, d) P(f, h | d) = E(C′ | d)
So when cast in terms of expectation values, any (appropriately conditioned) results automatically apply to C’ as well as C.
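The point that the distributions over C and C′ can differ even though their expectations agree can be seen in a two-line example. The setup below is hypothetical (two equally likely (f, h) pairs given d, zero-one loss, a single OTS query): pair A always agrees with the target, pair B agrees with probability 1/2.

```python
# Hypothetical example: two equally likely (f, h) pairs given d, zero-one loss.
# Pair A: h(q) == f(q) surely       -> C = 0 surely,    C' = E(C|f,h,d) = 0.
# Pair B: agreement with prob. 1/2  -> C in {0, 1},     C' = 1/2.
pairs = [
    {"P": 0.5, "P_C": {0: 1.0, 1: 0.0}},   # pair A
    {"P": 0.5, "P_C": {0: 0.5, 1: 0.5}},   # pair B
]

P_c  = {0: 0.0, 1: 0.0}   # distribution of C given d
P_cp = {}                 # distribution of C' given d
for pr in pairs:
    for c, p in pr["P_C"].items():
        P_c[c] += pr["P"] * p
    cp = sum(c * p for c, p in pr["P_C"].items())   # C' = E(C | f, h, d)
    P_cp[cp] = P_cp.get(cp, 0.0) + pr["P"]

E_c  = sum(c * p for c, p in P_c.items())
E_cp = sum(c * p for c, p in P_cp.items())
print(P_c, P_cp)   # different supports and different distributions...
print(E_c, E_cp)   # ...but identical expectations
```

Here C takes values in {0, 1} while C′ takes values in {0, 1/2}, yet both have expectation 1/4, matching the chain of equalities above.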
Miscellaneous
For most purposes, it is implicitly assumed that no probabilities equal zero exactly (although some probabilities might be infinitesimal). That way we never have to worry about dividing by probabilities, and in particular never have to worry about whether conditional probabilities are well-defined. So as an example, phrases like "noise-free" are taken to mean infinitesimal noise rather than exactly zero noise. Similarly, where needed, integrals over f are implicitly restricted away from fs having one or more components equal to zero.

It is important to note that in general, for nonpathological π(·), in the limit where n >> m, distributions over c_IID are identical to distributions over c_OTS. In this sense theorems concerning OTS error immediately carry over to IID error. This is proven formally in Appendix B.
second case, we would have the possibility of C′ values that are impossible in the first case (e.g., c′ = 1/2). QED.

⁸In general, whether "error" means C or C′, this quantity is of interest only if the number of values error can have is large. So for example, it is of interest for C′ if r is large and we have quadratic loss.
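The claim that distributions over c_IID and c_OTS coincide when n >> m can be made concrete by estimating how often an IID test query even lands inside d_X. This Monte Carlo sketch uses hypothetical sizes (n = 10,000, m = 5) and a uniform π.

```python
import random

# Sketch: with uniform pi over a large X (n >> m), a random IID test query
# almost never lands in d_X, so c_IID and c_OTS can differ only on a
# vanishing fraction of queries.
random.seed(1)
n, m, trials = 10_000, 5, 100_000
d_x = set(range(m))                    # a hypothetical training-input set
in_train = sum(random.randrange(n) in d_x for _ in range(trials))
print(in_train / trials)               # approx m/n = 0.0005
```

As n/m grows this overlap frequency goes to zero, which is the intuition behind the formal epsilon argument in Appendix B.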
Appendix B. Proof That Distributions Over C_IID Equal Those Over C_OTS Whenever n >> m, for Nonpathological π(·)

To prove the assertion, with slight abuse of terminology write C_IID = C_OTS π(X − d_X) + C_TS π(d_X), where "TS" means error on the training set, defined in the obvious way, and π(A) ≡ Σ_{x ∈ A} π(x) (see Wolpert et al. 1995). Then for any set of one or more random variables Z taking value z, we have

P(c_IID | z) = Σ_d ∫ df dh δ[c_IID, C_IID(f, h, d)] P(d, f, h | z).
Now again abuse terminology slightly and write

C_IID(f, h, d) = C_OTS(f, h, d) π(X − d_X) + C_TS(f, h, d) π(d_X)

where the statistical dependencies of C_OTS and C_TS are made explicit by writing them as functions. Plugging in we get

P(c_IID | z) = Σ_d ∫ df dh δ[c_IID, C_OTS(f, h, d) π(X − d_X) + C_TS(f, h, d) π(d_X)] P(d, f, h | z).

Define ε ≡ max_{d_X} π(d_X), so min_{d_X} π(X − d_X) = 1 − ε. Now whenever n >> m, so long as there are no sharp peaks in π(·), ε → 0. However, because a delta function is not a continuous function, taking the limit as ε → 0 of our expression for P(c_IID | z) is not immediately equivalent to setting the π(X − d_X) and π(d_X) inside the delta function to 1 and to 0, respectively. We can circumvent this difficulty rather easily though. To do that, first define

δ ≡ ε × max |c_TS − c_OTS|,

the maximum running over the allowed cost values.
Then write

P(c_IID | z) = Σ_d ∫ df dh δ[c_IID, C_OTS(f, h, d) + β] × P(d, f, h | z)

for some β where |β| ≤ δ. Now for nonpathological z, P(c_IID | z) is a continuous function of c_IID as n and/or π(·) are varied. (Recall, in particular, that in this paper, no event has exactly zero probability; see Appendix A.) So for such a z, for ε sufficiently small, δ → 0 and therefore β → 0, and we can approximate

P(c_IID | z) ≈ Σ_d ∫ df dh δ[c_IID, C_OTS(f, h, d)] P(d, f, h | z).

But this just equals P(c_OTS | z). So for n >> m and nonpathological z and π(·), the distribution over c_IID is the same as that over c_OTS. QED.
Appendix C. Miscellaneous Proofs

For clarity of the exposition, several of the more straightforward proofs in the paper are collected in this appendix.
Proof of Lemma 1. Write P(c | f, d) = Σ_{y_H, y_F, q} P(c | y_H, y_F, q, f, d) P(y_H, y_F, q | f, d). Rewriting the summand, we get P(c | f, d) = Σ_{y_H, y_F, q} δ[c, L(y_H, y_F)] P(y_H | y_F, f, q, d) P(y_F, q | f, d). Now P(y_H | y_F, f, q, d) = ∫ dh P(y_H | y_F, h, f, q, d) P(h | y_F, f, q, d) = ∫ dh P(y_H | h, q, d) P(h | q, d) [see point (11) in the EBF section]. This just equals P(y_H | q, d). [However it is not true in general that P(y_H | y_F, d) = P(y_H | d) (see Wolpert et al. 1995).] Plugging in gives the result. QED.
Proof of the Claim Concerning the "Random Learning Algorithm," Made Just Below Lemma (1). By Lemma 1, P(c | f, d) = Σ_{y_H, y_F, q} δ[c, L(y_H, y_F)] P(y_H | q, d) P(y_F | q, f) P(q | d). However, for OTS error q ∉ d_X, and therefore for the random learning algorithm, for all q and d in our sum, P(y_H | q, d) = 1/r, independent of y_H (recall that there are r elements in Y). If we have a symmetric homogeneous loss function, this means that we can replace Σ_{y_H} δ[c, L(y_H, y_F)] P(y_H | q, d) with Λ(c)/r. Since this is independent of d and f, P(c | d) = Λ(c)/r for all training sets d, as claimed. QED.
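The random-learning-algorithm claim can be checked by simulation. For zero-one loss, Λ(0) = 1 and Λ(1) = r − 1, so the claim predicts P(c = 0 | d) = 1/r no matter what the target is. The r value, target value, and trial count below are arbitrary choices for the sketch.

```python
import random

# Sketch of the claim: with the "random learning algorithm" (y_H drawn
# uniformly from Y at each OTS query) and zero-one loss, the cost
# distribution is Lambda(c)/r regardless of f or d.
random.seed(0)
r, n_trials = 4, 200_000
f_value = 2                        # an arbitrary fixed target output at the OTS query

hits = sum(random.randrange(r) == f_value for _ in range(n_trials))
print(hits / n_trials)             # approx Lambda(0)/r = 1/4, for any f_value
```

Changing f_value, or the (implicit) training set, leaves the empirical frequency at roughly 1/r, which is the d- and f-independence asserted in the proof.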
Proof of the "Implication of Lemma (1)," Made Just Below Lemma (1). Uniformly average the expression for P(c | f, d) in Lemma (1) over all targets f. The only place f occurs in the sum in Lemma (1) is in the third term, P(y_F | q, f). Therefore our average replaces that third term with some function func(y_F, q). By symmetry though, the uniform f-average of that third term in the sum must be the same for all test set inputs
q and outputs y_F. Accordingly func(y_F, q) is some constant. Now the sum over y_F of this constant must equal 1 [to evaluate that sum of the f-average of P(y_F | q, f), interchange the sum over y_F with the average over f]. Therefore our constant must equal 1/r. The implication claimed is now immediate. QED.

Proof of Theorem (2). We can replace the sum over all q that gives P(c | f, d) [Lemma (1)] with a sum over those q lying outside of d_X. Accordingly, for such a P(q | d), P(c | f, d) is independent of the values of f(x ∈ d_X). (For clarity, the second argument of f is being temporarily suppressed.) Noting that P(d | f) is vertical, next average both sides of our equation for P(c | f, m) uniformly over all f and pull the f-average inside the sum over d. Since P(c | f, d) and P(d | f) depend on separate parts of f [namely f(x ∉ d_X) and f(x ∈ d_X), respectively], we can break the average over f into two successive averages, one operating on each part of f, and thereby get
But since P(c | f, d) is independent of the values of f(x ∈ d_X), uniformly averaging it over all f(x ∉ d_X) is equivalent to uniformly averaging it over all f. By Theorem (1), such an average is independent of d. Therefore we can pull that average out of the sum over d, and get Theorem (2). QED.

Proof of Theorem (3). Write P(c | d) for a uniform P(f) as proportional to ∫ df P(c | d, f) P(d | f), where the proportionality constant depends on d. Break up the integral over f into an integral over f(x ∈ d_X) and one over f(x ∉ d_X), exactly as in the proof of Theorem (2). Absorb ∫ df(x ∈ d_X) P(d | f) into the overall (d-dependent) proportionality constant. By normalization, the resultant value of our constant must be the reciprocal of ∫ df(x ∉ d_X). QED.
Proof of Theorem (7). ∫ dα P(c | m, α) = ∫ dα [Σ_φ P(φ | m, α) P(c | m, α, φ)], where the integral is restricted to the rⁿ-dimensional simplex. This can be rewritten as ∫ dα [Σ_φ α_φ P(c | φ, m, α)], since we assume that the probability of φ has nothing to do with the number of elements in d. Similarly, once φ is fixed, the probability that C = c does not depend on α, so our average equals ∫ dα [Σ_φ α_φ P(c | φ, m)]. Write this as Σ_φ P(c | φ, m) [∫ dα α_φ]. By symmetry, the term inside the square brackets is independent of φ. Therefore the average over all α of P(c | m) is proportional to Σ_φ P(c | φ, m). Using Theorem (5) and normalization, this establishes Theorem (7). QED.

Proof of Corollary (3). Follow along with the proof of Theorem (7). Instead of ∫ dα α_φ, we have ∫ dα G(α) α_φ. (For present purposes, the delta and Heaviside functions that force α to stay on the unit simplex
are implicit.) By assumption, G(α) is unchanged under the bijection replacing each vector α with a new vector identical to the old, except that the components for i = φ and i = φ′ are interchanged. This is true for all φ and φ′. Accordingly, our integral is independent of φ, which suffices to prove the result. QED.

Proof of Theorem (9). To evaluate P(c | s, d) for uniform P(f), write it as ∫ df P(c | s, d, f) P(f | d, s). Next write P(f | d, s) = P(s | f, d) P(f | d) / P(s | d). Note though that P(s | f, d) = P(s | d) (see beginning of Section 5), and recall that we are implicitly assuming that P(s | d) ≠ 0. So we get P(c | s, d) = ∫ df P(c | s, d, f) P(d | f), up to an overall d-dependent proportionality constant. Now proceed as in the proof of Theorem (2) by breaking the integral into two integrals, one over f(x ∈ d_X), and one over f(x ∉ d_X). The result is P(c | s, d) = Λ(c)/r, up to an overall d-dependent proportionality constant. By normalization, that constant must equal 1. This establishes Theorem (9). QED.
Example of Non-NFL Behavior of s-Conditioned Distributions. Let P(h | d) = δ(h, h*) for some h*, let π(x) be uniform, use zero-one loss, assume a noise-free IID likelihood, and take m = 1. Then we can write E(C_OTS | s, f, m = 1) = E(C_OTS | s, f, m = 1, h = h*) = [n C_IID(f, h*) − s]/(n − 1). [Note that C_IID is independent of d, and that for zero-one loss n × C_IID(f, h*) is the number of disagreements between h* and f over all of X.] No matter what f is, this grows as s shrinks. Since C_OTS can have only two values, this means that as s grows, P(C_OTS | f, s, m = 1) gets biased toward the lower of the two possible values of C_OTS. So we do not have NFL behavior for the uniform average over f of P(c | f, s, m); that average depends on the empirical error s.
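The formula E(C_OTS | s, f, m = 1, h = h*) = [n C_IID(f, h*) − s]/(n − 1) can be verified by brute force on a small instance. The particular f and h* below are arbitrary bit strings chosen for the sketch; everything else follows the stated setup (zero-one loss, uniform π, noise-free likelihood, m = 1).

```python
# Numerical check of E(C_OTS | s, f, m=1, h=h*) = (n * C_IID - s) / (n - 1)
# for zero-one loss, uniform pi(x), noise-free likelihood, single-element d.
n = 10
f  = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]   # arbitrary target outputs over X
h_ = [0, 0, 1, 1, 1, 0, 1, 1, 0, 0]   # arbitrary fixed hypothesis h*
disagree = [int(a != b) for a, b in zip(f, h_)]
c_iid = sum(disagree) / n

checks = []
for s in (0, 1):                       # empirical error on the single training point
    # Brute force: average OTS error over all training inputs x giving that s.
    xs = [x for x in range(n) if disagree[x] == s]
    brute = sum(sum(disagree) - disagree[x] for x in xs) / (len(xs) * (n - 1))
    formula = (n * c_iid - s) / (n - 1)
    checks.append((s, brute, formula))
    print(s, brute, formula)           # the two agree for both values of s
```

Both values of s reproduce the formula exactly, and the s = 1 expectation is strictly smaller, which is the non-NFL dependence on empirical error described above.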
Proof That Active Learning Has a Vertical Likelihood. Let d_k refer to the first k input-output pairs in the training set d, and d(i) to the ith such pair. Then P(d_m | f) = P[d(m) | f, d_{m−1}] P(d_{m−1} | f) = P[d_Y(m) | d_X(m), f, d_{m−1}] P[d_X(m) | f, d_{m−1}] P(d_{m−1} | f). By hypothesis, in active learning P[d_X(m) | f, d_{m−1}] = P[d_X(m) | d_{m−1}]. So long as it is also true that P[d_Y(m) | d_X(m), f, d_{m−1}] = P[d_Y(m) | d_X(m), f] is independent of f[x ≠ d_X(m)], by induction we have a vertical likelihood.
Appendix D. Proof of Theorem (8)
The task before us is to calculate the average over all α of P(c | d, α). To that end, write the average as (proportional to) ∫ dα [Σ_φ P(φ | d, α) P(c | α, d, φ)], where as usual the integral is restricted to the rⁿ-dimensional simplex. Rewrite this integral as ∫ dα [Σ_φ P(φ | α) P(c | α, d, φ) P(d | φ, α) / P(d | α)] = ∫ dα Σ_φ {α_φ P(c | φ, d) P(d | φ) / [Σ_{φ′} α_{φ′} P(d | φ′)]}, where φ′ is a
dummy φ value. Rewrite this in turn as

Σ_φ P(c | φ, d) P(d | φ) {∫ dα α_φ / [Σ_{φ′} α_{φ′} P(d | φ′)]}.

As in the proof of Theorem (2), break up φ into two components, φ1 and φ2, where φ1 fixes the values of φ over the X values lying inside d_X, and φ2 fixes it over the values outside of d_X. We must find how the terms in our sum depend on φ1 and φ2. First, write P(c | φ, d) = Σ_h P(h | d) P(c | h, φ, d). By definition, for OTS error P(c | h, φ, d) is independent of φ1. This allows us to write P(c | φ, d) = P(c | φ2, d). Next, since we are restricting attention to vertical likelihoods, P(d | φ) depends only on φ1. So we can write the term in the curly brackets as ∫ dα α_{φ1φ2} / [Σ_{φ′} α_{φ′} P(d | φ′1)], with obvious notation. Since we are assuming that for no φ does P(d | φ) equal zero exactly, the denominator sum is always nonzero. Now change variables in the integral over α by rearranging the φ2 indices of α. In other words, φ1 and φ2 are a pair of discrete-valued vectors, and α is a real-valued vector indexed by a value for φ1 and one for φ2; transform α so that its dependence on φ2 is rearranged in some arbitrary, though invertible, fashion. Performing this transformation is equivalent to mapping the space of all φ2 vectors into itself in a one-to-one manner. The Jacobian of this transformation is 1, and the transformation does not change the functional form of the constraint forcing α to lie on a simplex (i.e., Σ_{φ1φ2″} α_{φ1φ2″} = 1 and for all φ1φ2″, α_{φ1φ2″} ≥ 0, where double-prime indicates the new φ2 indices). So expressed in this new coordinate system, the integral is ∫ dα α_{φ1φ2′} / [Σ_{φ′} α_{φ′} P(d | φ′1)], where φ2′ is a new index corresponding to the old index φ2. Since this integral must have the same value as our original integral, and since φ2′ is arbitrary, we see that that integral is independent of φ2, and therefore can only depend on the values of d and φ1. This means that we can rewrite our sum over all φ as
Σ_φ P(c | φ2, d) P(d | φ1) func1(φ1, d)

for some function "func1(·)." In other words, the α-average of P(c | d, α) is proportional to Σ_{φ2} P(c | φ2, d), where the proportionality constant depends on d. Since P(c | φ, d) = P(c | φ2, d) (see above), this sum is proportional to Σ_{φ1, φ2} P(c | φ, d) = Σ_φ P(c | φ, d). By Theorem (4), this sum equals Λ(c)/r. So the uniform α-average of P(c | d, α) = func2(d) Λ(c)/r for some function "func2(·)." Since Σ_c P(c | d, α) = 1, the sum over c values of the uniform α-average of P(c | d, α) must be independent of d (it must equal 1). Therefore func2(d) is independent of d. Since we know that Λ(c)/r is properly normalized over c, we see that func2(d) in fact equals 1. QED.
Appendix E. Proof of Theorem (10)

First use the fact that given q, d_X determines whether there is a punt signal, to write
E[C_OTS | φ, (no) punt, m] = Σ_{d_X} E(C_OTS | φ, d_X) × P[d_X | (no) punt, φ, m]    (E.1)
Next, without loss of generality, let the xs for which φ(x) = 0 be 1, ..., k, so that φ(x) = 1 for x = k + 1, ..., n. Then P(d_X | no punt, φ, m) = 0 unless all the d_X(i) ≤ k. Since π(x) is uniform, and d is ordered and perhaps has repeats, the value of P(d_X | no punt, φ, m) when all the d_X(i) ≤ k is k^(−m). Similarly, P(d_X | punt, φ, m) = 0 unless at least one of the d_X(i) > k, and when it is nonzero it equals some constant set by k and m. It's also true that E(C_OTS | φ, d_X) is not drastically different if one considers d_Xs with a different m′. Accordingly, our summand does not vary drastically between d_Xs of one m′ and d_Xs of another. Since n >> m and π(x) is uniform though, almost all of the terms in the sum have m′ = m. Pulling this all together, we see that to an arbitrarily good approximation (for large enough n relative to m), we can take m′ = m. So E.1 becomes
E[C_OTS | φ, (no) punt, m] = Σ_{d_X} E(C_OTS | φ, d_X) × P[d_X | (no) punt, m, m′ = m]    (E.2)
Now consider conditioning on "no punt," in which case all the d_X(i) ≤ k. For such a situation, for m′ = m, E(C_OTS | φ, d_X) = (n − k)/(n − m). In contrast, consider having a punt signal, in which case at least one d_X(i) > k. Now E(C_OTS | φ, d_X) ≤ (n − k − 1)/(n − m) < (n − k)/(n − m). Combining this with E.2, we get E(C_OTS | φ, punt, m) < E(C_OTS | φ, no punt, m). QED.
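The no-punt value E(C_OTS | φ, d_X) = (n − k)/(n − m) can be checked by enumeration. This sketch assumes, beyond what the excerpt spells out, that the learner always guesses 0 and that d has m distinct inputs (the m′ = m case); n, k, and m are hypothetical.

```python
from itertools import combinations

# Check of the no-punt case: phi(x) = 0 for the first k inputs, 1 otherwise,
# with a (hypothetical) learner that always guesses 0, so its OTS error is
# the fraction of phi = 1 points outside d_X.
n, k, m = 8, 5, 2
phi = [0] * k + [1] * (n - k)

vals = []
for d_x in combinations(range(k), m):          # "no punt": all d_X(i) <= k
    ots = [x for x in range(n) if x not in d_x]
    vals.append(sum(phi[x] for x in ots) / len(ots))
print(set(vals), (n - k) / (n - m))            # every such d_X gives (n-k)/(n-m)
```

Every no-punt training set yields exactly (n − k)/(n − m), while any training set containing a φ = 1 input would remove at least one error point from the OTS set, giving the strictly smaller punt-conditioned expectation used in the proof.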
Acknowledgments I would like to thank Cullen Schaffer, Wray Buntine, Manny Knill, Tal Grossman, Bob Holte, Tom Dietterich, Karl Pfleger, Mark Plutowski, Bill Macready, Bruce Mills, David Stork, and Jeff Jackson for interesting discussions. This work was supported in part by the Santa Fe Institute and by TXN Inc.
References
Anthony, M., and Biggs, N. 1992. Computational Learning Theory. Cambridge University Press, Cambridge.
Berger, J. 1985. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, Berlin.
Berger, J., and Jeffreys, W. 1992. Ockham's razor and Bayesian analysis. Am. Sci. 80, 64-72.
Bernardo, J., and Smith, A. 1994. Bayesian Theory. John Wiley, New York.
Blumer, A., et al. 1987. Occam's razor. Inform. Process. Lett. 24, 377-380.
Blumer, A., et al. 1989. Learnability and the Vapnik-Chervonenkis dimension. J. ACM 36, 929-965.
Bridle, J. 1989. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing: Algorithms, Architectures, and Applications, F. Fogelman-Soulié and J. Hérault, eds. Springer-Verlag, Berlin.
Dietterich, T. 1990. Machine learning. Annu. Rev. Comput. Sci. 4, 255-306.
Drucker, H., et al. 1993. Improving performance in neural networks using a boosting algorithm. In Neural Information Processing Systems 5, S. Hanson et al., eds. Morgan Kaufmann, San Mateo, CA.
Duda, R., and Hart, P. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
Hughes, G. 1968. On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inform. Theory IT-14, 55-63.
Kearns, M. J. 1992. Towards efficient agnostic learning. In Proceedings of the 5th Annual Workshop on Computational Learning Theory, ACM Press, New York.
Mitchell, T. 1982. Generalization as search. Artif. Intell. 18, 203-226.
Mitchell, T., and Blum, A. 1994. Course Notes for Machine Learning, CMU.
Murphy, P., and Pazzani, M. 1994. Exploring the decision forest: An empirical investigation of Occam's razor in decision tree induction. J. Artif. Intell. Res. 1, 257-275.
Natarajan, B. 1991. Machine Learning: A Theoretical Approach. Morgan Kaufmann, San Mateo, CA.
Perrone, M. 1993. Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization. Ph.D. thesis, Brown Univ., Physics Dept.
Plutowski, M. 1994. Cross-validation estimates integrated mean squared error. In Advances in Neural Information Processing Systems 6, Cowan et al., eds. Morgan Kaufmann, San Mateo, CA.
Schaffer, C. 1993. Overfitting avoidance as bias. Machine Learn. 10, 153-178.
Schaffer, C. 1994. A conservation law for generalization performance. In Machine Learning: Proceedings of the Eleventh International Conference, Cohen and Hirsh, eds. Morgan Kaufmann, San Mateo, CA.
Schapire, R. 1990. The strength of weak learnability. Machine Learn. 5, 197-227.
Vapnik, V. 1982. Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin.
Vapnik, V., and Bottou, L. 1993. Local algorithms for pattern recognition and dependencies estimation. Neural Comp. 5, 893-909.
Waller, W., and Jain, A. 1978. On the monotonicity of the performance of Bayesian classifiers. IEEE Trans. Inform. Theory IT-24, 392-394.
Watanabe, S. 1985. Pattern Recognition: Human and Mechanical. John Wiley, New York.
Weiss, S. M., and Kulikowski, C. A. 1991. Computer Systems that Learn. Morgan Kaufmann, San Mateo, CA.
Wolpert, D. 1992. On the connection between in-sample testing and generalization error. Complex Syst. 6, 47-94.
Wolpert, D. 1993. On Overfitting Avoidance as Bias. Tech. Rep. SFI TR 93-03-016.
Wolpert, D. 1994a. The relationship between PAC, the Statistical Physics framework, the Bayesian framework, and the VC framework. In The Mathematics of Generalization, D. Wolpert, ed. Addison-Wesley, Reading, MA.
Wolpert, D. 1994b. Filter likelihoods and exhaustive learning. In Computational Learning Theory and Natural Learning Systems: Volume II, S. Hanson et al., eds. MIT Press, Cambridge, MA.
Wolpert, D. 1995. On the Bayesian "Occam factors" argument for Occam's razor. In Computational Learning Theory and Natural Learning Systems: Volume III, T. Petsche et al., eds. MIT Press, Cambridge, MA.
Wolpert, D., and Macready, W. 1995. No Free Lunch Theorems for Search. Tech. Rep. SFI TR 95-02-010. Submitted.
Wolpert, D., Grossman, T., and Knill, E. 1995. Off-training-set error for the Gibbs and the Bayes optimal generalizers. Submitted.
Received August 18, 1995, accepted February 14, 1996
ARTICLE
Communicated by Steven Nowlan
The Existence of A Priori Distinctions Between Learning Algorithms

David H. Wolpert*
The Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA
This is the second of two papers that use off-training set (OTS) error to investigate the assumption-free relationship between learning algorithms. The first paper discusses a particular set of ways to compare learning algorithms, according to which there are no distinctions between learning algorithms. This second paper concentrates on different ways of comparing learning algorithms from those used in the first paper. In particular this second paper discusses the associated a priori distinctions that do exist between learning algorithms. In this second paper it is shown, loosely speaking, that for loss functions other than zero-one (e.g., quadratic loss), there are a priori distinctions between algorithms. However, even for such loss functions, it is shown here that any algorithm is equivalent on average to its "randomized" version, and in this sense still has no first-principles justification in terms of average error. Nonetheless, as this paper discusses, it may be that (for example) cross-validation has better head-to-head minimax properties than "anti-cross-validation" (choose the learning algorithm with the largest cross-validation error). This may be true even for zero-one loss, a loss function for which the notion of "randomization" would not be relevant. This paper also analyzes averages over hypotheses rather than targets. Such analyses hold for all possible priors over targets. Accordingly they prove, as a particular example, that cross-validation cannot be justified as a Bayesian procedure. In fact, for a very natural restriction of the class of learning algorithms, one should use anti-cross-validation rather than cross-validation (!).

1 Introduction
Can one get something for nothing in supervised learning? Can one get useful, caveat-free theoretical results that link the training set and the learning algorithm to generalization error, without making assumptions concerning the target? More generally, are there useful practical techniques that require no such assumptions? As a potential example

*In memory of Tal Grossman.

Neural Computation 8, 1391-1420 (1996) © 1996 Massachusetts Institute of Technology
of such a technique, note that people usually use cross-validation without making any assumptions about the underlying target, as though the technique were universally applicable. This is the second of two papers that present an initial investigation of this issue. These papers can be viewed as an analysis of the mathematical "skeleton" of supervised learning, before the "flesh" of particular priors over targets and similar problem-specific distributions is introduced. It should be emphasized that the work in these papers is very preliminary; much remains to be done. The primary mathematical tool used in these papers is off-training set (OTS) generalization error, i.e., generalization error for test sets that contain no overlap with the training set. (In the conventional measure of generalization error such overlap is allowed.) Section 2 of the first paper explains why that is an appropriate tool to use. Paper one then uses that tool to elaborate (some of) the senses in which all learning algorithms are a priori equivalent to one another. This is done by comparing algorithms via averages over targets for "homogeneous" loss functions and "vertical" likelihoods. This second paper extends the analysis to other ways of comparing learning algorithms. In particular, it considers averages over other quantities besides targets, and nonhomogeneous loss functions. This reveals more senses in which all algorithms are identical. But it also reveals some important senses in which there are "assumption-free" a priori distinctions between learning algorithms. Section 2 of this second paper reviews the mathematical formalism used in these papers. Section 3 reviews the "no free lunch" (NFL) theorems presented in paper one. It is then pointed out that the equivalence of expected errors between learning algorithms stipulated by those theorems does not mean algorithms have equivalent "head-to-head" minimax properties.
Indeed, it may be that (for example) cross-validation is head-to-head minimax superior to "anti-cross-validation" (the rule saying choose between generalizers based on which has the largest cross-validation error). If so, then in that particular sense cross-validation could be a priori justified as an alternative to anti-cross-validation.

Of course, the analysis up to this point in these papers does not rule out the possibility that there are targets for which one particular learning strategy works well compared to another one. To address this (nontrivial) issue, Section 4 of this second paper discusses the case where one averages over hypotheses rather than targets. The results of such analyses hold for all possible priors over targets, since they hold for all (fixed) targets. This allows them to be used to prove, as a particular example, that cross-validation cannot be justified as a Bayesian procedure, i.e., there is no prior for which, without regard for the learning algorithms in question, one can conclude that one should choose between those algorithms based on minimal rather than (for example) maximal
Existence of Distinctions Between Learning Algorithms
cross-validation error. In addition, it is noted that for a very natural restriction of the class of learning algorithms, one can distinguish between using minimal rather than maximal cross-validation error, and the result is that one should use maximal error (!).

All of the analysis up to this point assumes the loss function is in the same class as the zero-one loss function (which is assumed in most of computational learning theory). Section 5 discusses other loss functions. In particular, the quadratic loss function modifies the preceding results considerably; for that loss function, there are algorithms that are a priori superior to other algorithms under averaging over targets. However it is shown here that even for such loss functions no algorithm is superior to its "randomized" version, and in this sense one cannot a priori justify any particular learning algorithm, even for a quadratic loss function. Finally, Section 6 presents some open issues and future work.

2 The Extended Bayesian Formalism
These papers use the extended Bayesian formalism (EBF) (Wolpert 1992, 1994a; Wolpert et al. 1995). In the current context, the EBF is just conventional probability theory, applied to the case where one has a different random variable for the hypothesis output by the learning algorithm and for the target relationship. It is this crucial distinction that separates the EBF from conventional Bayesian analysis, and that allows the EBF (unlike conventional Bayesian analysis) to subsume all other major mathematical treatments of supervised learning, like computational learning theory, sampling theory statistics, etc. (see Wolpert 1994a).

This section presents a synopsis of the EBF. Points (2), (8), (14), and (15) below can be skipped in a first reading. A quick reference of this section's synopsis can be found in Table 1. Readers unsure of any aspects of this synopsis, and in particular unsure of any of the formal basis of the EBF or justifications for any of its (sometimes implicit) assumptions, are directed to the detailed exposition of the EBF in Appendix A of the first paper.

2.1 Overview.
1. The input and output spaces are X and Y, respectively. They contain n and r elements, respectively. A generic element of X is indicated by x, and a generic element of Y is indicated by y.

2. Random variables are indicated using capital letters. Associated instantiations of a random variable are indicated using lower case letters. Note though that some quantities (e.g., the space X) are neither random variables nor instantiations of random variables, and therefore their written case carries no significance. Only rarely will it be necessary to refer to a random variable rather than an instantiation of it. In accord with standard statistics notation, E(A | b) will be used to mean the expectation value of A given B = b, i.e., to mean ∫ da a P(a | b). (Sums replace integrals if appropriate.)

Table 1: Summary of the Terms in the EBF

The sets X and Y, of sizes n and r: The input and output space, respectively
The set d, of m X-Y pairs: The training set
The X-conditioned distribution over Y, f: The target, used to generate test sets
The X-conditioned distribution over Y, h: The hypothesis, used to guess for test sets
The real number c: The cost
The X-value q: The test set point
The Y-value yF: The sample of the target f at point q
The Y-value yH: The sample of the hypothesis h at point q
P(h | d): The learning algorithm
P(f | d): The posterior
P(d | f): The likelihood
P(f): The prior
If c = L(yF, yH), L(·,·) is the "loss function"
L is "homogeneous" if Σ_{yF} δ[c, L(yH, yF)] is independent of yH
If we restrict attention to fs given by a fixed noise process superimposed on an underlying single-valued function from X to Y, φ, and if Σ_φ P(yF | q, φ) is independent of yF, we have "homogeneous" noise
3. The primary random variables are the hypothesis X-Y relationship output by the learning algorithm (indicated by H), the target (i.e., "true") X-Y relationship (F), the training set (D), and the real world cost (C). These variables are related to one another through other random variables representing the (test set) input space value (Q), and the associated target and hypothesis Y-values, YF and YH, respectively (with instantiations yF and yH, respectively). This completes the list of random variables.

As an example of the relationship between these random variables and supervised learning, f, a particular instantiation of a target, could refer to a "teacher" neural net together with superimposed noise. This
noise-corrupted neural net generates the training set d. The hypothesis h on the other hand could be the neural net made by one's "student" algorithm after training on d. Then q would be an input element of the test set, yF and yH associated samples of the outputs of the two neural nets for that element (the sampling of yF including the effects of the superimposed noise), and c the resultant "cost" [e.g., c could be (yF − yH)²].

2.2 Training Sets and Targets.
4. m is the number of elements in the (ordered) training set d. {dX(i), dY(i)} is the set of m input and output values in d. m' is the number of distinct values in dX.

5. Targets f are always assumed to be of the form of X-conditioned distributions over Y, indicated by the real-valued function f(x ∈ X, y ∈ Y) [i.e., P(yF | f, q) = f(q, yF)]. Equivalently, where S_r is defined as the r-dimensional unit simplex, targets can be viewed as mappings f: X → S_r. Any restrictions on f are imposed by P(f, h, d, c), and in particular by its marginalization, P(f). Note that any output noise process is automatically reflected in P(yF | f, q). Note also that the equality P(yF | f, q) = f(q, yF) refers only directly to the generation of test set elements; in general, training set elements can be generated from targets in a different manner.
6. The "likelihood" is P(d | f). It says how d was generated from f. It is "vertical" if P(d | f) is independent of the values f(x, yF) for those x ∉ dX. As an example, the conventional IID likelihood is

P(d | f) = Π_{i=1}^{m} π[dX(i)] f[dX(i), dY(i)]    (2.1)

[where π(x) is the "sampling distribution"]. In other words, under this likelihood d is created by repeatedly and independently choosing an input value dX(i) by sampling π(x), and then choosing an associated output value by sampling f[dX(i), ·], the same distribution used to generate test set outputs. This likelihood is vertical. As another example, if there is noise in generating training set X values but none for test set X values, then we usually do not have a vertical P(d | f). (This is because, formally speaking, f directly governs the generation of test sets, not training sets; see Appendix A.)

7. The "posterior" usually means P(f | d), and the "prior" usually means P(f).

8. It will be convenient at times to restrict attention to fs that are constructed by adding noise to a single-valued function from X to Y, φ. For a fixed noise process, such fs are indexed by the underlying φ.
The noise process is "homogeneous" if the sum over all φ of P(yF | q, φ) is independent of yF. An example of a homogeneous noise process is classification noise that with probability η replaces φ(q) with some other value in Y, where that "other value in Y" is chosen uniformly and randomly.
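The IID likelihood of (2.1) can be made concrete in a short simulation. The following Python sketch is illustrative only (the names `sample_training_set`, `f`, and `pi` are not from the paper): it draws an ordered training set d by repeatedly sampling π(x) for an input and then sampling the target's conditional distribution f[dX(i), ·] for the paired output.

```python
import random

def sample_training_set(f, pi, m, seed=0):
    """Draw an ordered training set d of m pairs under the IID likelihood
    (2.1): each input dX(i) is sampled from pi(x), and each output dY(i)
    is then sampled from the target's conditional distribution f[dX(i), .]."""
    rng = random.Random(seed)
    xs = list(pi)
    d = []
    for _ in range(m):
        # choose an input value by sampling pi(x)
        x = rng.choices(xs, weights=[pi[v] for v in xs])[0]
        # choose the output by sampling f(x, .), the same distribution
        # used to generate test-set outputs
        ys = list(f[x])
        y = rng.choices(ys, weights=[f[x][v] for v in ys])[0]
        d.append((x, y))
    return d

# a noisy binary target: f(x, y) = P(yF = y | q = x)
f = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
pi = {0: 0.5, 1: 0.5}                # sampling distribution pi(x)
d = sample_training_set(f, pi, m=5)
```

Because the same π and f govern both training and test generation here, this likelihood is vertical in the sense of point (6).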
2.3 The Learning Algorithm.
9. Hypotheses h are always assumed to be of the form of X-conditioned distributions over Y, indicated by the real-valued function h(x ∈ X, y ∈ Y) [i.e., P(yH | h, q) = h(q, yH)]. Equivalently, where S_r is defined as the r-dimensional unit simplex, hypotheses can be viewed as mappings h: X → S_r. Any restrictions on h are imposed by P(f, h, d, c).

Here and throughout, a "single-valued" distribution is one that, for a given x, is a delta function about some y. Such a distribution is a single-valued function from X to Y. As an example, if one is using a neural net for one's regression through the training set, usually the (neural net) h is single-valued. On the other hand, when one is performing probabilistic classification (as in softmax), h is not single-valued.
10. Any (!) learning algorithm (aka "generalizer") is given by P(h | d), although writing down a learning algorithm's P(h | d) explicitly is often quite difficult. A learning algorithm is "deterministic" if the same d always gives the same h. Backprop with a random initial weight is not deterministic. Nearest neighbor is. Note that since d is ordered, "on-line" learning algorithms are subsumed as a special case.
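The deterministic/nondeterministic distinction of point (10) can be illustrated with a minimal Python sketch (the two toy algorithms below are invented for this illustration): a deterministic algorithm's P(h | d) is a delta function over hypotheses, while a randomized one's is not.

```python
import random

def nearest_neighbor(d):
    """A deterministic learning algorithm: the same d always yields the
    same hypothesis, so P(h | d) is a delta function over hypotheses.
    The hypothesis labels q with the output of the closest training input."""
    def h(q):
        x, y = min(d, key=lambda xy: abs(xy[0] - q))
        return y
    return h

def random_constant_guesser(d, rng=random):
    """A nondeterministic algorithm (cf. backprop with a random initial
    weight): repeated runs on the same d can return different hypotheses,
    so P(h | d) is spread over more than one h."""
    bias = rng.choice([0, 1])
    return lambda q: bias

d = [(0.0, 0), (1.0, 1), (2.0, 1)]
h = nearest_neighbor(d)
```

Running `nearest_neighbor` twice on the same d gives hypotheses that agree everywhere; `random_constant_guesser` need not.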
11. The learning algorithm only sees the training set d; in particular, it does not directly see the target. So P(h | f, d) = P(h | d), which means that P(h, f | d) = P(h | d) × P(f | d), and therefore P(f | h, d) = P(h, f | d)/P(h | d) = P(f | d).
2.4 The Cost and "Generalization Error".

12. For the purposes of this paper, the cost c is associated with a particular yH and yF, and is given by a loss function L(yF, yH). As an example, in regression, often we have "quadratic loss": L(yF, yH) = (yF − yH)².

L(·,·) is "homogeneous" if the sum over yF of δ[c, L(yH, yF)] is some function Λ(c), independent of yH (δ here being the Kronecker delta function). As an example, the "zero-one" loss traditional in computational learning theory [L(a, b) = 1 if a ≠ b, 0 otherwise] is homogeneous.

13. In the case of "IID error" (the conventional error measure), P(q | d) = π(q) (so test set inputs are chosen according to the same distribution that determines training set inputs). In the case of OTS error, P(q | d) = [δ(q ∉ dX) π(q)] / [Σ_q δ(q ∉ dX) π(q)], where δ(z) = 1 if z is true, 0 otherwise.
Subscripts OTS or IID on c correspond to using those respective kinds of error.

14. The "generalization error function" used in much of supervised learning is given by c' = E(C | f, h, d). (Subscripts OTS or IID on c' correspond to using those respective ways to generate q.) It is the average over all q of the cost c, for a given target f, hypothesis h, and training set d.

In general, probability distributions over c' do not by themselves determine those over c or vice versa, i.e., there is not an injection between such distributions. However the results in this paper in general hold for both c and c', although they will only be presented for c. In addition, especially when relating results in this paper to theorems in the literature, sometimes results for c' will implicitly be meant even when the text still refers to c. (The context will make this clear.)
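The two error measures of points (13) and (14) differ only in which test points q are allowed. As a hedged illustration (the function `gen_error` and the toy f, h, and d are invented for this sketch), here is c' = E(C | f, h, d) under zero-one loss for single-valued f and h, computed once with the IID choice P(q | d) = π(q) and once with the renormalized OTS choice that excludes dX:

```python
def gen_error(f, h, d, pi, ots=False):
    """c' = E(C | f, h, d) under zero-one loss for single-valued f and h.
    IID error samples q from pi over all of X; OTS error renormalizes
    pi over X - dX, as in point (13)."""
    dX = {x for x, _ in d}
    qs = [x for x in pi if not (ots and x in dX)]
    z = sum(pi[x] for x in qs)                       # normalization of P(q | d)
    return sum(pi[x] * (f[x] != h[x]) for x in qs) / z

pi = {x: 0.25 for x in range(4)}   # uniform sampling distribution pi(x)
f = {0: 0, 1: 1, 2: 1, 3: 0}       # single-valued target
d = [(0, 0), (1, 1)]               # noise-free training set on inputs {0, 1}
h = {0: 0, 1: 1, 2: 0, 3: 1}       # fits d exactly, wrong off the training set
iid = gen_error(f, h, d, pi)             # disagrees on 2 of 4 points: 0.5
ots = gen_error(f, h, d, pi, ots=True)   # disagrees on 2 of 2 OTS points: 1.0
```

The example shows why the distinction matters: an h that memorizes d perfectly can have moderate IID error yet maximal OTS error.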
15. When the size of X, n, is much greater than the size of the training set, m, probability distributions over c_IID and distributions over c_OTS become identical. (Although, as mentioned in the previous paper, distributions conditioned on c_IID can be drastically different from those conditioned on c_OTS.) This is established formally in Appendix B of the first paper.

3 The No-Free-Lunch Theorems and "Head-to-Head" Minimax Distinctions
The goal is to address the issue of how F1, the set of targets f for which algorithm A outperforms algorithm B, compares to F2, the set of f for which the reverse is true. To analyze this issue, the simple trick is used of comparing the average over f of f-conditioned probability distributions for algorithm A to the same average for algorithm B. The relationship between those averages is then used to compare F1 to F2.

Here and throughout this paper, when discussing non-single-valued fs, "A(f) uniformly averaged over all f" means ∫ df A(f) / ∫ df. Note that these integrals are implicitly restricted to those f that constitute X-conditioned distributions over Y, i.e., to the appropriate product space of unit simplices. (The details of this restricting will not matter, because integrals will almost never need to be evaluated. But formally, integrals over f are over a full rn-dimensional Euclidean space, with a series of Dirac delta functions and Heaviside functions enforcing the restriction to the Cartesian product of simplices.) Similar meanings for "uniformly averaged" are assumed if we are talking about averaging over other quantities, like φ or P(φ).
3.1 The NFL Theorems. In Wolpert (1992), it is shown that P(c | d) = ∫ df dh P(h | d) P(f | d) M_{c,d}(f, h), where so long as the loss function is symmetric in its arguments, M_{c,d}(·,·) is symmetric in its arguments. [See point (12) of the previous section.] In other words, for the most common kinds of loss functions (zero-one, quadratic, etc.), the probability of a particular cost is determined by an inner product between your learning algorithm and the posterior probability [f and h being the component labels of the d-indexed infinite-dimensional vectors P(f | d) and P(h | d), respectively]. Metaphorically speaking, how "aligned" you (the learning algorithm) are with the universe (the posterior) determines how well you will generalize.

The question arises though of how much can be said concerning a particular learning algorithm's generalization behavior without specifying the posterior (which usually means without specifying the prior). In paper one the following "no-free-lunch" (NFL) theorems are derived to partly address this issue. These theorems are valid for any learning algorithm. See paper one for a full discussion of the implications of these theorems.
Theorem 1. For homogeneous loss L, the uniform average over all f of P(c | f, d) equals Λ(c)/r.

Theorem 2. For OTS error, a vertical P(d | f), and a homogeneous loss L, the uniform average over all targets f of P(c | f, m) equals Λ(c)/r.

Theorem 3. For OTS error, a vertical P(d | f), uniform P(f), and a homogeneous loss L, P(c | d) = Λ(c)/r.

Corollary 1. For OTS error, a vertical P(d | f), uniform P(f), and a homogeneous loss L, P(c | m) = Λ(c)/r.

Theorem 4. For homogeneous loss L and a homogeneous test-set noise process, the uniform average over all single-valued target functions φ of P(c | φ, d) equals Λ(c)/r.

Theorem 5. For OTS error, a vertical P(d | φ), homogeneous loss L, and a homogeneous test-set noise process, the uniform average over all target functions φ of P(c | φ, m) equals Λ(c)/r.

Theorem 6. For OTS error, a vertical P(d | φ), homogeneous loss L, uniform P(φ), and a homogeneous test-set noise process, P(c | d) equals Λ(c)/r.

Corollary 2. For OTS error, vertical P(d | φ), homogeneous loss L, a homogeneous test-set noise process, and uniform P(φ), P(c | m) equals Λ(c)/r.

Theorem 7. Assume OTS error, a vertical P(d | φ), homogeneous loss L, and a homogeneous test-set noise process. Let α index the priors P(φ). Then the uniform average over all α of P(c | m, α) equals Λ(c)/r.
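The flavor of these theorems can be checked by brute force in a small discrete setting. The sketch below is illustrative only (all names are invented): it fixes a hypothesis h, interpreted as whatever some deterministic algorithm output on a fixed training set, and uniformly averages its expected OTS zero-one error over all single-valued targets φ. In the spirit of Theorem 4, the average is Λ(1)/r = (r − 1)/r regardless of which h is used, because off the training inputs φ is completely unconstrained.

```python
from itertools import product

def avg_ots_error(h, dX, X, Y):
    """Uniformly average the expected OTS zero-one error of a fixed
    hypothesis h over ALL single-valued targets phi. Off the training
    inputs dX, phi is unconstrained, so every h disagrees with the
    "average" phi on (r - 1)/r of the off-training-set points."""
    off = [x for x in X if x not in dX]
    total = 0.0
    for values in product(Y, repeat=len(X)):   # enumerate every phi
        phi = dict(zip(X, values))
        total += sum(phi[x] != h[x] for x in off) / len(off)
    return total / len(Y) ** len(X)

X = [0, 1, 2, 3]
Y = [0, 1, 2]                        # r = 3, so (r - 1)/r = 2/3
dX = [0, 1]                          # training inputs
h_const = {x: 0 for x in X}          # one candidate "algorithm output"
h_other = {0: 0, 1: 1, 2: 2, 3: 1}   # a very different one
```

Both hypotheses average to exactly 2/3, which is the assumption-free content of the NFL theorems: averaged over all targets, no choice of h (hence no learning algorithm) helps.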
Theorem 8. Assume OTS error, a vertical P(d | φ), homogeneous loss L, and a homogeneous test-set noise process. Let α index the priors P(φ). Then the uniform average over all α of P(c | d, α) equals Λ(c)/r.

Corollary 3. Assume OTS error, a vertical P(d | φ), homogeneous loss L, and a homogeneous test-set noise process. Let α index the priors P(φ), and let G(α) be a distribution over α. Assume G(α) is invariant under the transformation of the priors α induced by relabeling the targets φ. Then the average according to G(α) of P(c | m, α) equals Λ(c)/r.

Now define the empirical error

s ≡ Σ_{i=1}^{m} π[dX(i)] L(h[dX(i)], dY(i)) / Σ_{i=1}^{m} π[dX(i)].

As an example, for zero-one loss and single-valued h, s is the average misclassification rate of h over the training set. The π[dX(i)] terms in this definition can be replaced by a constant with no effect on the analysis below.

Theorem 9. For homogeneous L, OTS error, a vertical likelihood, and uniform P(f), P(c | s, d) = Λ(c)/r.

For the purposes of the next result, restrict things so that both hypotheses h and targets are single-valued (and therefore targets are written as functions φ), and there is no noise. Y is binary, and we have zero-one loss. Let the learning algorithm always guess the all-0s function, h*. The "punt signal" is given if dY contains at least one nonzero value. (That signal is supposed to indicate that one is unsure about using the fit h*.) Otherwise a "no-punt" signal is given. Then for the likelihood of (2.1), uniform π(x), and n >> m,

Theorem 10. For the h* learning algorithm, for all targets φ such that φ(x) = 0 for more than m distinct x, E(C_OTS | φ, punt, m) ≤ E(C_OTS | φ, no punt, m).

Note that for n >> m, almost all φ meet the stipulated condition.

Taken together, these results severely restrict how well an algorithm can be said to perform without knowledge of the prior over targets P(f). In particular, they say that any learning algorithm can just as readily perform worse than randomly as better. See paper one for a full discussion of this and other implications of these results.

3.2 An Example. Some examples of the NFL theorems are presented in paper one. Another one, particularly relevant for the discussion in this second paper, is as follows.

Example 5: Return to the scenario of Example 1 from paper one: We have no noise (so targets are single-valued functions φ), and the zero-one loss L(·,·). Fix two possible (single-valued) hypotheses, h1 and h2. Let
learning algorithm A take in the training set d, and guess whichever of h1 and h2 agrees with d more often (the "majority" algorithm). Let algorithm B guess whichever of h1 and h2 agrees less often with d (the "antimajority" algorithm). If h1 and h2 agree equally often with d, both algorithms choose randomly between them. Then averaged over all target functions φ, E(C | φ, m) is the same for A and B.

Consider the case where φ = h1, and for simplicity have h1(x) ≠ h2(x) for all x. Then with some abuse of notation, we can write P(cA, cB | φ, m) = δ[(cA, cB), (0, 1)], or P_{CA,CB|Φ,M}(0, 1 | φ, m) = 1 for short, where ci refers to the cost associated with algorithm i and "δ(a, b)" is the Kronecker delta function. In other words, algorithm A, the majority algorithm, will always guess perfectly for this target, whereas algorithm B, the antimajority algorithm, will never get a guess correct.

Now L(·,·) takes on only two values in this example. So by the argument in Section 4.4 of paper one, we know that Σ_φ P(cA, cB | φ, m) = Σ_φ P(cB, cA | φ, m). [This is a stronger statement than the generic NFL theorems, which only say that Σ_φ P(cA | φ, m) = Σ_φ P(cB | φ, m).] So the probability "mass" over all φ for having (cA, cB) = (a, b) is equal to the mass for having (cA, cB) = (b, a). Stated differently, if Σ_φ P(cB, cA | φ, m) is viewed as a function over the R² space of values of (cA, cB), that function is symmetric under cA ↔ cB.

This might suggest that all aspects of algorithms A and B are identical, if one considers the space of all possible f. This is not the case, however. In particular, the cA ↔ cB symmetry does not combine with the fact that there is a φ for which P_{CA,CB|Φ,M}(0, 1 | φ, m) = 1 to imply that there must be a φ for which P_{CA,CB|Φ,M}(1, 0 | φ, m) = 1. Despite the fact that P_{CA,CB|Φ,M}(0, 1 | h1, m) = 1, Σ_φ P_{CA,CB|Φ,M}(0, 1 | φ, m) can equal Σ_φ P_{CA,CB|Φ,M}(1, 0 | φ, m) by having many φ for which P_{CA,CB|Φ,M}(1, 0 | φ, m) is greater than 0, but none for which it equals 1 exactly.

More generally, consider any φ other than φ = h1 or φ = h2. For such a φ, there exists at least one x where φ(x) ≠ h1(x), and at least one x where φ(x) ≠ h2(x). Then so long as π(x) > 0 for all x, for any such φ there exists a training set d for which P(d | φ, m) > 0 and such that both h1(·) and h2(·) have disagreements with φ over the set of elements in X − dX. [Just choose d = {dX, φ(dX)} and choose dX to not include all of the elements x for which φ(x) ≠ h1(x) and to not include all of the elements x for which φ(x) ≠ h2(x).]

Therefore for any such φ, for either the majority or antimajority algorithm, E(C_OTS | φ, m) = Σ_d E(C_OTS | φ, d) P(d | φ, m) > 0. Therefore in particular, for any such φ, the antimajority algorithm has E(C_OTS | φ, m) > 0. Given that it is certainly true that E(C_OTS | φ, m) > 0 for all other φ (those that equal either h1 or h2), this means that for the antimajority algorithm, for no φ is it true that E(C_OTS | φ, m) = 0. Yet on the other hand, we know that for the majority algorithm, there are φ such that E(C_OTS | φ, m) = 0.

So in this scenario, despite the NFL theorems, there exist φ for which we expect cA to be far less than cB (e.g., for φ = h1 the difference in
expected costs is 1), but none for which the reverse is true. So for no φ will you go far wrong in picking algorithm A (rather than picking B), and for some φ you would go far wrong in picking algorithm B. In such a case, in this particular sense, one can say that algorithm A is superior to algorithm B, even without making any assumptions concerning targets.
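Example 5 can be verified exhaustively in a small instance. The following sketch is illustrative only (all names invented; h2 is taken as the everywhere-disagreeing complement of h1, ties contribute the average error of the two hypotheses, and the noise-free IID likelihood with uniform π makes every ordered input tuple dX equally likely). It enumerates all binary targets φ on a four-point X and computes E(C_OTS | φ, m) for the majority and antimajority algorithms:

```python
from itertools import product

def expected_ots(phi, choose, X, m, h1, h2):
    """E(C_OTS | phi, m): average OTS zero-one error over all ordered
    training input tuples dX (equally likely under uniform pi), with the
    noise-free training set d = {(x, phi(x))} and the rule `choose`
    picking between h1 and h2 by their agreement counts on d."""
    errs = []
    for dX in product(X, repeat=m):
        off = [x for x in X if x not in dX]     # off-training-set inputs
        a1 = sum(h1[x] == phi[x] for x in dX)   # h1's agreements with d
        a2 = sum(h2[x] == phi[x] for x in dX)
        e = lambda h: sum(h[x] != phi[x] for x in off) / len(off)
        if a1 == a2:
            errs.append((e(h1) + e(h2)) / 2)    # random tie-breaking
        else:
            errs.append(e(h1) if choose(a1, a2) else e(h2))
    return sum(errs) / len(errs)

X, m = [0, 1, 2, 3], 2
h1 = {x: 0 for x in X}                  # the all-0s hypothesis
h2 = {x: 1 for x in X}                  # disagrees with h1 everywhere
majority = lambda a1, a2: a1 > a2       # algorithm A: pick the better fit
antimajority = lambda a1, a2: a1 < a2   # algorithm B: pick the worse fit
targets = [dict(zip(X, v)) for v in product([0, 1], repeat=len(X))]
errA = [expected_ots(phi, majority, X, m, h1, h2) for phi in targets]
errB = [expected_ots(phi, antimajority, X, m, h1, h2) for phi in targets]
```

The φ-averages of errA and errB coincide (both equal 1/2, as the NFL theorems require), yet the majority algorithm attains zero error for some φ (namely h1 and h2) whereas the antimajority algorithm attains it for none, exactly the asymmetry argued above.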
3.3 The NFL Theorems, Cross-Validation, and Head-to-Head Minimax Behavior. In paper one it is pointed out that for some pairs of algorithms the NFL theorems may be met by having comparatively many targets in which algorithm A is just slightly worse than algorithm B, and comparatively few targets in which algorithm A beats algorithm B by a lot. This point is also discussed in example (5) just above. When we have two algorithms of this sort, we say that A is "head-to-head" minimax superior to B.¹

It is interesting to speculate about the possible implications of head-to-head minimax superiority for cross-validation. Consider two algorithms α and β. α is identical to some algorithm A, and β works by using cross-validation to choose between A and some other algorithm B. By the NFL theorems, α and β must have the same expected cost, on average. However the following might hold for many choices of A, B, the sampling distribution π(x), etc.: For most targets [i.e., most f, or most φ, or most P(φ), depending on which of the NFL theorems' averages is being examined] A and B have approximately the same expected OTS error, but β usually chooses the worse of the two, so in such situations the expected cost of β is (slightly) worse than that of α. In those comparatively few situations where A and B have significantly different expected OTS error, β might correctly choose between them, so the expected cost of β is significantly better than that of α for such situations.

In other words, it might commonly be the case that when asked to choose between two generalizers A and B in a situation where they have comparable expected cost, cross-validation usually fails, but in those situations where the generalizers have significantly different costs, cross-validation successfully chooses the better of the two. Similar behavior may hold even when using cross-validation to choose among more than two algorithms.
Under such a hypothesis, cross-validation still has the same average OTS behavior as any other algorithm. And there are actually more situations (fs) in which it fails than in which it succeeds (!). However under this hypothesis cross-validation

¹To simply say that A is minimax superior to B without the "head-to-head" modifier would imply instead something like max_f E(C | f, m, A) ≤ max_f E(C | f, m, B). Such minimax superiority is of little interest. Indeed, by the argument just below Lemma (1) in paper one, we know that for the random learning algorithm E(C | f, m) = Σ_c Λ(c) c / r for all f, so that max_f E(C | f, m) = min_f E(C | f, m) for that algorithm. Using the NFL theorems, this means that the random learning algorithm is minimax superior to all other learning algorithms.
has desirable head-to-head minimax behavior; its behavior will never be much worse than the best possible.²

So in particular, assume we are in such a scenario, and assume further that whatever f is, the generalizers among which we are choosing all perform well for that f, i.e., we believe our generalizers are well suited to the target f, although we perhaps cannot formalize this in terms of priors over f, etc. Then we are assured that cross-validation will not perform much worse than the best of those generalizers (all of which perform well for the f at hand) and may perform significantly better. (It should be noted though that even if this is the case, there are still a number of caveats concerning the use of cross-validation, caveats that, unlike the NFL theorems but just like head-to-head minimax behavior, apply regardless of f. See the averaging-over-hypotheses section below.)

Note that the desirable head-to-head minimax behavior of this scenario would be prior-independent; if it holds for all targets f, then it certainly holds for all P(f). In addition, such behavior would be consistent with the (conventional) view that cross-validation is usually an accurate estimator of generalization performance. It is important to note though that one can explicitly construct cases where cross-validation does not have this desirable head-to-head minimax behavior. (See Section 8 in Wolpert 1993a.) Moreover, as discussed in Section 5.2 of paper one, there are cases where generalization error and cross-validation error are statistically independent.

In addition, it is not at all clear why one should pay attention to distributions conditioned on training set size m rather than on the actual training set d at hand. Yet it is only distributions conditioned on m rather than d that can be said to have head-to-head minimax distinctions (of any sort) with regard to targets f. [To see this, for simplicity consider a deterministic learning algorithm.
For such an algorithm, P(c | f, d) = δ[c, χ(f, d)] for some function χ(·,·). The immediate corollary of Theorem (1) is that for such an algorithm, for any fixed training set d and cost c, the number of targets f resulting in that cost is a constant, independent of the learning algorithm.]

It should also be noted that in this hypothesis concerning cross-validation's head-to-head minimax properties, cross-validation constitutes a different learning algorithm for each new set of generalizers it chooses
²It is important to keep in mind though that even if head-to-head minimax behavior does end up playing a priori favorites between algorithms, it is not clear why one should care about such behavior rather than simply expected cost (which by the NFL theorems plays no such favorites). One intriguing possible answer to this question is that in choosing between species A and species B (and their associated organic learning algorithms), natural selection may use head-to-head minimax behavior. The idea is that for those targets f for which A's behavior is just slightly worse than B's, equilibrium has a slightly smaller population for species A than for species B. But if there is any nonlinearity in the system, then for any fs for which B's behavior is far worse than A's, B goes extinct. Therefore if over time the environment presents A and B with a series of fs, the surviving species will be the one with preferable head-to-head minimax behavior.
among. So even if the hypothesis is true, it would not mean that there is a single learning algorithm that is head-to-head minimax superior to all others. Rather it would mean that given any (presumably not too large) set of learning algorithms, one can construct a new one that is head-to-head minimax superior to the algorithms in that set.

3.4 More General Issues Concerning Head-to-Head Minimax Behavior. The discussion above leads to the question of whether one can construct a fixed learning algorithm that is head-to-head minimax superior to all others. (As a possible alternative, it may be that there are "cycles," in which algorithm α is minimax superior to β, and β is to γ, but in addition γ is superior to α.)

It seems unlikely that the answer to this question is yes. After all, due to the NFL theorems, the smallest max_f E(C | f, m) can be for our candidate learning algorithm is Σ_c c Λ(c)/r. However for all targets f, there is at least one learning algorithm that has zero error on that f (the algorithm that guesses that f regardless of the data set). So for any f there is always an algorithm that outperforms our candidate algorithm on that f by the amount Σ_c c Λ(c)/r. It is hard to reconcile these facts with our hypothesis, especially given that Σ_c c Λ(c)/r is usually quite large (e.g., for zero-one loss, it is (r − 1)/r).

It may be that if one restricts attention to "reasonable" algorithms (presumably defined to exclude the always-guess-f-regardless-of-d algorithm, among others), there is such a thing as an algorithm that is head-to-head minimax superior to all others. This raises the following interesting issue: All learning algorithms that people use are very similar to one another. To give a simple example, almost all such algorithms try to fit the data, but also restrict the "complexity" of the hypothesis somehow. So the picture that emerges is that people use learning algorithms tightly clustered in a tiny section of the space of all algorithms.
It may be that there is an algorithm that is head-to-head minimax superior to all those in the tiny cluster, even if there is no such thing as a universally head-to-head minimax superior algorithm. If that is the case, then it would be possible to construct an algorithm "superior" to those currently known. But this would be possible only because people have had such limited imagination in constructing learning algorithms.

There are a number of other interesting hypotheses related to minimax behavior. For example, it may be that in many situations not only is cross-validation head-to-head minimax superior to the generalizers it is choosing among, but also superior to anti-cross-validation. As another hypothesis, say we have two algorithms A and B where A is head-to-head minimax superior to B. So E(C | f, m, B) "flares out" above E(C | f, m, A) for some f. Now change the loss function to be nonhomogeneous, in such a way that what used to be large values of c become much larger whereas other values barely change. Given the "flaring out" behavior, this may lead to A's being superior to B even in terms of Σ_f E(C | f, m) [since at those "flare" f, the difference between E(C | f, m, A) and E(C | f, m, B) has
been increased, and there is little change in the difference for those f for which E(C 1 f . m . B ) < € ( C f.nz.A)I. In fact, for nonhomogeneous loss functions there are distinctions between algorithms in terms of If€( C 1 f . ni), as is discussed below. The line of reasoning just given builds upon this fact: it suggests that when there are head-to-head minimax differences between algorithms A and B for some homogeneous loss function L, then there are likely to be differences between the raw averages I, E(C I f . m. A ) and Cf E(C I f . m. B ) for an associated nonhomogeneous loss function L’.
4 Averaging over Generalizers Rather Than Targets
In all of the discussion up to this point, we have averaged over entities concerning targets [namely f, φ, or P(f)] while leaving entities concerning the hypothesis [e.g., the generalizer P(h | d)] fixed. Although the results of such an analysis are illuminating, it would be nice to have alternative results in which one does not have to specify a distribution over f/φ/P(f). Such results would be prior-independent. One way to do this is to hold one of f/φ/P(f) fixed (though arbitrary), and average over entities concerning the hypothesis instead. It is not clear if one can formulate this approach in so sweeping a manner as the NFL theorems, but even its less sweeping formulation results in some interesting conclusions. In that they are prior-independent "negative" conclusions, these conclusions constitute a first principles proof that, for example, cross-validation is non-Bayesian, i.e., they constitute a proof that there is no f/φ/P(f) for which cross-validation will necessarily work, independent of the learning algorithms it is choosing among.
4.1 Averages over Entities Concerning the Hypothesis. Consider the following situation: the target f is fixed, as is the training set d. There are two hypothesis functions, h1 and h2. For simplicity, I will restrict attention to the (most popular) case where hs are single-valued, so expressions like "h(x)" are well-defined. Consider two strategies for choosing one of the h_i: (A) choose whichever of the h_i agrees more often with d, with random tie-breaking; (B) choose whichever of the h_i agrees less often with d, with random tie-breaking. Note that if the two hs agree with the training set d equally as often, the strategies are equivalent. To analyze the relative merits of strategies A and B, start with the following lemma, proven in Appendix A.

Lemma 3. For all targets f, for all m′ < n, for all training sets d having m′ distinct elements, and for all sets of m′ values for the values of the hypothesis on the elements of dX, h(x ∈ dX),
Existence of Distinctions Between Learning Algorithms
1405
Σ_{h(x ∉ dX)} P(c | f, h, d) is the same function of the cost c, for an OTS homogeneous loss that is a symmetric function of its arguments.
Now consider the quantity Σ_{h1,h2} P(c | f, h1, h2, d, i), where the variable i is either A or B, depending on which strategy we are using. This quantity tells us whether A or B is preferable, if we uniformly average over all pairs of hs one might use with A and/or B. (Nonuniform averages are considered below.) As such, it is a measure of whether A is preferable to B or vice versa. In Appendix A the following result is derived using Lemma 3:

Theorem 11. For an OTS homogeneous symmetric error, Σ_{h1,h2} P(c | f, h1, h2, d, A) = Σ_{h1,h2} P(c | f, h1, h2, d, B), for all f and d.
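Theorem 11 can be checked by brute-force enumeration on a toy problem. In this sketch (arbitrary small X, Y, and target; zero-one OTS loss, with random tie-breaking folded into expected costs), the majority strategy A and the antimajority strategy B give the same total expected cost over all pairs (h1, h2):

```python
from itertools import product

X = [0, 1, 2, 3]                      # input space
Y = [0, 1]                            # output space
f = {0: 0, 1: 1, 2: 1, 3: 0}          # an arbitrary fixed target
d_X = [0, 1]                          # inputs of the (noise-free) training set
ots = [x for x in X if x not in d_X]  # off-training-set test points

def agreement(h):                     # how often h agrees with the training set
    return sum(h[x] == f[x] for x in d_X)

def ots_cost(h):                      # zero-one OTS error, uniform over q
    return sum(h[q] != f[q] for q in ots) / len(ots)

def expected_cost(h1, h2, strategy):
    a1, a2 = agreement(h1), agreement(h2)
    if a1 == a2:                      # random tie-breaking: average the two costs
        return (ots_cost(h1) + ots_cost(h2)) / 2
    if strategy == 'A':               # majority: pick the better fit to d
        return ots_cost(h1) if a1 > a2 else ots_cost(h2)
    return ots_cost(h1) if a1 < a2 else ots_cost(h2)   # antimajority

hs = [dict(zip(X, v)) for v in product(Y, repeat=len(X))]
sum_A = sum(expected_cost(h1, h2, 'A') for h1 in hs for h2 in hs)
sum_B = sum(expected_cost(h1, h2, 'B') for h1 in hs for h2 in hs)
print(sum_A == sum_B)   # True: summed over all pairs, A equals B
```

Intuitively, once one sums over all off-training-set parts of the hypotheses, the chosen hypothesis' OTS values are uniform whichever strategy did the choosing.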
So even if the likelihood is not vertical, for any f and d, averaged over all hypotheses h1 and h2, strategy A equals strategy B. The same reasoning can be used to equate cross-validation with anti-cross-validation. Let G1 and G2 be two learning algorithms, and let strategies A and B now refer to cross-validation and anti-cross-validation, respectively. Since we are only allowing single-valued hs, and there are r^n such hs, any generalizer is a mapping from a training set d to a point on the r^n-dimensional unit simplex. Therefore any generalizer is a point in the A-fold Cartesian product of such simplices, where A is the number of possible training sets. (A is set by n, r, and the range of allowed m and m′.) So we can talk about averaging over generalizers. Consider uniformly averaging P(c | f, G1, G2, d, i) over all G1 and G2. Since the training set d is fixed, this corresponds to first uniformly averaging over all possible hypotheses h1 and h2 constructed from d, and then forming an independent average over all possibilities of whether it is the hypothesis labeled h1 or the one labeled h2 that corresponds to the generalizer with lower cross-validation error. (That "independent average" is set by the behavior of the G_i over the proper subsets of d that determine cross-validation error over d.) Consider just the inner-most of these two sums, i.e., without loss of generality say that it is h1 that corresponds to strategy A and h2 that corresponds to strategy B. Then using the reasoning behind Theorem (11), we deduce the following.

Theorem 12. For an OTS homogeneous symmetric error, the uniform average over all generalizers G1, G2 of P(c | f, G1, G2, d, cross-validation) equals the uniform average over all G1 and G2 of P(c | f, G1, G2, d, anti-cross-validation).
So for any f, averaged over all generalizers, cross-validation performs the same as anti-cross-validation. In this sense, if one does not restrict the set of generalizers under consideration, then regardless of the target, cross-validation is no better than anti-cross-validation. The immediate implication is that without such a restriction, there is no prior P(f) that justifies (in the manner considered in this section) the use of cross-validation rather than anti-cross-validation. In this sense, cross-validation cannot
be justified as a Bayesian procedure. Without restrictions on the set of generalizers under consideration, one cannot say something like "cross-validation is more appropriate for choosing among the generalizers than is anti-cross-validation if one assumes such-and-such a prior." All of this holds even if we are considering more than two hs at a time [in the case of Theorem (11)], or more than two generalizers at a time [in the case of Theorem (12)]. In addition, Theorem (12) holds if we are using any d-determined strategy for choosing between algorithms: nothing in the proof of Theorem (12) used any property of cross-validation other than the fact that it uses only d and the G_i to decide between the G_i. A similar result obtains if we simply average over generalizers G without any concern for strategies. With some abuse of notation, we can write such an average as proportional to
where "d[P(h | d)]" means the average over the r^n-dimensional simplex that corresponds to the learning algorithm's possible generalizations from training set d. We can expand our integral as
By symmetry, this is proportional to Σ_h P(c | f, h, d). We therefore have the following corollary of Lemma (3):

Corollary 4. For OTS error, the uniform average over all generalizers G of P(c | f, d, G) is a fixed function of c, independent of f and d.
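Corollary 4 also lends itself to exhaustive checking. In the sketch below (tiny X and Y, two arbitrary targets, zero-one OTS loss; all choices illustrative), the histogram of OTS costs taken uniformly over every hypothesis h is the same for both targets:

```python
from itertools import product
from collections import Counter

X, Y = [0, 1, 2, 3], [0, 1]
d_X = [0, 1]
ots = [x for x in X if x not in d_X]
hs = [dict(zip(X, v)) for v in product(Y, repeat=len(X))]  # all 16 hypotheses

def cost_hist(f):
    # histogram of zero-one OTS cost, taken uniformly over every hypothesis h
    counts = Counter(sum(h[q] != f[q] for q in ots) / len(ots) for h in hs)
    return sorted(counts.items())

f1 = {0: 0, 1: 0, 2: 0, 3: 0}
f2 = {0: 1, 1: 0, 2: 1, 3: 1}   # two arbitrary, quite different targets
print(cost_hist(f1) == cost_hist(f2))   # True: same fixed function of c
```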
[Of course, this in turn results in an identical result for the distribution P(c | f, m, G).] This has some rather peculiar implications. It means, for example, that for fixed targets f1 and f2, if f1 results in better generalization than f2 with the learning algorithms in some set S, then f2 must result in better generalization with all algorithms not in S. In particular, if for some favorite learning algorithm(s) a certain "well-behaved," "reasonable" f results in better generalization than does a "random" f, then that favorite f results in worse than random behavior for all remaining algorithms. For example, let f be a straight line, and S any set of algorithms that generalizes well for that target. Then the algorithms not in S have worse than random generalization on that f.

4.2 Implications of Theorems (11) and (12). It is worth spending a moment to carefully delineate what has been shown here. Theorems (11) and (12) mean that there is no prior P(f) such that, without regard to h1 and h2 [G1 and G2 in the case of Theorem (12)], strategy A is preferable to strategy B, for either of the two sets of strategies considered in those
theorems. [Note how much stronger this statement is than saying that averaged over all P(f), the two strategies are equal.] At best, strategy A beats strategy B for certain P(f) and certain associated h1 and h2. Exactly which h1 and h2 result in the superiority of strategy A will change with
P(f). So a technique like cross-validation cannot be justified by making assumptions only concerning P(f), nor by making assumptions only concerning G1 and G2. Rather one must make assumptions about how G1 and G2 correspond to P(f). It is the interplay between all three quantities that determines how well cross-validation performs. [See Theorem (1) of Section (3) of Wolpert 1994a and the inner product discussion at the beginning of Section 3.1 of this paper.] Since Theorems (11) and (12) hold for all training sets d, they mean in particular that if f is fixed while d is averaged over, then for either the pair of strategies considered in Theorem (11) or that of Theorem (12), the two strategies have equal average OTS error. This is important because such a scenario of having f fixed and averaging over ds is exactly where you might presume (due to computational learning theory results) that strategy A beats strategy B. How can we reconcile the results of this section with those computational learning theory results? Consider the case of Theorem (11). My suspicion is that the following is happening: There are relatively few {h1, h2} pairs for which strategy A wins, but for those pairs, it wins by a lot. There are many pairs for which strategy B wins, but for those pairs, strategy B only wins by a little. In particular, I suspect that those "many pairs" are the relatively many pairs for which f agrees with h1 almost exactly as often as it agrees with h2. If this is indeed the case, it means that strategy A is unlikely to be grossly in error. Note that this is a confidence-interval statement, exactly the kind that the VC theorems apply to. (However, it is a confidence-interval statement concerning off-training set error; in this it is actually an extension of VC results, which concern IID error.)
4.3 Restricted Averages over Quantities Involving the Hypothesis. Since Theorem (11) tells us that we must "match" h1 and h2 to f in some sense for the majority algorithm (A) to beat antimajority, one might expect that if we sum only over those hypotheses h1 and h2 lying close to the target f, then A beats B. Intuitively, if you have a pretty good idea of what f is, and restrict hs to be fairly good approximators of that f, then you might expect majority to beat antimajority. Interestingly, the exact opposite is usually true, if one measures how "close" an h is to f as E(C_IID | f, h) = Σ_{q, y_F} L[h(q), y_F] f(q, y_F) π(q). The analysis of this is in Appendix B, where, in particular, the following results are derived.
Theorem 13. For OTS error, any training set d, and any single-valued target φ, if there is no noise in the generation of the training set, then for the majority and antimajority strategies A and B,

Σ_{h1,h2} [E(C | h1, h2, φ, d, B) − E(C | h1, h2, φ, d, A)] ≤ 0

where the sum is over all h1 and h2 such that both E(C_IID | φ, h1) and E(C_IID | φ, h2) are less than 1.

Theorem 14. For zero-one OTS error, any training set d, and any single-valued target φ, if there is no noise in the generation of the training set, then for the majority and antimajority strategies A and B,

Σ_{h1,h2} [E(C | h1, h2, φ, d, B) − E(C | h1, h2, φ, d, A)] < 0

where the sum is over all h1 and h2 such that both E(C_IID | φ, h1) and E(C_IID | φ, h2) are less than some ε < 1.
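Theorems (13) and (14) can be probed numerically. This sketch (arbitrary small X, Y, a noise-free single-valued target φ, zero-one loss, and the arbitrary choice ε = 0.5; tied pairs contribute equally to both strategies and are skipped) sums E(C | h1, h2, φ, d, B) − E(C | h1, h2, φ, d, A) over all hypothesis pairs with IID error below ε:

```python
from itertools import product

X, Y = [0, 1, 2, 3], [0, 1]
phi = {0: 0, 1: 1, 2: 0, 3: 1}        # an arbitrary single-valued target
d_X = [0, 1]
d = {x: phi[x] for x in d_X}          # noise-free training set
ots = [x for x in X if x not in d_X]
eps = 0.5                             # restrict to hs with IID error < eps

def iid_err(h):                       # E(C_IID | phi, h): zero-one, uniform pi
    return sum(h[x] != phi[x] for x in X) / len(X)

def ots_err(h):
    return sum(h[q] != phi[q] for q in ots) / len(ots)

def fit(h):                           # agreement with the training set
    return sum(h[x] == d[x] for x in d_X)

hs = [dict(zip(X, v)) for v in product(Y, repeat=len(X))]
good = [h for h in hs if iid_err(h) < eps]

diff = 0.0                            # sum of E(C|...,B) - E(C|...,A)
for h1 in good:
    for h2 in good:
        if fit(h1) == fit(h2):
            continue                  # tie: both strategies cost the same
        better, worse = (h1, h2) if fit(h1) > fit(h2) else (h2, h1)
        diff += ots_err(worse) - ots_err(better)
print(diff < 0)   # True: antimajority wins within the restricted class
```

The mechanism is visible in the enumeration: among hypotheses close to φ, the one that fits the training set better must spend its error budget off the training set.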
As usual, since these results hold for any particular training set d, they also hold if one averages over ds generated from φ. In addition, similar results hold if A and B refer to cross-validation and anti-cross-validation rather than the majority strategy and the antimajority strategy. (Rather than sum over all h1 and h2 in a certain class, one sums over all generalizers G1 and G2 that produce hs that lie in that class.) The opposite results hold if we instead restrict the hypotheses h1 and h2 to lie far from the target φ, i.e., if we restrict attention to hs for which E(C_IID | φ, h) > ε.² It is currently unknown what happens if we adopt other kinds of restrictions on the allowed hs. In particular, it is not clear what happens if one h must be in one region and the other h in a different region. Say we accept it as empirically true that using the majority algorithm does, in the real world, usually result in lower expected cost than using the antimajority algorithm, as far as OTS zero-one loss is concerned. Say we accept this is true even in those cases where we use hs that are similar, and therefore have similar goodness-of-fits to f. Given Theorems (13) and (14), this implies that there must be some rather subtle relationship between the fs we encounter in the real world on the one hand, and the pairs of h1 and h2 that we use the majority algorithm with on the other. In particular, it cannot be that the only restriction on those pairs of h1 and h2 is that they lie in a sphere of radius ε centered on f. However it might be that certain "reasonable" nonuniform distributions over the set

²In the terminology of Appendix B, we still get ξ1 < ξ2. In Appendix B this meant that the upper limit on h2's error over X − dX was less than the upper limit on h1's error, while they shared lower limits. Here it instead means that the lower limit on h2's error over X − dX is lower than the lower limit on h1's error, while they share the same upper limit.
of allowed hs result in our empirical truth. Alternatively, there may be other geometric restrictions (besides a sphere) that allow h1 to be close to h2, but that also allow our empirical truth. As one example, it is conceivable that our empirical truth is allowed if h1 and h2 do not fall in a sphere of radius ε centered on f, but rather in an ellipse centered off of f.

5 Nonhomogeneous Loss Functions

5.1 Overview. This section considers the case where L(·,·) is not homogeneous, so that the NFL theorems do not apply, and therefore we usually can make a priori distinctions between algorithms. This section is not meant to be an exhaustive analysis of such loss functions, but rather to illustrate some of the properties and issues surrounding such functions. An example of such an L(·,·) is the quadratic loss function, L(a, b) = (a − b)² for finite Y. For the quadratic loss function (and in general for any convex loss function when Y is finite), everything else being equal, an algorithm whose guessed Y values lie away from the boundaries of Y is to be preferred over an algorithm that guesses near the boundaries, in that this will give lower error. In addition, consider using a quadratic loss function with two learning algorithms, P1(h | d) and P2(h | d). Construct the algorithms so that they are related in the following manner: algorithm 2 is a deterministic algorithm that makes the single-valued guess given by the expected y_H guessed by algorithm 1, in response to the training set d and test set question q at hand. [More formally, P2(h | d) = δ(h − h*), where h*(q, y) ≡ δ[y, E1(y_H | d, q)] = δ[y, Σ_{y′} ∫ dh′ y′ h′(q, y′) P1(h′ | d)], where h′ is a dummy h argument and y′ a dummy y argument.] It is not hard to show (Wolpert 1994a) that for all d and q, E2(C | q, d) ≤ E1(C | q, d), i.e., the expected cost for algorithm 2 is always less than or equal to that of algorithm 1, regardless of properties of the target.⁴ This holds even for OTS error, so long as the loss function is quadratic. None of this means that the intuition behind the NFL theorems is faulty for nonhomogeneous L(·,·). However it does mean that that intuition now results in theorems that are not so sweeping as the NFL theorems. In essence, one must "mod out" the possible a priori advantages of one algorithm over another that are due simply to the fact that

⁴Phrased differently, for a quadratic loss function, given a series of experiments and a set of deterministic learning algorithms G_i, it is always preferable to use the average generalizer G′ ≡ Σ_i G_i / N (N the number of algorithms) for all such experiments rather than to randomly choose a new member of the G_i to use for each experiment. Intuitively, this is because such an average reduces variance without changing bias. (See Perrone 1993 for a discussion of what is essentially the same phenomenon, in a neural net context.) [Or to put it another way, no matter what value a real-number "truth" z has, if one has two real numbers α and β, then it is always true that (z − [α + β]/2)² ≤ (z − α)²/2 + (z − β)²/2.] Note though that this effect in no way implies that using G′ for all the experiments is better than using any single particular G ∈ {G_i} for all the experiments.
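The bracketed inequality in footnote 4, (z − [α + β]/2)² ≤ (z − α)²/2 + (z − β)²/2, is just convexity of the square. A quick numerical sketch with arbitrary random values:

```python
import random

random.seed(0)
z = 0.7   # an arbitrary real-valued "truth"
worst = max(
    (z - (a + b) / 2) ** 2 - ((z - a) ** 2 + (z - b) ** 2) / 2
    for a, b in ((random.uniform(-5, 5), random.uniform(-5, 5))
                 for _ in range(10000))
)
print(worst <= 1e-12)   # True: the averaged guess is never worse
```

Algebraically the difference equals −(a − b)²/4, which is why averaging reduces variance without changing bias.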
one or the other of those algorithms tends to produce y_H values that are favored a priori by the L(·,·) at hand. The primary tool for deriving these less-sweeping results is to compare the OTS behavior of a learning algorithm P(h | d) to that of its "scrambled" or "randomized" version, in which P(y_H) is preserved, but the relationship between the training set d and the hypothesis h is randomly scrambled. Such comparisons show that all the a priori advantages conferred on a particular learning algorithm by a nonhomogeneous L(·,·) are due simply to the clumping of P(y_H) in regions that, due to L(·,·), are "good." [An example of such a region is the region equidistant from the borders of Y, for a quadratic L(·,·).] In particular, according to these comparisons, there is no extra advantage conferred by how the learning algorithm chooses to associate y_H s with particular training sets d and test set inputs q, i.e., there is no advantage conferred by the d-dependence of the algorithm. This "scrambling" tool can also be used to help analyze the case where nonhomogeneous noise-processes are superimposed on single-valued target functions (recall that some of the NFL theorems require homogeneity of such noise). For reasons of space though, such an analysis is not presented here.

5.2 The Equivalence of Any Learning Algorithm with Its Scrambled Version. For reasons of space, I will consider only the case where all targets f are allowed (rather than the case where all fs expressible as some φ with noise superimposed are allowed). Since the focus is on how (if at all) the statistical relationship between d and h embodied in the generalizer correlates with error, we must allow the training set d to vary. Accordingly, the results are conditioned on training set size m rather than on the precise training set d at hand. The following theorem, which holds regardless of the loss function, is proven in Appendix C.
Theorem 15. For a vertical likelihood, uniform P(f), and OTS error, all learning algorithms with the same P(y_H | m) have the same P(c | m).
Example 6: This example presents an explicit corroboration of Theorem (15) in a special case. Say we have quadratic loss L(·,·). In general, we will be interested only in the case where r > 2. (For binary Y, e.g., Y = {0, 1}, quadratic loss is the same as zero-one loss, and we are right back in the setting where the NFL theorems apply.) Rather than consider P(c | m) directly, consider the functional of it, E(C | m). For quadratic L(·,·), E(C | m) = E[(y_H − y_F)² | m]. Expanding, we get
E(C | m) = E[(y_H)² | m] + E[(y_F)² | m] − 2E(y_H y_F | m)

E[(y_F)² | m] is some constant, independent of the learning algorithm, determined by the geometry of Y, the likelihood, etc. E[(y_H)² | m] does
depend on the learning algorithm, and is determined uniquely by P(y_H | m) [in accord with Theorem (15)]. The only term left to consider is the correlation term E(y_H y_F | m). Write this term as ∫ dy_F y_F E(y_H | y_F, m) P(y_F | m). Now for the assumptions going into Theorem (15), E(y_H | y_F, m) = E(y_H | m).⁵ Accordingly, there is no expected correlation between y_H and y_F: E(y_H y_F | m) = E(y_H | m) × E(y_F | m). This means, in turn, that our correlation term depends only on the learning algorithm through E(y_H | m), which in turn is uniquely determined by P(y_H | m). Therefore the theorem is corroborated; E(C | m) depends on the learning algorithm only through P(y_H | m).⁶

Now P(c | m) = Σ_f P(c | f, m) P(f | m). Therefore P(c | m) for a uniform P(f) is proportional to Σ_f P(c | f, m), where the proportionality constant is independent of the learning algorithm. This establishes the following corollary of Theorem (15).
Corollary 5. Consider two learning algorithms that would have the same P(y_H | m) if P(f) were uniform. For a vertical likelihood and OTS error, the uniform average over f of P(c | f, m) is the same for those two algorithms.

Now make the following definition.

Definition 1. The learning algorithm B is a "scrambled" version of the learning algorithm A if and only if for all dX, Σ_{dY} P(h | dX, dY, B) = Σ_{dY} P(h | dX, dY, A).

⁵To see this, write P(y_H | y_F, m) = Σ_{d,q} ∫ df dh h(q, y_H) P(h | d) P(d, q, f | y_F, m). Use Bayes' theorem to write P(d, q, f | y_F, m) = P(y_F | f, d, q, m) P(f, d, q | m)/P(y_F | m) = f(q, y_F) P(q | d) P(d | f) P(f)/P(y_F | m). Combining, P(y_H | y_F, m) = Σ_{d,q} ∫ df dh h(q, y_H) P(h | d) f(q, y_F) P(q | d) P(d | f) P(f)/P(y_F | m). In the usual way, we take P(f) to be a constant, have the P(q | d) restrict q to lie outside of dX, and then break the integral over f into two integrals, one over the values of f(x ∈ dX), and one over the values of f(x ∉ dX). The integral over f(x ∉ dX) transforms the f(q, y_F) term into a constant, independent of q and y_F. We can then bring back in the P(f), and replace the integral over f(x ∈ dX) with an integral over all f (up to overall proportionality constants). So up to proportionality constants, we are left with P(y_H | y_F, m) = Σ_{d,q} ∫ dh h(q, y_H) P(h | d) P(q | d) P(d)/P(y_F | m). Now write P(y_F | m) = Σ_{d,q} ∫ df P(y_F | f, d, q) P(f, d, q | m) ∝ Σ_{d,q} ∫ df f(q, y_F) P(q | d) P(d | f). As before, the P(q | d) allows us to break the integral over all f into two integrals, giving us some overall constant. So P(y_H | y_F, m) ∝ Σ_{d,q} ∫ dh h(q, y_H) P(h | d) P(q | d) P(d). However this is exactly the expression we get if we write out P(y_H | m). QED.

⁶As an aside, consider the case where Y = {1, 2, ..., r} and r is odd. Now E[(y_H)² | m] − 2E(y_H | m) E(y_F | m) = −[E(y_F | m)]² + E{[y_H − E(y_F | m)]² | m}. So guessing such that y_H always equals E(y_F | m) gives best behavior. In particular, for the likelihood of equation (2.1), E(y_F | m) = Σ_{i=1}^r i/r = (r + 1)/2, so best behavior comes from guessing halfway between the lower and upper limits of Y. Similarly, E[(y_H)² | m] = Var(y_H | m) + [E(y_H | m)]² [with the implicit notation that "Var(a | b)" is the variance of a, for probabilities conditioned on b]. So for two learning algorithms A and B with the same E(y_H | m), the algorithm with the smaller variance is preferable. These results justify the claims made at the beginning of this subsection, for the special case of a uniform prior.
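The aside in footnote 6 is easy to confirm numerically: with y_F uniform on Y = {1, ..., r}, the constant guess minimizing expected quadratic loss is (r + 1)/2. A sketch with the arbitrary choice r = 3:

```python
# For Y = {1, ..., r} with y_F uniform on Y, find the constant guess y_H
# minimizing expected quadratic loss; r = 3 is an arbitrary odd choice
r = 3
Y = range(1, r + 1)

def expected_loss(y_h):
    return sum((y_h - y_f) ** 2 for y_f in Y) / r

best = min(Y, key=expected_loss)
print(best)   # 2 = (r + 1)/2: halfway between the limits of Y
```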
(The sums are implicitly restricted to dYs with the same number of elements as dX.)
Intuitively, if B is a scrambled version of A, then B is identical to A except that A's correlation between its guess h and the (output components of the) data has been scrambled in B. As an example, view (a deterministic) algorithm A as a big table of h values, indexed by training sets d. Then one scrambled version of A is given by the same table where the entries within each block corresponding to a particular dX have been permuted. Note that if for certain training sets and targets A creates hs with low training set error, then in general a scrambled version of A will have much higher error on those training sets. Note also that scrambling does not involve randomizing the hs the algorithm can guess. It does not touch the set of hs. Rather scrambling involves randomizing the rule for which h is associated with which d. The following result is proven in Appendix A:
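The lookup-table picture of scrambling can be sketched directly. Below, a toy deterministic "algorithm" (fit the data, guess 0 elsewhere; an arbitrary stand-in, not from the paper) is scrambled by permuting the hypotheses within one dX block, leaving the multiset of guessed hs, and hence Σ_{dY} P(h | dX, dY), unchanged:

```python
import random
from itertools import product

random.seed(1)
X, Y = [0, 1, 2], [0, 1]

def fit(d):   # a deterministic "algorithm": fit the data, guess 0 elsewhere
    return tuple(dict(d).get(x, 0) for x in X)

d_X = (0, 1)                      # fix the input part of the training set
block = {}                        # the d_X block of A's lookup table: d -> h
for dy in product(Y, repeat=len(d_X)):
    d = tuple(zip(d_X, dy))
    block[d] = fit(d)

# scrambled version B: permute which h goes with which d within the block
ds = list(block)
perm = ds[:]
random.shuffle(perm)
scrambled = {d: block[p] for d, p in zip(ds, perm)}

# Definition 1 holds: the multiset of hs guessed for this d_X is unchanged
print(sorted(block.values()) == sorted(scrambled.values()))   # True
```

A's entries typically fit their own training sets, while B's permuted entries generally do not, which is exactly the loss of d-dependence described above.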
Theorem 16. Assume the (vertical) likelihood of equation (2.1). Then if learning algorithm B is a scrambled version of A, it follows that for uniform P(f), P(y_H | B, m) = P(y_H | A, m).

Combining Theorem (16) and Corollary (5), we see that for the likelihood of equation (2.1) and OTS error, if algorithm B is a scrambled version of algorithm A, then the algorithms have the same uniform average over f of P(c | f, m). This constitutes an NFL theorem relating an algorithm to its scrambled version. Note that if the sampling distribution π(x) is uniform, we can relax the definition of scrambling to simply have Σ_d P(h | d, A) = Σ_d P(h | d, B) and Theorem (16) will still hold. [For such a case the non-zero values of P(q | dX) will be a function purely of m′ that can be absorbed into the corresponding quantity discussed in Appendix A.]
5.3 Implications of the Equivalence of any Learning Algorithm with Its Scrambled Version. The results of the preceding subsection tell us that there is no a priori reason to believe that there is any value in a particular learning algorithm's relationship between training sets d and resulting hypotheses h. All that matters about the algorithm is how prone it is to guess certain y_H s, not the conditions determining when it makes those guesses. However, these results are not as strong as the NFL theorems. As an example, these results do tell us that cross-validation (used to choose among a prefixed set of learning algorithms) and its scrambled version always give the same uniform f-average of P(c | f, m) (if one has OTS error, etc.). So in that sense there is no a priori justification for the d-dependence of cross-validation, even for nonhomogeneous loss functions. However for nonhomogeneous loss functions we cannot automatically
equate cross-validation with anti-cross-validation, as we could for homogeneous loss functions. This is because the two techniques can have different P(y_H | m). In fact, there are some scenarios in which cross-validation has a better uniform f-average of P(c | f, m), and some in which anti-cross-validation wins instead. Not only can one not equate the techniques, one cannot even say that the relative superiority of the techniques is always the same.

Example 7: As an example of where anti-cross-validation has a better uniform f-average of its expected cost than does cross-validation, let both techniques be used to choose between the pair of learning algorithms A and B. Let A be the algorithm "always guess h1 regardless of the data" and B be the algorithm "always guess h2 regardless of the data." Let the training set consist of a single element, and let the "validation set" part of the training set be that single element. So cross-validation will guess whichever of h1 and h2 is a better fit to the training set, and anti-cross-validation will guess whichever is a worse fit to the training set. Let Y = {1, 2, 3}, and X = {1, 2, ..., n} where n is even. Let h1(x) = 2 for x ∈ {1, 2, ..., n/2}, and let it equal 1 for the remaining x. Conversely, h2(x) = 1 for x ∈ {1, 2, ..., n/2}, and 2 for the remaining x. Assume we are using the quadratic loss function to measure C, and that both cross-validation and anti-cross-validation use the quadratic loss function to choose between h1 and h2. Assume the likelihood of equation (2.1) and a uniform sampling distribution. Use a uniform P(f). As shown in example (6), for this scenario, up to an additive constant independent of the learning algorithm, E(C | m) = E{[y_H − E(y_F | m)]² | m}. For our scenario, for a uniform P(f), E(y_F | m) = 2. So in comparing the uniform f-average of expected cost for cross-validation with that for anti-cross-validation, it suffices to compare the values of E[(y_H − 2)² | m] for the two techniques.
Now by symmetry, whatever dX is, dY is equally likely to be 1, 2, or 3. Therefore if dX ∈ {1, 2, ..., n/2}, for cross-validation, there is a two-thirds chance of choosing h1 (cross-validation will choose h1 if dY = 2 or 3, for such a dX). Similarly, for anti-cross-validation, there is a two-thirds chance of choosing h2. Now since q must lie outside of dX, for such a dX, E[(y_H − 2)² | m] is smaller if y_H is given by h2 than if it is given by h1. So anti-cross-validation does better if dX ∈ {1, 2, ..., n/2}. For analogous reasons, anti-cross-validation also does better if dX ∈ {1 + n/2, ..., n}. So anti-cross-validation wins regardless of dX. QED.

Example 8: As an example of where cross-validation beats anti-cross-validation, consider the same scenario as in the preceding example, only with different h1 and h2. Have h1 = 2 for all x, and h2 = 1 for all x. Cross-validation is more likely to guess h1, and anti-cross-validation is more likely to guess h2, regardless of dX. However h1 has a better E{[y_H − E(y_F | m)]² | m} than does h2, again, for all dX. QED.

Note the important feature of the preceding two examples that whether cross-validation works (in comparison to anti-cross-validation) depends
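Example 7's conclusion can be confirmed by direct enumeration. In the sketch below (n = 6 is an arbitrary even choice; by the symmetry argument, dY is taken uniform on Y, and cost is E[(y_H − 2)² | m] over off-training-set q), anti-cross-validation attains the lower average cost:

```python
n = 6
X = list(range(1, n + 1))
Y = [1, 2, 3]
h1 = {x: 2 if x <= n // 2 else 1 for x in X}
h2 = {x: 1 if x <= n // 2 else 2 for x in X}

def ots_quad_cost(h, dx):
    # E[(y_H - 2)^2] over off-training-set q; 2 = E(y_F | m) for uniform P(f)
    qs = [q for q in X if q != dx]
    return sum((h[q] - 2) ** 2 for q in qs) / len(qs)

def avg_cost(strategy):
    total = 0.0
    for dx in X:                 # uniform sampling of the single training input
        for dy in Y:             # uniform P(f) makes each dy equally likely
            e1, e2 = (h1[dx] - dy) ** 2, (h2[dx] - dy) ** 2
            if e1 == e2:         # tie (never occurs here, but kept for safety)
                cost = (ots_quad_cost(h1, dx) + ots_quad_cost(h2, dx)) / 2
            elif strategy == 'cv':
                cost = ots_quad_cost(h1 if e1 < e2 else h2, dx)
            else:                # anti-cross-validation: pick the worse fit
                cost = ots_quad_cost(h1 if e1 > e2 else h2, dx)
            total += cost
    return total / (len(X) * len(Y))

print(avg_cost('anticv') < avg_cost('cv'))   # True: anti-CV wins here
```

With these particular h1 and h2 the averages come out to 7/15 for anti-cross-validation versus 8/15 for cross-validation.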
crucially on what learning algorithms you are using it to choose among. This is another illustration of the fact that assuming that cross-validation works is not simply making an assumption concerning the target, but rather making an assumption about the relationship between the target and the learning algorithms at hand. Intuitively, if one uniformly averages over all h1 and h2, there is no statistical correlation between the training set and the cost, regardless of whether one uses algorithm A or algorithm B. When the loss function is homogeneous, this lack of correlation results in NFL theorems. When the loss function is not homogeneous, it instead gives the results of this section.
6 Conclusions and Future Work .These two papers in1,estigate some of the behavior of OTS (off-training set) error. In particular, they formalize and investigate the concept that “if you make no assumptions concerning the target, then you have no a5surances about how well you generalize.” As stated in this general manner, this no-assurances concept is rather intuitive and many researchers wo~ildagree with it readily. However, the details of what ”no assurances” means are not so obvious. For example, for the conventional loss function, noise process etc. studied in computational learning theory, there are indeed no generalization assurances associated with averages. [As a result, if we know that (for example) averaged over some set of targets, F , the generalizer CART is superior to some canonical neural net scheme, then we also know that averaged over targets not contained in F , the neural net scheme is superior to CART.] On the other hand, there may be some assurances associated with nonaveraging criteria, like head-to-head minimax criteria. In addition, for certain other noise processes and/or certain loss functions one c m , a priori, say that one algorithm is superior to another, even in terms of ai’erages. More generally, for all its reasonableness when stated in the abstract, the full implications of the no-assurances concept for applied supervised learning can be surprising. For example, it implies that there are “as many” targets (or priors over targets) in which any algorithm performs i ( ~ w s trl i m rmdonz as there are for which it performs better than random (whether one conditions on the training set, on the training set size, on the target, or what have you). In particular it implies that cross-validation fails as readily as it succeeds, boosting makes things worse as readily as better, active learning and/or algorithms that can choose not to make a guess fail as readily as they succeed, etc. 
All such implications should be kept in mind when encountering quotes like those at the beginning of the introduction of paper one, which taken at face value imply that,
Existence of Distinctions Between Learning Algorithms
even without making any assumptions about the target, one can have assurances that one's learning algorithm generalizes well. In addition to these kinds of issues, in which the generalizer is fixed and one is varying the target, this second paper also discusses scenarios where the generalizer can vary but the target is fixed. For example, this second paper uses such analysis to show that if one averages over generalizers, cross-validation is no better than "anti"-cross-validation, regardless of the target. In this sense, cross-validation cannot be justified as a Bayesian procedure; no prior over targets justifies its use for all generalizers. There are many avenues for future research on the topic of OTS error. Restrictions like homogeneous noise, vertical likelihoods, fixed (as opposed to random variable) sampling distributions, etc., are all imposed to "mod out" certain "biasing" effects in the mathematics. Future work investigates the ramifications of relaxing these conditions, to precisely quantify the associated effects. As an example, local noise in the observed X values will have the effect of "smoothing" or "smearing" the target. With such noise, the likelihood is no longer vertical, and information from the training examples does, a priori, leak into those off-training-set x that lie near dX. So one does not get the NFL theorems in general when there is such noise. However, one can imagine restricting attention to those learning algorithms that "respect" the degree of smoothness that noise in X imposes. Presumably NFL-like results hold within the set of those algorithms, but none of this has been worked out in any detail. More generally, future research addresses the following issues: Does smoothing help techniques like cross-validation? If so, can prior knowledge concerning a target's smoothness be used to "fine-tune" the technique of cross-validation?
How could smoothness be defined for categorical spaces (for which, empirically, cross-validation works well)? Some other avenues of future research involving OTS error are mentioned in Knill et al. (1994). However, even if one restricts attention to the limited NFL aspect of OTS error, and even for the restrictions imposed in this pair of papers (e.g., vertical likelihood), there are still many issues to be explored: 1. Characterize when cross-validation beats anti-cross-validation for nonhomogeneous loss functions.
2. Investigate distributions of the form P(c1, c2 | . . . ), where ci is the error for algorithm i, when there are more than two possible values of the loss function. 3. Investigate the validity of the potential "head-to-head minimax" justification for cross-validation, both for the scenario discussed in this paper, and also for nonhomogeneous loss functions and/or empirical-loss-conditioned distributions, and/or averaging over hypotheses rather than targets.
David H. Wolpert
4. Investigate whether there are scenarios in which one generalizer is head-to-head minimax-superior to the majority of other generalizers. Investigate whether there are "head-to-head minimax superior cycles," in which A is superior to B is . . . superior to A, and if so characterize those cycles. 5. Investigate for what distributions over hypotheses h sums over hs with the target f fixed result in the majority algorithm beating the antimajority algorithm. 6. Investigate sums over hs with f fixed when f and/or h is not single-valued [note, for example, that if f(q, y) = 1/r for all q and y, then all hs have the same value of ∑_{q,yH,yF} f(q, yF) h(q, yH) L(yH, yF) if L(·, ·) is homogeneous]. 7. Investigate sums over hs where f too is allowed to vary. 8. Investigate under what circumstances E(C_IID | m) < E(C_OTS | m). 9. Find the set of P(f)s such that all learning algorithms are equivalent, as determined by P(c | m). There are other P(f)s besides the uniform one for which this is so. An example given in the text is that for fixed homogeneous noise, the distribution that is uniform over φs results in NFL. Find the distributions G(α) over priors α such that all algorithms have the same average according to G(α) of P(c | α, m). For any pair of algorithms, characterize the distributions G(α) for which the algorithms have the same average according to G(α) of P(c | α, m). 10. Find the set of P(f)s for which two particular algorithms are equivalent according to P(c | m). Find the distributions T(f) such that two particular algorithms have the same average [according to T(f)] of P(c | f, m). 11. Investigate the empirical-loss-conditioned distributions and punt-signal-conditioned distributions mentioned as "beyond the scope of this paper" in Section 5 of paper one. 12. Investigate what priors P(f) have E(C_OTS | no punt, m) ≤ E(C_OTS | punt, m). 13.
Investigate if and how results change if one conditions on a value of m' rather than m. 14. Carry through the analysis for error C' rather than C. 15. Investigate what set of conditions gives NFL results for any algorithm in an equivalence class of algorithms that are scrambled versions of one another (as opposed to NFL results for all algorithms). As an example, it seems plausible that any P(φ) for which each φ(x) is determined by IID sampling some distribution R(y), i.e., any P(φ) = Π_x R[φ(x)], results in NFL-style equivalence between all algorithms in any scramble-based equivalence class of algorithms. [The restricting to the equivalence class is necessary since an algorithm with a disposition to guess argmax[R(y)] will
likely outperform one that tries to guess argmin[R(y)].] Similarly, it seems plausible that if one keeps f fixed but averages over all bijections of X to itself, i.e., if one averages over all encodings of the inputs, then all algorithms in such an equivalence class have the same generalization performance. (This latter result would constitute a sort of NFL-for-feature-representations result.) 16. Extend the work in these papers to costs not expressible via loss functions, unsupervised learning, prediction of an algorithm's error, and prediction of the Bayes error.
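The claim discussed earlier, that averaged over hypotheses (with the target fixed) cross-validation ties "anti"-cross-validation, can be checked exhaustively on a toy problem. The construction below is entirely our own illustration, not from the paper: a five-point input space, two training inputs, one validation input, and two off-training-set inputs, with the average taken over all binary hypothesis pairs.

```python
from itertools import product

# Toy check: with the target f FIXED, averaging over all hypothesis pairs,
# cross-validation and anti-cross-validation achieve the same OTS error.
# X = {0,...,4}; training inputs {0,1}, validation input {2}, OTS inputs {3,4}.
f = (0, 1, 1, 0, 1)  # an arbitrary fixed binary target (our choice)

def ots_error(h):
    """Zero-one off-training-set error of hypothesis h against target f."""
    return sum(h[x] != f[x] for x in (3, 4)) / 2

hyps = list(product([0, 1], repeat=5))  # all 32 binary hypotheses on X
cv_total = anti_total = pairs = 0
for h1, h2 in product(hyps, repeat=2):
    e1, e2 = (h1[2] != f[2]), (h2[2] != f[2])  # validation errors
    if e1 == e2:
        continue  # validation cannot distinguish the pair; skip ties
    best, worst = (h1, h2) if e1 < e2 else (h2, h1)
    cv_total += ots_error(best)     # CV keeps the validation winner
    anti_total += ots_error(worst)  # anti-CV keeps the validation loser
    pairs += 1

print(cv_total / pairs, anti_total / pairs)  # both come out to exactly 0.5
```

The tie arises because, once behavior on the validation point is fixed, the off-training-set bits of a hypothesis still range uniformly over all possibilities, so the validation winner's group and the validation loser's group have identical average OTS error.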
Appendix A. Miscellaneous Proofs

Proof of Lemma (3). Expand the sum in question as
∑_{h(x∉dX), q, yH, yF} δ[c, L(yH, yF)] P(q | d) f(q, yF) h(q, yH)
Since P(q | d) = 0 for q ∈ dX, this sum is independent of the values of h(x ∈ dX). Accordingly, up to an overall constant, we can rewrite it as
∑_{h, q, yH, yF} δ[c, L(yH, yF)] P(q | d) f(q, yF) h(q, yH)
We can use the reasoning of the NFL theorems to calculate this. Intuitively, all we need do is note by Lemma (1) (see paper one) that our sum equals ∫ df P(c | f, d) if one interchanges f and h, and then invoke Theorem (1) of paper one. More formally, follow the reasoning presented after Lemma (1), only applying it to our distribution rather than to ∫ df P(c | f, d). The result is Lemma (3) [rather than Theorem (1)]. QED.

Proof of Theorem (11). Expand our sums as
Consider any set of values of h1(x ∈ dX), h2(x ∈ dX) such that the strategy at hand picks h1. For such a set of values, our inner sum becomes (up to an overall proportionality constant determined by m') ∑_{h1(x∉dX)} P(c | f, h1, d), with obvious notation. By Lemma (3), this is simply some function X(c), independent of f and d. The same is true for those sets of values of h1(x ∈ dX), h2(x ∈ dX) such that the strategy at hand picks h2. Therefore, up to overall constants, ∑_{h1,h2} P(c | h1, h2, d) is just X(c). This is true regardless of which strategy we use. QED.
Proof of Theorem (16). For a uniform P(f), P(yH | m) ∝ ∑_{q,d} ∫ df dh h(q, yH) P(h | d) P(q | d) P(d | f). The integral over f works only on the P(d | f). For the likelihood at hand, it is some function func(m, m'), independent of d. Therefore we have
Appendix B. Proof of Theorems (13) and (14)

Let H_φ(ε) indicate the set of hs such that E(C_IID | h, φ) ≤ ε. This appendix analyzes the sign of ∑_{h1,h2∈H_φ(ε)} [E(C | h1, h2, φ, d, A) − E(C | h1, h2, φ, d, B)] for OTS error. In particular, it shows that this quantity is necessarily positive for zero-one loss for any ε < 1. We can express H_φ(ε) as the set of hs such that
∑_{q∉dX} L[h(q), φ(q)] π(q) + ∑_{q∈dX} L[h(q), φ(q)] π(q) ≤ ε    (B.1)
It is useful to introduce the following notation. Define S(h, φ, dX) = ∑_{q∈dX} L[h(q), φ(q)] π(q). Indicate a value of S(h1, φ, dX) as s1, and a value of S(h2, φ, dX) as s2. By B.1, the set of h1 under consideration is those for which ε ≥ s1, and such that for OTS error C,

E(C | φ, h1, d) ≤ {ε − s1}/π(X − dX)    (B.2)

and similarly for E(C | φ, h2, d). [The expression "π(S)" is shorthand for ∑_{x∈S} π(x).]
We are interested in ∑_{H1,H2∈H_φ(ε)} E(C | h1, h2, φ, d, i), where i is either A or B. As usual, expand the sum into an outer and inner sum, getting
where the restriction that both h1 and h2 lie in H_φ(ε) is implicit. Consider the inner sum, for any particular set of values of h1(x ∈ dX) and h2(x ∈ dX). Without loss of generality, say that for the d at hand and the values of h1(x ∈ dX) and h2(x ∈ dX), strategy A picks h1. This means that s1 ≤ s2. Examine the case where s1 is strictly less than s2. Using B.2, the inner sum over h1(x ∉ dX) is over all h1(x ∉ dX) such that E(C | φ, h1, d) ≤ {ε − s1}/π(X − dX), and the inner sum over h2(x ∉ dX) is over all h2(x ∉ dX) such that E(C | φ, h2, d) ≤ {ε − s2}/π(X − dX). Since s1 < s2, this means that the hs going into the sum over h1(x ∉ dX) are a proper superset of the hs going into the sum over h2(x ∉ dX). In addition, for all the hs that are in the h1 sum but not in the h2 sum, E(C | φ, h, d) ≥ 0. [In fact, those hs obey {ε − s2}/π(X − dX) < E(C | φ, h, d) ≤
{ε − s1}/π(X − dX).] Therefore so long as the set H*(s1, s2) of hs in the h1 sum but not in the h2 sum is not empty, the h1 sum is larger than the h2 sum. This means that for all cases where s1 is strictly less than s2 and H*(s1, s2) is nonempty, ∑_{H1(x∉dX),H2(x∉dX)} E(C | h1, h2, φ, d, i) is larger for algorithm A than for algorithm B. If s1 = s2, then ∑_{H1(x∉dX),H2(x∉dX)} E(C | h1, h2, φ, d, i) is the same for both algorithms. So if there are any h1 and h2 in the sum such that the associated H*(s1, s2) is nonempty, it follows that ∑_{H1,H2∈H_φ(ε)} E(C | h1, h2, φ, d, i) is larger for algorithm A than for algorithm B. If there are no such h1 and h2, ∑_{H1,H2∈H_φ(ε)} E(C | h1, h2, φ, d, i) is the same for both algorithms. This establishes Theorem (13). To understand when there are any h1 and h2 such that the associated H*(s1, s2) is nonempty [so that we get a strict inequality in Theorem (13)], make the definition s_min(dX, φ) = ∑_{q∈dX} π(q) min_{yH}{L[yH, φ(q)]}, and similarly for s_max(dX, φ). Then what we need is for there to be an h, h*, such that
{ε − s_max(dX, φ)}/π(X − dX) < E(C | φ, h*, d) ≤ {ε − s_min(dX, φ)}/π(X − dX)

In particular, for zero-one loss and single-valued φ, s_min(dX, φ) = 0 and s_max(dX, φ) = π(dX). So we need an h* such that {ε − π(dX)}/π(X − dX) < E(C | φ, h*, d) ≤ ε/π(X − dX). For any ε < 1, there is such an h*. This establishes Theorem (14).
Appendix C. Proof of Theorem (15)

We need to prove that for a vertical likelihood and OTS error, for uniform P(f), all learning algorithms with the same P(yH | m) have the same P(c | m). First write P(c | m) = ∑_{yH} P(c | yH, m) P(yH | m) [where from now on the uniformity of P(f) is implicit]. If P(c | yH, m) is independent of the learning algorithm, the proof will be complete. So expand P(c | yH, m) = ∑_{q,d} P(q, d | yH) P(c | yH, q, d). I will show that P(c | yH, q, d) is a fixed function of c and yH that is independent of q and d. This in turn means that P(c | yH, m) is the same fixed function of c and yH, which is exactly the result we need. Write P(c | yH, q, d) = ∫ df ∑_{yF} δ[c, L(yH, yF)] f(q, yF) P(f | yH, q, d). Next use Bayes' rule to rewrite P(f | yH, q, d) as P(yH | f, q, d) P(f | q, d)/P(yH | q, d).
Now write P(yH | q, d) = ∑_h h(q, yH) P(h | d). But P(yH | f, q, d) is given by the same sum. Therefore P(f | yH, q, d) = P(f | q, d). Using Bayes' theorem again, P(f | q, d) = P(q | d, f) P(d, f)/[P(q | d) P(d)] = P(d | f) P(f)/P(d) ∝ P(d | f)/P(d) [since we have a uniform P(f)]. So combining everything, up to an overall proportionality constant,

P(c | yH, q, d) = ∑_{yF} δ[c, L(yH, yF)] ∫ df f(q, yF) P(d | f)/P(d)
Now recall that P(c | yH, q, d) only ever occurs after being multiplied by P(q, d | yH). For OTS error, P(q, d | yH) = 0 unless q ∈ X − dX. Therefore in evaluating P(c | yH, q, d) we can assume that q ∈ X − dX. Accordingly, the term f(q, yF) depends only on the values of f(x ∉ dX). Since we have a vertical likelihood, the P(d | f) term depends only on f(x ∈ dX). Accordingly, we can write

P(c | yH, q, d) ∝ ∑_{yF} δ[c, L(yH, yF)] ∫ df(x ∉ dX) f(q, yF) × ∫ df(x ∈ dX) P(d | f)/P(d)
Up to overall proportionality constants, we can now replace both integrals with ∫ df. By the same reasoning used in the proof of Theorem (1), the first integral is a constant. If we reintroduce the constant P(f) into the remaining integral, we get ∫ df P(d | f) P(f)/P(d) = 1. Therefore, P(c | yH, q, d) = ∑_{yF} δ[c, L(yH, yF)]. [Note that for homogeneous L(·, ·), this is independent of yH.] As needed, this is independent of q and d. QED.
Acknowledgments I would like to thank Cullen Schaffer, Wray Buntine, Manny Knill, Tal Grossman, Bob Holte, Tom Dietterich, Mark Plutowski, Karl Pfleger, Joerg Lemm, Bill Macready, and Jeff Jackson for interesting discussions. This work was supported in part by the Santa Fe Institute and by TXN Inc.
References
Knill, M., Grossman, T., and Wolpert, D. 1994. Off-training-set error for the Gibbs and the Bayes optimal generalizers. Submitted.
Perrone, M. 1993. Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization. Ph.D. thesis, Brown Univ., Physics Dept.
Wolpert, D. 1992. On the connection between in-sample testing and generalization error. Complex Syst. 6, 47-94.
Wolpert, D. 1994a. The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework. In The Mathematics of Generalization, D. Wolpert, ed. Addison-Wesley, Reading, MA.
Wolpert, M., Grossman, T., and Knill, E. 1995. Off-training-set error for the Gibbs and the Bayes optimal generalizers. Submitted.
Received August 18, 1995; accepted February 14, 1996
Communicated by Steven Nowlan
NOTE
No Free Lunch for Cross-Validation
Huaiyu Zhu
Richard Rohwer
Neural Computing Research Group, Department of Computer Science and Applied Mathematics, Aston University, Birmingham B4 7ET, UK
It is known theoretically that an algorithm cannot be good for an arbitrary prior. We show that in practical terms this also applies to the technique of "cross-validation," which has been widely regarded as defying this general rule. Numerical examples are analyzed in detail. Their implications for research on learning algorithms are discussed.

1 Introduction
Recently, practical implications of the so-called "No Free Lunch" (NFL) theorems by Wolpert and Macready (1995) have become a contentious issue among neural network researchers. For our purpose, the NFL theorems can be summarized as follows. With a uniform prior, any algorithm performs as well as random guessing. With a "uniform hyperprior" over priors, any algorithm performs as well as random guessing. The main implication is that if an algorithm performs better than random guessing on some prior, then it necessarily performs worse than random guessing on some other prior, by the same amount. Therefore it is meaningless to say an algorithm is good without specifying the prior. On the other hand, it is widely believed that some frequentist statistical techniques, such as cross-validation (CV) and bootstrap, are universally good. More specifically, it is often claimed that they will automatically discover useful structure in the problem, if there is any, and will be harmless otherwise. Although this has been theoretically shown not to be the case (Wolpert and Macready 1995), we consider it still of great instructive value to see why and how CV fails in a numerical experiment.

2 Experiment and Short Analysis
Suppose we have a gaussian variable x, with mean μ and unit variance. Neural Computation 8, 1421-1426 (1996) © 1996 Massachusetts Institute of Technology
Huaiyu Zhu and Richard Rohwer
We have the following three estimators for estimating μ from a sample of size n:

- A: The sample mean. It is optimal both in the sense of maximum likelihood and least mean squares.
- B: The maximum of the sample. It is a bad estimator in any reasonable sense.
- C: Cross-validation to choose between A and B, with one extra data point.

The numerical result with n = 16, averaged over 10,000 samples, gives mean squared errors:

A: 0.0627    B: 3.4418    C: 0.5616    (2.1)
This clearly shows that CV is harmful in this case, despite the fact that it is based on a larger sample. To alleviate concern that the verdict might be caused by statistical fluctuation, we also repeated the experiment with 10^7 samples, which gives three significant digits. The result is

A: 0.0625    B: 3.4137    C: 0.5754    (2.2)

Note that the theoretical mean squared error for A is 1/16 = 0.0625. It might at first appear that this is a very artificial example, which is not what normally occurs in practice. To this we have two answers, short and long. The short answer is from first principles. Any counterexample, however artificial it is, clearly demolishes the hope that cross-validation is a "universally beneficial method."
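The experiment is straightforward to reproduce. A minimal sketch follows; the variable names and the random seed are our own choices, since the paper does not specify its random number generator:

```python
import numpy as np

# Estimate the mean mu of a unit-variance gaussian from n = 16 points using:
#   A: the sample mean,  B: the sample maximum,
#   C: cross-validation between A and B on one extra data point.
rng = np.random.default_rng(0)
mu, n, trials = 0.0, 16, 10_000

sq_err = {"A": 0.0, "B": 0.0, "C": 0.0}
for _ in range(trials):
    sample = rng.normal(mu, 1.0, size=n)
    a, b = sample.mean(), sample.max()
    v = rng.normal(mu, 1.0)                   # the single validation point
    c = a if abs(v - a) <= abs(v - b) else b  # C keeps the estimate closer to v
    for name, est in (("A", a), ("B", b), ("C", c)):
        sq_err[name] += (est - mu) ** 2

for name in "ABC":
    print(name, sq_err[name] / trials)  # roughly 0.06, 3.4, and 0.56
```

With squared loss on the validation point, choosing the smaller of (v − a)² and (v − b)² is the same as keeping the estimate closer to v, which is how C is implemented here.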
3 Full Analysis and Further Experiments
The longer answer is divided into several parts, which hopefully will answer any potential criticism of any aspect:

1. The cross-validation is performed on extra data points. We are not requiring it to perform as well as the mean on 17 data points. If it cannot extract more information from the one extra data point, a minimum requirement is that it keeps the information in the original 16 points. But it cannot even do this.
2. Denote by T_q the qth percentile of the sample. Then the maximum of a sample is T_100. The median is T_50, which is in fact a quite reasonable estimator. Let us use a larger cross-validation set (of size k), and replace estimator B with a different percentile. The result is that for CV to work, it needs k > 2 for the median, and k ≥ 16 for T_75.
Table 1: Typical Run Showing the Role of Validation Set Size

                     Number of samples in 10,000 samples
Validation size k    C prefers B    B is genuinely better    C picks B correctly
 1                   1989           26                       13
 2                   1380           26                       15
 3                    952           26                       13
 5                    601           26                       18
 8                    362           26                       16
13                    239           26                       15
21                    139           26                       13
3. It is not true that we have set up a case in which cross-validation cannot win because T_100 is at the boundary of the interval spanned by the sample. There is indeed a small probability that a sample can be so bad that the sample maximum is even a better estimate than the sample mean. However, to utilize such rare chances to good effect k must be at least several hundred (maybe exponential) while n = 16. We know that such a k exists since k = ∞ certainly helps. However, to adopt such a method is clearly absurd.

4. The reason cross-validation fails in this particular case is easy to see. It is true that C will choose A for most of the time, as it should, but whenever it chooses B, it is most likely for the wrong reason. For the case at hand, in 10,000 samples, about 30 are such that B is better than A. However, based on one extra point, C will prefer B in about 2000 cases. Among them, only about 15 samples are for the right reason, that B is genuinely better. For the remainder of the 1985 samples the reason is that the validation point happens to be closer to B. Furthermore, among the 30 worthy cases it fails to pick half of them because the validation point happens to be on the wrong side of the sample mean. As the cross-validation size increases, the chance of picking B for the wrong reason decreases. Table 1 is a typical run showing the role of validation set size.

5. We have chosen estimator A to be the known optimal estimator in this case to make the mechanism easier to understand, but it can be replaced by something else. For example, both A and B can be some reasonable averages over percentiles, such as A = (T_40 + T_60)/2 and B = (T_10 + T_50 + T_90)/3, so that without detailed analysis it is hard to see which is better. It may appear that cross-validation would generate a C better than both A and B. Such beliefs can be defeated by similar counterexamples.
In most cases C will have a performance intermediate between A and B, unless the cross-validation set is enormously larger than the original set.
6. The above may give the wrong impression that it is impossible to mix several estimators to get a better estimator, which is not true. If we have a sample of size 101, and sort it in increasing order, then each data point is a percentile, from T_0 to T_100. Among these 101 different estimators, the optimal is the median T_50. A simple arithmetic average of these estimators gives the sample mean, which is a better estimator than the best of the estimators it is based on, even without using any fresh data.

7. The above scheme of cross-validation may appear different from what is familiar, but here is a "practical example" showing that it is indeed what people normally do. Suppose we have a random variable that is either gaussian or Cauchy. Consider the following three estimators (see Fisher (1925) for the definition and properties of efficiency):

- A: The sample mean: it has 100% efficiency for gaussian, and 0% efficiency for Cauchy.
- B: The sample median: it has 2/π = 63.66% efficiency for gaussian and 8/π² = 81.06% efficiency for Cauchy.
- C: Cross-validation on an additional sample of size k, to choose between A and B.
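The premise behind these entries, that the sample mean collapses for Cauchy data while the median holds up, can be illustrated directly. The following is a sketch with our own setup; we compare median absolute errors rather than mean squared errors because the latter does not even exist for Cauchy samples:

```python
import numpy as np

# Location estimation for standard Cauchy data (true location 0, scale 1):
# the sample mean of n Cauchy points is itself standard Cauchy, so it never
# improves with n, whereas the sample median concentrates around 0.
rng = np.random.default_rng(2)
n, trials = 16, 10_000

mean_err = np.empty(trials)
median_err = np.empty(trials)
for i in range(trials):
    x = rng.standard_cauchy(size=n)
    mean_err[i] = x.mean()        # error of estimator A (true location is 0)
    median_err[i] = np.median(x)  # error of estimator B

print(np.median(np.abs(mean_err)), np.median(np.abs(median_err)))
```

The first number stays near 1 no matter how large n is, while the second shrinks with n, which is the sense in which the mean is "essentially useless" for Cauchy data.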
Intuitively it may appear reasonable to expect cross-validation to pick out the correct one, most of the time, so that averaging over all samples C ought to be superior to both A and B. But this is not so, since this will depend on the prior mixing probability of these two submodels. If the variable is in fact always gaussian, then we have just seen that if n = 16, CV will be worse than sticking with A unless k > 2. The same is even more true in the reversed order, since the mean is an essentially useless estimator for the Cauchy distribution. This also shows that averaging among estimators is not always a good thing to do.

8. In many application problems CV is performed on a continuous hyperparameter instead of a choice between discrete alternatives. For an example of this type consider the t distribution, which connects the Cauchy and gaussian distributions by varying the "hyperparameter," the degrees of freedom m. Suppose we have obtained quite good estimators for each integer m; how can we obtain a good estimator if we do not know m? Cross-validation can be either good or bad, depending on the prior mixing distribution among all the t-distributions. However, if we had known that, we would be better off using Bayesian methods, which might turn out to be CV, but in this case it is highly likely to be some other estimator far better than CV.

9. These examples also cover the case of "leave-one-out" cross-validation, where k = 1, and exactly n samples are involved, instead of
10,000 as we have done. These samples are not independent, so the fluctuation will be bigger.

10. In any of the above cases, "anti-cross-validation," an ad hoc term used to denote choosing C to be equal to either A or B according to which one performs worse on the CV set, would be even more disastrous. This, however, in no way promotes the use of CV, since these are just two methods among infinitely many others.

11. The above arguments are within the framework of frequentist statistics, but they nevertheless reveal the essential role a prior must play in a theory of learning algorithms. On the other hand, if one starts from a Bayesian framework, then due to the coherence of Bayesian theory, there is no need to revert back to the frequentist framework (Zhu and Rohwer 1995).

4 Discussion

It is well accepted that there exists prior knowledge in practical problems, including smoothness, symmetry, positive correlation, iid samples, etc. These are indeed the implicit assumptions behind most learning algorithms. NFL tells us that if our algorithm is designed for such a prior, then we should say so explicitly so that a user can decide whether to use it. We cannot expect it to be also good for any other prior that we have not considered. In fact, in a sense, we should expect it to perform worse than a purely random algorithm on those other priors. Explicit treatment of smoothness priors in practical regression problems was studied in Zhu and Rohwer (1996). The power of the NFL is in a sense related to the fact that if the grand total is bound to be zero, then it is impossible to make every term positive. The apparent existence of learning rules that work well on examples without the explicit requirement that these examples come from a nonuniform prior does not in any way contradict the NFL theorems, just as the apparent existence of machines doing useful work without an obviously visible source of energy does not in any way contradict the principle that perpetual motion is impossible.
As early as 1775, the Parisian Academy of Science decided that they would no longer examine any invention of "perpetual motion machines," on the grounds that the law of energy conservation is so reliable that it will defeat any such attempt (Ord-Hume 1977). Such a decision helped to direct human talent into the realizable effort of designing machines that utilize the energy in fuel. Should we expect the same fate for the "universally beneficial methods" in the face of NFL, and put more effort into designing methods that are good for particular priors encountered frequently in practice? These general principles might not appear to be of obvious interest to a user, but they are of fundamental importance to a researcher. They are
in fact also of fundamental importance to a user, as he must assume the responsibility of supplying the energy source, or specifying the prior.
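The "grand total is bound to be zero" bookkeeping can be made concrete by exhaustive enumeration. In the toy setup below, which is entirely our own construction, the input space has three points, two of which are seen in training, and performance is measured off-training-set at the third; under a uniform prior over the eight possible binary targets, every deterministic learner averages exactly chance:

```python
from itertools import product

def avg_ots_error(algorithm):
    """Average zero-one OTS error over ALL targets f: {0,1,2} -> {0,1},
    when training reveals f(0) and f(1) and the query point is x = 2."""
    errors = []
    for f in product([0, 1], repeat=3):    # all 8 possible targets
        guess = algorithm(f[0], f[1])      # the learner sees only training labels
        errors.append(int(guess != f[2]))  # zero-one loss at the OTS point
    return sum(errors) / len(errors)

follow = lambda y0, y1: int(y0 + y1 >= 1)      # guess 1 if any training label was 1
oppose = lambda y0, y1: 1 - int(y0 + y1 >= 1)  # the opposite policy

print(avg_ots_error(follow), avg_ots_error(oppose))  # both exactly 0.5
```

For each pair of training labels the guess is fixed while f(2) takes both values across the targets, so every gain on one target is exactly cancelled on another; this is the zero-sum structure the text compares to energy conservation.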
Acknowledgment This work was partially supported by EPSRC Grant GR/J17814.
References

Fisher, R. A. 1925. Theory of statistical estimation. Proc. Camb. Phil. Soc. 22, 700-725.
Ord-Hume, A. W. J. G. 1977. Perpetual Motion: The History of an Obsession. George Allen & Unwin, London.
Wolpert, D. H., and Macready, W. G. 1995. No Free Lunch Theorems for Search. Tech. Rep. SFI-TR-95-02-010, The Santa Fe Institute.
Zhu, H., and Rohwer, R. 1995. Bayesian invariant measurements of generalisation. Neural Proc. Lett. 2(6), 28-31.
Zhu, H., and Rohwer, R. 1996. Bayesian regression filters and the issue of priors. Neural Comp. Appl. 4, 130-142.

Received January 4, 1996; accepted February 14, 1996
Communicated by Steven Zucker
A Self-organizing Model of "Color Blob" Formation

Harry G. Barrow, Alistair J. Bray, Julian M. L. Budd
School of Cognitive and Computing Sciences, University of Sussex, Falmer, Brighton, East Sussex BN1 9QH, UK

This paper explores the possibility that the formation of color blobs in primate striate cortex can be partly explained through the process of activity-based self-organization. We present a simulation of a highly simplified model of visual processing along the parvocellular pathway that combines precortical color processing, excitatory and inhibitory cortical interactions, and Hebbian learning. The model self-organizes in response to natural color images and develops islands of unoriented, color-selective cells within a sea of contrast-sensitive, orientation-selective cells. By way of understanding this topography, a principal component analysis of the color inputs presented to the network reveals that the optimal linear coding of these inputs keeps color information and contrast information separate.

1 Introduction
Cytochrome oxidase (CO), an endogenous metabolic marker (see Wong-Riley 1994), produces a characteristic patchy or "blob-like" pattern of staining in the deep and especially the superficial layers of primate area V1 (Wong-Riley 1979; Carroll and Wong-Riley 1982; Horton and Hubel 1981; Hendrickson et al. 1981; Horton 1984; Livingstone and Hubel 1984). The CO blobs mark an important type of functional segregation in the monkey primary visual cortex. Cells within CO blobs tend to have concentric, monocular, low spatial frequency, color-selective receptive field properties, but cells outside ("interblob" region) generally have complex or hypercomplex, oriented, binocular, high spatial frequency, broadband-selective responses (Livingstone and Hubel 1984; Ts'o and Gilbert 1988). At the blob-interblob boundaries, however, neurons appear to have mixed responses with complex, oriented, and color-selective receptive fields (Ts'o and Gilbert 1988). Together with the observed smooth variation in the density of cytochrome oxidase staining (Trusk et al. 1990) and the uniformity of dendritic field size in blob and interblob regions (Malach 1992), this mixed response suggests a continuum of receptive field properties rather than a discrete segregation of function. The CO Neural Computation 8, 1427-1448 (1996) © 1996 Massachusetts Institute of Technology
blobs, similar in size to physiologically identified "color columns" (Hubel and Wiesel 1968; Gouras 1974; Michael 1981), therefore represent vertically aligned groups of cells with color-selective receptive field properties. The precise nature of CO blob cell receptive fields is, however, much less clear. First, cells within a single CO blob may code for only one color contrast, that is, red-green (R-G), blue-yellow (B-Y), broad-band opponency (e.g., Ts'o and Gilbert 1988), or have only mixed color opponency, for example, some cells R-G and others B-Y opponent (e.g., Livingstone and Hubel 1984). Second, many CO blob cells were originally thought to have "double opponent" properties, cells with center-surround spatial structure but with color opponency in each subfield (Livingstone and Hubel 1983). It is now believed that double opponent cells are rare, as most cells initially classified as double opponent have broad-band surrounds, so-called "modified Type II cells" (for a discussion see Ts'o and Gilbert 1988). In fact, most of the cells within CO blobs have receptive field types similar to those present in the geniculate (cf. Livingstone and Hubel 1984; Wiesel and Hubel 1966). How do CO blobs develop? Current evidence supports the view that the formation of CO blobs is, to some degree, genetically predetermined. First, in the macaque monkey, CO blobs appear many weeks before birth (Horton 1984). Second, there appear to be differences in the type of geniculate input terminating in CO blob compared to interblob regions (Lachica et al. 1992; Nealy and Maunsell 1994). Third, normal CO blob development in the macaque monkey is apparently unaffected by the absence of visual stimulation when both eyes are removed prenatally (Kuljis and Rakic 1990). Last, the mean size of a CO blob (relative to the growing cortex) changes little during maturation (Purves and Lamantia 1993). The size of CO blobs, however, can be altered by the level of neural activity postnatally.
The removal or blockade of the retinal impulses from one eye in the adult macaque monkey, for example, leads to a considerable shrinkage in CO blob size in the deprived-eye column (Horton 1984; Wong-Riley and Carroll 1984; Trusk et al. 1990). So while the CO blobs may form without visual stimulation, their postnatal plasticity suggests that the maintenance of CO blobs may depend, at least partly, on activity-based mechanisms. But even if one assumes that CO blobs are genetically predetermined, it is far from clear if the receptive field properties within and between the CO blobs can develop without visual stimulation. So far, no empirical work has compared the codevelopment of CO blobs with the functional properties of neurons. The possibility therefore exists that, like other forms of functional segregation in the central nervous system, activity-dependent self-organizing mechanisms may play some part in the development of color columns. In this paper, we explore this possibility by training an unsupervised neural network model of the early visual pathway on inputs from natural color images. We find that color-selective, blob-like formations develop without any initial bias, through self-organization alone. Some understanding of the behavior of this network is achieved through analysis of the statistics of natural images. A preliminary account of this work has been presented elsewhere (Barrow and Bray 1992).

2 A Network Model
In this section we outline a simple network model of processing in the early visual pathway. In combining excitatory and inhibitory cortical interactions with Hebbian-type learning to generate a topographic mapping, the model falls within an approach that has yielded a degree of success. Early work by von der Malsburg demonstrated such a combination was sufficient for learning orientation selectivity with simple stimuli (von der Malsburg 1973). More recently, works by others such as Miller et al. (1989), Miller (1990, 1994), and Goodhill (1993) have demonstrated that such models can provide an explanation of the formation of retinal topography, orientation, and ocular dominance columns, and the separation of the "on" and "off" signals. Linsker (1990) and more recently Miller (1994) have demonstrated that these correlation-based models can be partly understood through examination of the correlations in their inputs. Erwin et al. (1995) have lately provided an excellent critique of such models. All the above work deals with black and white imagery; none considers color inputs, and the separate correlations they entail. In this work, we examine whether a similar-styled explanation to that provided by such models for various types of functional segregation in the cortex can also be given for the development of color blobs. Because the output of such an activity-based model is determined by the correlations within the inputs (Miller 1994), it is essential that both the statistics of the color inputs presented and precortical processing on these inputs are realistic. Accordingly, we use input from natural color images, and model preprocessing at the retina and lateral geniculate nucleus (LGN), similar to that for the parvochannel in macaque monkeys.
However, we recognize that an accurate model of color processing in primates (in line with detailed, known neurophysiology) would have to be considerably more detailed than that described here, involving a more complicated architecture. Our model attempts to maintain the simplicity of the similar models mentioned above, while accommodating essential biological constraints, to show that Hebbian development on visual scenes can lead to structures similar to those found in actual cortex; we do not pretend to model such biological structures precisely. The simple architecture of our model is shown in Figure 1.

Figure 1: Architecture of the network model. The model consists of three stages: the retina, LGN, and striate cortex. We model the on- and off-center cells for the broad-band channel and for the red/green color-opponent channels so that six types of retinal ganglion cell project from the retina to the LGN. Lateral inhibition is applied to the channels independently in the LGN, and the corresponding six types of geniculate cells project to cells in the cortex. We model the cortex using a dual-population model that simulates short-range lateral excitation and longer-range inhibition. When the network activity is stable, the synaptic weights between geniculate and cortical cells adapt according to a Hebbian-type rule.

2.1 Precortical Processing. Inputs are presented from color images, represented as arrays in the red, green, and blue spectral bands. We model responses of on-center and off-center cells for the broad-band and red/green channels.¹ There are therefore six types of retinal and six types of LGN cells (Table 1). On- and off-center channels at the retina

¹We ignore the blue/yellow channel since it involves only a small fraction of all ganglion cells.
Table 1: Retinal and LGN Cells

                On-channel            Off-channel
                Retina     LGN        Retina     LGN
  Broad-band    R_b+       L_b+       R_b-       L_b-
  Red-green     R_rg+      L_rg+      R_rg-      L_rg-
  Green-red     R_gr+      L_gr+      R_gr-      L_gr-
Table 2: Outputs of the Retinal Cells

  On-channel      Linear                                  Nonlinear
  Broad-band      r_b+  = G_rc * I_y - k · G_rs * I_y     R_b+  = max(r_b+  + b, 0)
  Red-green       r_rg+ = G_rc * I_r - k · G_rs * I_g     R_rg+ = max(r_rg+ + b, 0)
  Green-red       r_gr+ = G_rc * I_g - k · G_rs * I_r     R_gr+ = max(r_gr+ + b, 0)
  Off-channel
  Broad-band      r_b-  = k · G_rs * I_y - G_rc * I_y     R_b-  = max(r_b-  + b, 0)
  Red-green       r_rg- = k · G_rs * I_g - G_rc * I_r     R_rg- = max(r_rg- + b, 0)
  Green-red       r_gr- = k · G_rs * I_r - G_rc * I_g     R_gr- = max(r_gr- + b, 0)
are modeled using a difference of gaussians, where each gaussian kernel is applied to a particular spectral band. For example, the output of a red-green opponent cell is a function of the difference between a small red-sensitive gaussian center and a larger green-sensitive gaussian surround. Assuming a small retinal gaussian G_rc and a larger one G_rs, we define nonlinear outputs R of the retinal cells as thresholded versions of linear difference-of-gaussian signals r (Table 2), where I_r, I_g, I_y are the red, green, and gray-level intensity images (i.e., outputs of the retinal cones) and * is the convolution operator. Each gaussian is normalized to have unit integral, and k determines the relative weighting of center and surround. To obtain the nonlinear output R we superimpose r on a background firing rate b and clip the negative component of the new signal to zero. In line with Wehmeier et al. (1989), the gaussians have standard deviations in the ratio of 1:3; in line with Robson (1983), k = 0.88. These six signals are carried (by the retinal ganglion cells) to the LGN. The broad-band channel carries mainly luminance contrast information, whereas the red/green channels carry color information. In Figure 2a the output of the red-green on-center ganglion cells is shown when the picture in Figure 6a is presented as input. We use difference-of-gaussian processing again at the LGN to simulate local inhibition. We assume a central gaussian the same size as the retinal one, σ_lc = σ_rc, and a larger surround such that σ_ls = 2σ_lc. We also assume equal weighting of the center and surround so that, within image regions of uniform color and intensity, the LGN response will be zero. Again, we superimpose a linear signal l against a background firing rate b' to provide the output L of the six cell types (Table 3). In Figure 2b the output of the red-green on-center geniculate cells is shown when the picture in Figure 6a is presented; the color information carried by the retinal cells now remains only where there is also a luminance contrast.

Figure 2: Output of retinal and geniculate cells. (a) On the left, the output R_rg+ of the red-green on-center retinal cells is shown when the image in Figure 6a is presented. While broad-band cells carry spatial contrast information alone, it can be seen that the color-opponent cells also code the difference between red and green spectral input. Hence cells in the red body of the parrot are highly active. (b) The output of the geniculate cells L_rg+, given the same input, is shown (histogram equalization has been used to aid display). Lateral inhibition at this stage enhances spatial contrast and removes the DC color signal. As a result, color information is transmitted only where there is also spatial contrast, i.e., at boundaries.

Table 3: Output of the Six Cell Types

  For each retinal channel X (X = b+, rg+, gr+, b-, rg-, gr-):
  l_X = G_lc * R_X - G_ls * R_X      L_X = max(l_X + b', 0)
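The precortical stage above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes scipy.ndimage.gaussian_filter for the gaussian convolutions, takes the off-channel linear signals to be the sign-reversed on-channel signals (an assumption consistent with the on/off rectification described in the text), and uses the 1:3 retinal and equal-weight, double-width LGN sigma ratios given above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retina_lgn(I_r, I_g, I_y, sigma_rc=2.0, k=0.88, b=0.1, b_lgn=0.1):
    """Difference-of-gaussians retina followed by LGN lateral inhibition."""
    def dog(center_img, surround_img, s_c, s_s):
        # Small center gaussian minus k-weighted larger surround gaussian.
        return gaussian_filter(center_img, s_c) - k * gaussian_filter(surround_img, s_s)

    s_rs = 3.0 * sigma_rc  # retinal surround sigma (1:3 ratio)
    # Linear on-channel signals; off-channels are taken as their negatives.
    r_on = {
        "b":  dog(I_y, I_y, sigma_rc, s_rs),   # broad-band (gray level)
        "rg": dog(I_r, I_g, sigma_rc, s_rs),   # red center, green surround
        "gr": dog(I_g, I_r, sigma_rc, s_rs),   # green center, red surround
    }
    R = {}
    for name, r in r_on.items():
        R[name + "+"] = np.maximum(r + b, 0.0)    # on-center: rectified  r + background
        R[name + "-"] = np.maximum(-r + b, 0.0)   # off-center: rectified -r + background

    # LGN: equal-weight center/surround DoG applied per retinal channel,
    # so uniform regions give zero signal before the background rate.
    s_lc, s_ls = sigma_rc, 2.0 * sigma_rc
    L = {name: np.maximum(gaussian_filter(x, s_lc) - gaussian_filter(x, s_ls) + b_lgn, 0.0)
         for name, x in R.items()}
    return R, L
```

On a uniform image the LGN output reduces to the background rate everywhere, which is the "zero response within uniform regions" property stated in the text.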
2.2 Cortical Processing. Our model of simple cell connectivity and self-organization is based on that first proposed by von der Malsburg (1973) and Miller (1990). It consists of two types of unit representing the two broad categories of cells in the primate striate cortex. The excitatory units represent spiny cells and the inhibitory ones represent smooth cells. Their numbers are in the ratio of 4:1 to reflect observed cell counts. Both types of cells receive excitatory feedforward input from the LGN. Excitatory cells excite all neighboring cells within a short radius, and the inhibitory cells inhibit all neighbors within a larger radius. The cortical units receive feedforward input through their connections to geniculate cells. The recurrent network settles into a stable state where intracortical feedback is no longer changing. All units then adapt their connections to the geniculate cells using a Hebbian-type learning rule. This process of presenting inputs, settling, and adapting continues until the feedforward connections themselves settle to stable values.
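The present-settle-adapt cycle just described can be sketched schematically. This is an illustrative skeleton only: the settling step below uses a crude rectified-linear update with global inhibition as a stand-in for the membrane-potential dynamics of the unit model, and the unit counts and per-unit weight normalization (6.0) are taken from the simulation details later in the paper; all function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

n_lgn, n_exc, n_inh = 1350, 900, 225   # 6 x 15 x 15 LGN inputs; 30 x 30 + 15 x 15 cortex
alpha = 0.001                          # Hebbian learning constant

# Feedforward weights: random positive, normalized to sum 6.0 per cortical unit.
W = rng.uniform(0.0, 1.0, size=(n_exc + n_inh, n_lgn))
W *= 6.0 / W.sum(axis=1, keepdims=True)

def settle(x, W, steps=50):
    """Iterate recurrent dynamics to a fixed point (placeholder update:
    feedforward drive minus a crude global inhibition term, rectified)."""
    y = np.zeros(W.shape[0])
    for _ in range(steps):
        y = np.maximum(W @ x - 0.5 * y.mean(), 0.0)
    return y

def present(x, W):
    """One presentation: settle the network, then apply a Hebbian update
    followed by renormalization of each unit's total feedforward weight."""
    y = settle(x, W)
    W += alpha * np.outer(y, x)              # Hebbian increment: dW ~ alpha * y * x
    W *= 6.0 / W.sum(axis=1, keepdims=True)  # keep total weight per unit at 6.0
    return y, W
```

Repeating `present` over many image fragments is the outer loop of the model; training stops when the feedforward weights no longer change appreciably.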
2.2.1 Unit Model. Cortical cells are modeled with a membrane potential equation:

C dv/dt = g+(v+ - v) + g-(v- - v) + gl(vr - v)

where C is the cell membrane capacitance, v is the membrane potential, v+ and v- are the reversal potentials for sodium and potassium, respectively, and vr is the resting potential. Conductances g+, g-, gl are for sodium, potassium, and leakage, respectively, and we assume that g+ and g- depend linearly upon summed excitatory and inhibitory input, respectively. The equation can be rewritten as

τ dv/dt = v∞ - v,   where   v∞ = (v+ g+ + v- g- + vr gl) / (g+ + g- + gl)   and   τ = C / (g+ + g- + gl).

v∞ is an attractor to which the ambient voltage is drawn exponentially if conductances remain constant, and τ is the cell time constant. Cells are treated as simple fixed-threshold relaxation oscillators that fire when the membrane potential reaches a threshold value vθ and then reset their potential to v0. The firing rate f is given by

f = 1 / (τr + tθ)
where τr is an absolute refractory period and tθ is the time taken for the voltage to rise from v0 to vθ. The output f is plotted against g+ and g- as shown in Figure 3; it has a fixed threshold (the cell fires only if v∞ reaches vθ), approximates linearity when just above threshold, and saturates with a value 1/τr if g+ >> g-.

Figure 3: Unit output f as a function of g+ and g-.

Excitatory and inhibitory conductances, g_j+ and g_j-, are weighted sums of stimulation from connecting LGN and cortical cells:

g_j+ ∝ Σ_k u_jk y_k + Σ_i w_ji x_i    and    g_j- ∝ Σ_l v_jl z_l

where y_k is the output of excitatory cortical cell k, z_l is the output of inhibitory cell l, x_i is the output of LGN cell i, and u_jk, v_jl, and w_ji are connection strengths between cortical cell j and excitatory cortical, inhibitory cortical, and LGN cells, respectively. In our simulations we set C = 4.15, gl = 0.01, v+ = 55.0, v- = -85.0, vr = -70.0, v0 = -85.0, vθ = -55.0, and τr = 0.1.
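Under this unit model the steady-state rate of a cell can be computed directly from its conductances: the potential relaxes exponentially toward its attractor, so the inter-spike interval is the refractory period plus the exponential rise time from reset to threshold. A small sketch using the parameter values just quoted (the function name is ours):

```python
import numpy as np

# Parameters from the text.
C, gl = 4.15, 0.01
v_na, v_k, v_rest = 55.0, -85.0, -70.0
v_reset, v_theta, tau_ref = -85.0, -55.0, 0.1

def firing_rate(g_exc, g_inh):
    """Firing rate of the fixed-threshold relaxation-oscillator unit."""
    g_tot = g_exc + g_inh + gl
    v_inf = (v_na * g_exc + v_k * g_inh + v_rest * gl) / g_tot  # voltage attractor
    tau = C / g_tot                                             # cell time constant
    if v_inf <= v_theta:
        return 0.0  # attractor below threshold: the cell never fires
    # Exponential relaxation from v_reset toward v_inf crosses v_theta after:
    t_theta = tau * np.log((v_inf - v_reset) / (v_inf - v_theta))
    return 1.0 / (tau_ref + t_theta)
```

The sketch reproduces the three qualitative properties stated above: zero output below threshold, a roughly linear rise just above it, and saturation toward 1/τr = 10 when excitation dominates.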
2.2.2 Adaptation. Intracortical connections have a fixed gaussian distribution, inhibition extending further than excitation. The adaptive feedforward connections between cortical and geniculate cells are initialized with random positive values (from the uniform distribution 0.0...1.0), modulated by a gaussian function of distance, and normalized such that the sum of weights from the 6 types of geniculate cell connecting to any cortical cell is 6.0. Small fragments of image are presented to the model, network activity settles, and then the connections w_ci are adapted using a variant of the Hebbian learning rule:

Δw_ci = α · x_i · y_c · p_ci
where α is a small learning constant (α = 0.001), c is an index over both excitatory and inhibitory cortical cells, and p is a probability function maintaining a gaussian distribution of connection strength.

2.3 Simulations. Inputs were sampled from the two color images (resolution 320 x 200) in Figure 6. For computational efficiency, we applied retinal and LGN processing once throughout each image at the start of the simulation, storing the six arrays of LGN output. The free parameter G_rc that determines spatial scale was set at 2 pixels.² Subsequently, for each input presentation we extracted six 15 x 15 subarrays from the stored arrays. This "window" was centered on a random location in the image, providing the output of 1350 LGN cells in total. The outputs of these cells were normalized so that the total amount of activity, summed over all channels, was constant (= 6.0). The cortex was modelled with an array of 30 x 30 excitatory units and a spatially coextensive array of 15 x 15 inhibitory units. All 1125 cortical units received feedforward input from the 1350 geniculate cells, each unit having its own connection weight to each geniculate input.³ The strength of intracortical feedback connections was a gaussian function of separation: the excitatory and inhibitory gaussians had standard deviations equal to 5 and 23% of the cortical space, respectively.⁴

2.4 Results.
2.4.1 Geniculocortical Connections. It can be seen from Figure 4 that most excitatory units develop similar connections in the broad-band, red-green, and green-red channels. The profiles of these connection strengths show two-dimensional (2D) orientation preference, with many of the profiles resembling 2D Gabor functions (Daugman 1985). This is to be expected from the work of Linsker (1990), Miller (1990), and Barrow (1987). It is striking, however, that there are small clusters of cells embedded within this sea of oriented fields that are different. These cells code color

²The ratios are such that σ_rs/3 = σ_rc = σ_lc = σ_ls/2. ³The gaussian probability function p that maintains an envelope on geniculocortical weights had a standard deviation of 3.5. ⁴There were no wrap-around effects, but the sum of each unit's excitatory and inhibitory connections was independently normalized to have a set value.
information and their connections to the red-green and green-red geniculate cells appear to be center-only (though occasionally there appears to be an added center-surround component). Note particularly that their connections to the broad-band cells have almost zero strength. For the most part, these color-selective cells seem to be responsive to red-ness, although there is at least one region that appears to encode green-ness. The red-sensitive cells make strong connections to both the on-center red-green and the off-center green-red geniculate cells (the green-sensitive cells connecting strongly to the other two types of color-opponent geniculate cells). Figure 5 shows the weight patterns for a small sample of the excitatory units (which develop weight patterns very similar to inhibitory units at the same cortical location). In this sample it is easy to distinguish between the cells that are coding for orientation and the nonoriented cells that are coding color. We also show some cells (more unusual) that exhibit less obvious profiles, and appear to combine both color- and orientation-selective properties.
2.4.2 Average Receptive Fields. To analyze the receptive field properties of the units in the network it is necessary to go beyond studying the connections between geniculate and cortical cells in isolation. The receptive field of a cortical unit is a function of preprocessing in the retina and LGN, as well as intracortical feedback. Studying the adaptive weights gives insight into the average geniculate output that stimulates a cortical cell (since the weights tend toward a weighted average of the patterns to
Figure 4: Facing page. Connections between geniculate and excitatory cortical units. In the large display, connections between red-green geniculate cells and the excitatory cortical cells are shown. Connection strengths to the off-cells L_rg- have been subtracted from those to the on-cells L_rg+ for each cortical unit. For most cells, connections display selectivity to orientation, and preferred orientation varies smoothly across the cortical area. Connections for such cells are similar in the broad-band and green-red channels to those shown here for the red-green. However, there are clusters of cells that tend to display no orientation preference. In the three smaller displays beneath we show connections for one such cluster in all three broad-band, red-green, and green-red channels (left to right). These cells tend to have very weak connections to all broad-band geniculate cells, while their connections to either "green-sensitive" or "red-sensitive" geniculate cells tend to be strong (with zero connections to cells of the opposing color), and have a center-surround profile. We suggest that such cells respond to "red-ness" or "green-ness," and receive little information from the broad-band channel.
Figure 5: Profiles of excitatory/geniculate connections. Four profiles are shown for the connections of units in Figure 4. The left side of the profile shows the difference between broad-band on- and off-center connections; the middle shows a similar difference for the red-green channel, and the right for the green-red. Top left shows unit (11,15) (whenever indexing into displays we use raster notation where the origin is top-left, the first number indexes right and the second indexes downwards) - a typical color-sensitive cell with nonoriented connections to red-green on-center and green-red off-center cells, and very weak connections to broad-band cells. Top right shows unit (10,6) - a typical orientation-selective cell; profiles in the three color channels are similar and resemble 2D Gabor functions. Bottom left shows cell (18,26) - it has similar center-surround profiles in all three color channels, suggesting it codes contrast regardless of orientation. Finally, the bottom right shows cell (25,17) - this appears to have combined properties of color-selective cells (strong connections to red-green on-center and green-red off-center) and contrast-sensitive cells (off-center center-surround profiles). It seems from examination that although most cells fall easily into one of the two categories for color or orientation selectivity, there are a small number that have more complex profiles. These complexities highlight a need for different methods for determining their receptive fields.
which a unit responds). However, it does not tell us directly the retinal input that will make the cell active. To discover the sort of patterns to which different cortical units were responding (once the network's cortical connections were stable) we computed the average visual input that activated each unit. That is, we computed a weighted average of visual input for each cortical unit, where the weighting was determined by the output of the unit, after settling. Hence, if a unit responded strongly to a pattern then this pattern contributed a proportionately large amount to the average, whereas if it did not respond at all then it contributed nothing. Figure 6b displays weighted averages for the inhibitory units⁵ (behavior of the inhibitory units is almost identical to that of the excitatory ones, but since there are fewer of them patterns are easier to display). It is apparent that units are generally responding either to edges and bars regardless of color, or to chromatic boundaries regardless of orientation. In this example such color selection is strongly biased toward red stimuli.⁶ In another experiment, identical except that visual input was taken from a wider variety of images, an equal representation of red and green was found.

2.4.3 Artificial Stimuli. As a final means of determining what sort of visual pattern activated the different cortical units we performed some limited "neurophysiological" experiments with artificial visual stimuli. We hypothesized that units outside the "blobs" would respond maximally to a luminance edge or bar of the optimal orientation, regardless of its color, whereas those units within blobs would respond maximally to a center-surround stimulus of the correct color contrast. Accordingly, we chose a few of the units illustrated in Figure 6b and attempted to find optimal stimuli for them. First, we created a visual stimulus corresponding to a circular colored spot against a different colored background. We parameterized this stimulus by spot color, spot size, and background color. We then selected a unit we expected to be selective to color contrast only.
We varied the visual stimulus until we found those values of the parameters that elicited the greatest response from that unit (in the context of the whole network).⁷ We then created stimuli with the same radius while altering the spot and background colors. We found that the unit typically responded to these stimuli only if there was an appropriately colored center. That is to say, if the unit was from a red blob then a red component to the spot color was usually necessary; if from a green blob, a green center was required. Figure 7a and 7b shows the results of this process for two color-selective units, (1,12) and (11,14) in Figure 6. Second, we created a visual stimulus corresponding to a black/white step-edge, parameterized by phase and orientation. We then selected a unit we expected to be selective for orientation only. We varied the visual stimulus until we found those values of phase and orientation that elicited the greatest response from that unit (in context of the whole

⁵To display these, we computed a single average color over all units and subtracted this color from the profile for each unit. ⁶The eigenvector analysis of the images used (in the next section) reflects this bias. ⁷This was done using a simple hill-climbing algorithm for optimization.
network). We then created stimuli with the same orientation and phase, while altering the color of the edge. We found that, regardless of edge color, these stimuli normally elicited a strong response from the unit. Figure 7c shows results for orientation-selective unit (4,1) in Figure 6.

3 A Linear Analysis
The model that we have presented reflects some of the details and nonlinearities of the primate visual system: retinal ganglion cells code for color contrast and luminance contrast, also separating information into on- and off-channels; LGN cells similarly have thresholds, and excitatory and inhibitory inputs may affect cell firing rates nonlinearly. However, it has been argued that the cortical cells we model do have a quasilinear response in their dynamic range (e.g., Stafstrom et al. 1984). It has also been argued that the goal that early visual processing is attempting to satisfy is to recode the input using a set of orthogonal, approximately linear basis functions, so compressing the input in a near-optimal manner (see Linsker 1990). In light of this, it is interesting to ask what is the optimal linear coding of the inputs that we present to our network. If the inputs, which are high-dimensional vectors, occupy only a subspace of the whole space, then we expect that the ”features” a quasilinear network such as ours learns will lie within this subspace, and be a function of its structure. 3.1 Determining the Principal Components. We computed the principal components for the set of inputs taken from the two images in Figure 6. We split the images into their three spectral bands, and generated the set of inputs to our network by extracting three 20 x 20 patches, one from each of the spectral images, for every position in the image (each position constituting an input). Each patch was modulated with
Figure 6: Facing page. Images, receptive fields, and eigenvectors. (top) Two color images (resolution 320 x 200). (center) For each inhibitory cell in the cortical array a weighted average over many visual patterns is shown: the weighting is proportional to the response of the unit to that pattern. It is apparent that units are either selective to oriented edges or bars without preference for color, or selective for "red" input without preference for orientation. (bottom) The first sixteen principal components of all small patches (20 x 20) taken from these two images are shown, with their eigenvalues beneath. Most of the eigenvectors have very similar spectral components. However, there are some noticeable exceptions: the second eigenvector is unoriented and "red" and the sixth is unoriented and "green." Later eigenvectors with smaller eigenvalues also show oriented red-cyan edges.
Figure 7: Cell responses to artificial stimuli. (a, b) The colored center-surround stimuli that elicit maximal responses from the two inhibitory units (1,12) and (11,14) (see Figure 6) are shown on the left. Unit (1,12) strongly prefers red/magenta against a blue background, whereas unit (11,14) prefers green against a black background. In the tables on the right are shown the unit's responses to stimuli with the optimal geometry, but with spot and background being any combination of white (W), red (R), green (G), blue (B), yellow (Y), magenta (M), or cyan (C) (rows determining center-color, and columns the surround-color). Precise interpretation is difficult, but it is apparent that unit (1,12) has a general preference for a red component in its center (i.e., red, yellow, or magenta) and unit (11,14) responds only if the central region has a green component (i.e., green or cyan). (c) The black/white step-edge that elicited maximal response from the inhibitory unit (1,14) (see Figure 6) is shown on the left; the stimulus is parameterized by phase and orientation. The table on the right shows the unit's response to stimuli of optimal orientation and phase, but with either the left (l) or right (r) side of the edge being black and the other being one of W, R, G, B, Y, M, or C. The responses to the 14 stimuli are shown: in cases where the direction of luminance contrast was correct, the unit was responsive regardless of color.
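The optimal-stimulus searches behind these experiments used a simple hill-climbing algorithm over the stimulus parameters (footnote 7). A generic sketch of that procedure, with a hypothetical `response` function standing in for the settled output of a network unit (the function and parameter names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def hill_climb(response, theta0, step=0.1, iters=200):
    """Stochastic hill climbing over stimulus parameters theta
    (e.g., spot color, spot size, and background color).

    `response` maps a parameter vector to a unit's settled output;
    random perturbations are kept only when they increase the response."""
    theta = np.asarray(theta0, dtype=float)
    best = response(theta)
    for _ in range(iters):
        cand = theta + rng.normal(0.0, step, size=theta.shape)
        r = response(cand)
        if r > best:            # greedy: accept only improvements
            theta, best = cand, r
    return theta, best
```

With a unimodal response surface this converges near the preferred stimulus; the paper then probes the found geometry with different color combinations, as in the tables of Figure 7.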
a gaussian of ~7 = 5.0 pixels to remove directional bias and possible edge effects, and the three patches were composed to give a vector of 1200(= 3 x 20 x 20) elements. The two images of resolution 320 x 200 yield 108,962 such vectors and we found the principal components of this set by computing the eigenvectors of the corresponding covariance matrix (a matrix of 1200 x 1200). Finally, we took the 16 eigenvectors with largest eigenvaluess and reconstructed the color image patches that they represent. 3.2 Results. These 16 color eigenvectors, with their corresponding eigenvalues, are shown in Figure The first component is a DC luminance component; the second is unoriented, and has a strong positive red DC component;'O the third to fifth are low spatial-frequency contrast components without spectral differentiation; the sixth is unoriented, and has a strong positive green component; the remainder reflect higher spatial-frequency components, and generally show little spectral differentiation except for the eleventh and twelfth components, which are orthogonal red-cyan edges. From this we make the following conclusions: 0
Luminance: The largest variation in the data by far is that of general luminance, which is reflected in the first eigenvector.

Contrast: A large amount of the variation in the data can be accounted for as variation along a small number of vectors having similar profiles in the three spectral subspaces. These eigenvectors seem to code luminance contrast independently of color and resemble oriented bars and edges, although higher spatial-frequency contrasts are reflected in the many eigenvectors with small eigenvalues. A very small amount of variation can be accounted for as variation along vectors having different profiles in the spectral subspaces; these seem to code color contrast.

Color: A major variation in the data can be accounted for by two vectors for which the variation depends upon general color but not spatial structure.

General: It seems that the optimal linear code keeps color and contrast apart, bringing them together only to account for subtle variations in the data. The noncolor part of the code reflects the known power spectrum of natural images (Field 1987).
⁸The sum of the first 16 eigenvalues is over 90% of the sum of all eigenvalues.
⁹For display, g_i = 127.5 + (e_i × 127.5/e_max), where i indexes into eigenvector e, g (0 ≤ g ≤ 255) is the spectral intensity, and e_max is the maximum absolute value in e.
¹⁰It should be borne in mind that all eigenvectors are orthogonal to one another; therefore, this vector with a positive red component must also have a negative cyan component (dominated by the red in the display) to maintain orthogonality to the first eigenvector.
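The principal-component analysis described in the text can be sketched as follows. The patch dimensions (3 spectral channels × 20 × 20 pixels = 1200 elements) follow the text; the random data merely stands in for the 108,962 preprocessed image patches, so the resulting components here are meaningless and serve only to illustrate the computation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the preprocessed patch vectors: each patch is
# 3 spectral channels x 20 x 20 pixels = 1200 values.
patches = rng.normal(size=(5000, 1200))

# Covariance matrix of the patch ensemble (1200 x 1200).
cov = np.cov(patches, rowvar=False)

# Eigen-decomposition; eigh returns eigenvalues in ascending order,
# so reverse to put the largest first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the 16 leading eigenvectors, as in the text, and reshape each
# back into a 3 x 20 x 20 color image patch for display.
components = eigvecs[:, :16].T.reshape(16, 3, 20, 20)
fraction = eigvals[:16].sum() / eigvals.sum()
```

With natural-image statistics rather than this random stand-in, the leading 16 eigenvalues would account for over 90% of the total variance, as footnote 8 reports.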
H. G. Barrow, A. J. Bray, and J. M. L. Budd
We carried out similar analyses for different images and smaller window sizes and found that window size made no significant difference to the eigenvectors and eigenvalues. Different images yielded some small variation in the eigenvectors and in their ordering, reflecting different statistics; however, they supported the same general conclusions.

3.3 Interpretation. It is interesting to compare the differences between the linear code and what is known of the biological code, with reference to the network described above, and to speculate upon reasons for the differences. First, the eigenvector analysis suggests that color coding, and some general luminance coding, should be DC, ignoring spatial contrast. However, in reality both color and luminance contrast are implemented at the earliest retinal stages in the mammalian visual system. For this reason we are particularly bad at judging either absolute luminance or color in the absence of boundaries (see Land 1964). We suggest two possible speculative explanations for why the biological design may have evolved to be as it is. One reason is that the principle of optimality guiding evolution may be different from that of maximizing information: an evolving organism is not interested in all information but in that information that maximizes its survival chances. It may be that color and luminance contrast are most useful for survival, and that the DC component, if not filtered out early, would dominate and reduce fitness. Another reason might be that to code the DC component would require neurons to be accurate over such a huge range that their resolution would be poor; by ignoring the DC they can have an adaptive dynamic range with much higher resolution (this would be similar to the strategy adopted by foveal sampling of the image).
Second, examination of the receptive fields in Figure 6 and Figure 7 seems to suggest that many color-selective cells are interested in the overall color of their input, regardless of luminance or color contrast. This would be expected from the eigenvector analysis; however, it is not the same as the properties of cells in color blobs, which exhibit both color and luminance contrast selectivity, and it would be surprising since we modeled both types of contrast at the retina. However, in our model cortical cells can respond only to input patterns containing luminance or color contrast;¹¹ these cells therefore respond not to "red-ness" per se but to red color borders, regardless of the orientation of the border or the color making up its other side.¹² As such, they must be classified as red-center, broad-band-surround, in line with Figure 7a. We predict that a larger model, with cells having a larger inhibitory range, would result in increased response selectivity; for instance, cells might respond only to red-green borders at all orientations, and so become red-center green-surround.
¹¹Otherwise geniculate cells will have zero output.
¹²It is this generality of response that leads to the cell's high metabolic rate.
4 Discussion
In the research reported here, we have demonstrated that it is possible for color-blob type structures to develop in an artificial neural network purely through activity-induced adaptation in response to natural color images. Moreover, a single mechanism simultaneously develops unoriented color-selective cells and orientation-selective achromatic cells, with smooth variation of receptive fields across the cortical array. The key components of the model that are responsible for this behavior include the following:
Input from natural color images. The local statistics of images determine the types of receptive field developed by cortical units.

Multichannel preprocessing of image data. The center-surround receptive fields of retinal and geniculate cells accentuate the importance of contrast and color contrast at boundaries in the image.

Hebbian-type unsupervised adaptation of cortical units. This results in units with weight patterns related to the principal components of preprocessed image fragments.

Lateral excitatory and inhibitory interaction within the cortex. The long-range inhibition implements competition among units, so that they do not all develop the same pattern of weights. The short-range excitation causes neighboring units to develop similar weight patterns, so that receptive field properties vary smoothly across the array.
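A minimal sketch of how the last two components interact: Hebbian updates driven by input activity, shaped by a lateral profile of short-range excitation and long-range inhibition. The array sizes, learning rate, difference-of-Gaussians lateral profile, and weight normalization are illustrative assumptions, not the parameters of the model described here:

```python
import numpy as np

rng = np.random.default_rng(1)

n_units, n_inputs = 64, 100          # illustrative 1-D cortical array
W = rng.uniform(0.0, 0.1, size=(n_units, n_inputs))
eta = 0.01                           # assumed learning rate

# Lateral interaction: short-range excitation minus long-range inhibition
# (an assumed difference-of-Gaussians profile over unit distance).
d = np.abs(np.arange(n_units)[:, None] - np.arange(n_units)[None, :])
L = np.exp(-d**2 / (2 * 1.5**2)) - 0.5 * np.exp(-d**2 / (2 * 6.0**2))

for _ in range(200):
    x = rng.normal(size=n_inputs)    # stand-in for a preprocessed patch
    y = np.maximum(W @ x, 0.0)       # rectified feedforward response
    y = np.maximum(L @ y, 0.0)       # lateral competition/cooperation
    W += eta * np.outer(y, x)        # Hebbian update
    W /= np.linalg.norm(W, axis=1, keepdims=True)  # keep weights bounded
```

The normalization stands in for whatever weight-bounding mechanism the full model uses; the long-range inhibition keeps units from all converging on the same weight pattern, while the short-range excitation smooths receptive fields across neighbors.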
The combination of these components results in several of the characteristics of primate striate cortex. Moreover, related research suggests that the same mechanisms may also be responsible for ocular dominance stripes and retinotopic mapping (Miller et al. 1989; Miller 1990, 1994; Goodhill 1993). The apparent complexity and sophistication of cortical organization might be due primarily to the interaction of a basic set of mechanisms and architecture with the implicit structure of the sensory data. We should state firmly, however, that we do not claim that activity-induced adaptation is solely responsible for cortical organization. There is evidence, for example, that trophic mechanisms play a part in the development of retinotopic maps (see Constantine-Paton et al. 1990). Even if cortical organization were discovered to be largely specified in detail genetically, there would still be the question of how the specification might have evolved. Our experiments hint at a possible answer. Suppose that at an early evolutionary stage the primary mechanism of organization of some characteristic was activity-induced adaptation. While the final result might confer an advantage on a mature individual, an immature one would be more vulnerable. If evolution could discover a process that would accelerate development, the species would become
fitter. It is interesting to speculate that in some cases the plasticity of the nervous system might be the driving force in breaking new ground for a species, with genetic specification an optimization process following behind, rather than the other way round (see Hinton and Nowlan 1987).
Acknowledgments
Thanks to Jim Stone for various suggestions relating to this work, and for detailed comments on the paper. This work has been supported by a grant from the UK Science and Engineering Research Council and the Ministry of Defence.
References

Barrow, H. G. 1987. Learning receptive fields. IEEE First International Conference on Neural Networks, IV, 115-121.
Barrow, H. G., and Bray, A. J. 1992. Activity-induced "colour blob" formation. In Artificial Neural Networks II: Proceedings of the International Conference on Artificial Neural Networks, I. Aleksander and J. Taylor, eds. Elsevier, Amsterdam.
Carroll, E., and Wong-Riley, M. T. T. 1982. Light and E.M. analysis of cytochrome oxidase-rich zones in the striate cortex of squirrel monkeys. Soc. for Neurosci. Abstr. 8, 706.
Constantine-Paton, M., Cline, H. T., and Debski, E. 1990. Patterned activity, synaptic convergence, and the NMDA receptor in developing visual pathways. Annu. Rev. Neurosci. 13, 129-154.
Daugman, J. G. 1985. Uncertainty relation for resolution in space, spatial frequency, and orientation optimised by two-dimensional visual cortical filters. J. Opt. Soc. Am. 2, 1160-1169.
Erwin, E., Obermayer, K., and Schulten, K. 1995. Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neural Comp. 7, 425-468.
Field, D. J. 1987. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. 4, 2379-2394.
Goodhill, G. J. 1993. Topography and ocular dominance: A model exploring positive correlations. Biol. Cybern. 69, 109-118.
Gouras, P. 1974. Opponent-colour cells in different layers of foveal striate cortex. J. Physiol. (London) 238, 583-602.
Hendrickson, A. E., Hunt, S. P., and Wu, J. Y. 1981. Immunocytochemical localization of glutamic acid decarboxylase in monkey striate cortex. Nature (London) 292, 605.
Hinton, G. E., and Nowlan, S. J. 1987. How learning can guide evolution. Complex Syst. 1, 495-502.
Horton, J. C. 1984. Cytochrome oxidase patches: A new cytoarchitectonic feature of monkey visual cortex. Phil. Transact. Royal Soc. London (Biol.) 304, 199-253.
Model of Color Blob Formation
1447
Horton, J. C., and Hubel, D. H. 1981. Regular patchy distribution of cytochrome oxidase staining in primary visual cortex of macaque monkey. Nature (London) 292, 762-764.
Hubel, D. H., and Wiesel, T. N. 1968. Receptive fields and functional architecture of monkey striate cortex. J. Physiol. 195, 215-243.
Kuljis, R. O., and Rakic, P. 1990. Hypercolumns in primate visual cortex can develop in the absence of cues from photoreceptors. Proc. Natl. Acad. Sci. U.S.A. 87, 5303-5306.
Lachica, E. A., Beck, P. D., and Casagrande, V. A. 1992. Parallel pathways in macaque striate cortex: Anatomically defined columns in layer III. Proc. Natl. Acad. Sci. U.S.A. 89, 3566-3570.
Land, E. H. 1964. The retinex. Am. Sci. 52, 247-264.
Linsker, R. 1990. Self-organization in a perceptual system: How network models and information theory may shed light on neural organization. In Connectionist Modeling and Brain Function: The Developing Interface, S. J. Hanson and C. R. Olson, eds., pp. 351-392. MIT Press, Cambridge, MA.
Livingstone, M. S., and Hubel, D. H. 1984. Anatomy and physiology of a color system in the primate visual cortex. J. Neurosci. 4, 309-356.
Malach, R. 1992. Dendritic sampling across processing streams in monkey striate cortex. J. Comp. Neurol. 315, 303-312.
Michael, C. R. 1981. Columnar organization of color cells in monkey's striate cortex. J. Neurophysiol. 46, 587-604.
Miller, K. D. 1990. Correlation-based models of neural development. In Neuroscience and Connectionist Theory, M. A. Gluck and D. E. Rumelhart, eds., pp. 267-353. Lawrence Erlbaum Associates, Hillsdale, NJ.
Miller, K. D. 1994. A model for the development of simple cell receptive fields and the ordered arrangement of orientation columns through activity-dependent competition between ON- and OFF-center inputs. J. Neurosci. 14, 409-441.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1989. Ocular dominance column development: Analysis and simulation. Science 245, 605-615.
Nealey, T. A., and Maunsell, J. H. R. 1994. Magnocellular and parvocellular contributions to the responses of neurons in macaque striate cortex. J. Neurosci. 14, 2069-2079.
Purves, D., and LaMantia, A. 1993. Development of blobs in the visual cortex of macaques. J. Comp. Neurol. 334, 169-175.
Robson, J. G. 1983. Frequency domain visual processing. In Physical and Biological Processing of Images, O. J. Braddick and A. C. Sleigh, eds. Springer-Verlag, New York.
Stafstrom, C. E., Schwindt, P. C., and Crill, W. E. 1984. Repetitive firing in layer V neurons from cat neocortex in vitro. J. Neurophysiol. 52, 264-277.
Trusk, T. C., Kaboord, W. S., and Wong-Riley, M. T. T. 1990. Effects of monocular enucleation, tetrodotoxin, and lid suture on cytochrome oxidase reactivity in supragranular puffs of adult macaque striate cortex. Visual Neurosci. 4, 185-204.
Ts'o, D. Y., and Gilbert, C. D. 1988. The organization of chromatic and spatial interactions in the primate striate cortex. J. Neurosci. 8, 1712-1727.
Malsburg, C. von der 1973. Self-organisation of orientation sensitive cells in the striate cortex. Kybernetik 14, 85-100.
Wehmeier, U., Dong, D., Koch, C., and Van Essen, D. 1989. Modelling the mammalian visual system. In Methods in Neuronal Modeling, C. Koch and I. Segev, eds. MIT Press, Cambridge, MA.
Wiesel, T. N., and Hubel, D. H. 1966. Spatial and chromatic interactions in the lateral geniculate body of the rhesus monkey. J. Neurophysiol. 29, 1115-1156.
Wong-Riley, M. T. T. 1979. Changes in the visual system of monocularly sutured or enucleated cats demonstrable with cytochrome oxidase histochemistry. Brain Res. 171, 11-28.
Wong-Riley, M. T. T. 1994. Primate visual cortex: Dynamic metabolic organization and plasticity revealed by cytochrome oxidase. In Cerebral Cortex Vol. 10, Primary Visual Cortex in Primates, A. Peters and K. S. Rockland, eds., pp. 141-200. Plenum Press, New York.
Wong-Riley, M. T. T., and Carroll, E. W. 1984. The effect of impulse blockage on cytochrome oxidase activity in the monkey visual system. Nature (London) 307, 262-264.
Received September 14, 1994; accepted February 9, 1996.
Communicated by David Heeger
Functional Consequences of an Integration of Motion and Stereopsis in Area MT of Monkey Extrastriate Visual Cortex

Markus Lappe
Department of General Zoology and Neurobiology, Ruhr-University Bochum, D-44780 Bochum, Germany
Experimental evidence from neurophysiological recordings in the middle temporal (MT) area of the macaque monkey suggests that motion-selective cells can use disparity information to separate motion signals that originate from different depths. This finding of a cross-talk between different visual channels has implications for the understanding of the processing of motion in the primate visual system and especially for behavioral tasks requiring the determination of global motion. In this paper, the consequences for the analysis of optic flow fields are explored. A network model is presented that effectively uses the disparity sensitivity of MT-like neurons for the reduction of noise in optic flow fields. Simulations reproduce the recent psychophysical finding that the robustness of the human optic flow processing system is improved by stereoscopic depth information, but that the use of this information depends on the structure of the visual environment.

1 Introduction

Visual tasks are often defined with reference to only one specific visual modality, such as, for instance, motion. But in natural situations, biological vision systems usually have access to a number of additional visual or extraretinal signals that could support each other's functionality and contribute to solving the task. This has traditionally been largely ignored by many computational vision schemes, which focused on the understanding of specific vision mechanisms. But a merging of different modalities or visual maps is now receiving more attention in the computer vision and robotics communities, where it is referred to as "sensor fusion" (Clark and Yuille 1990). In this paper, a biologically plausible neural model of the functional consequences of an integration of motion and stereopsis is presented, which uses depth information to increase its robustness against noise in a visual navigation task.
Neural Computation 8, 1449-1461 (1996) © 1996 Massachusetts Institute of Technology

For visual navigation in an unknown environment, the optic flow field has long been considered a major source of information (Gibson 1950). In the case of a general self-motion in a rigid environment, i.e., observer translation and rotation with respect to a static scene, the task of heading detection from optic flow mathematically involves solving a large number of equations in a large number of unknowns (Koenderink and van Doorn 1987): besides the observer's rotation and translation direction, the distances of the visual objects in the scene are also unknown. An optimization scheme implemented in a neural network can solve this task within the psychophysically measured limits of human observers. This network shares a number of features with the known properties of neurons in primate extrastriate areas MT and MST (Lappe and Rauschecker 1993, 1995). However, it has also been demonstrated that human heading detection is heavily influenced by other visual modalities, most notably extraretinal eye movement signals (Warren and Hannon 1990; van den Berg 1993; Lappe et al. 1994). But the problem of heading detection from optic flow could also be simplified if the depths of the visible objects were explicitly known (Ballard and Kimball 1983), signaled for instance by the stereoscopic system. Mathematically, the optic flow field is a function of the observer's translation, his (eye) rotation, and the distances of the visible objects. Prior knowledge of any of these parameters would simplify the determination of the rest. Indeed, a recent psychophysical study was the first to show an effect of disparity on heading judgments, in that the robustness against noise is strongly increased when optic flow fields are presented stereoscopically (van den Berg and Brenner 1994b). This poses the question of the neuronal mechanisms that support this implicit use of stereoscopic depth. The aim of this work is to investigate whether the recently observed disparity dependence of MT neurons could serve this function.

2 Neurophysiological Findings of Disparity Sensitivity in Area MT
It has been known for some time now that motion-selective neurons in area MT of the macaque monkey also exhibit broad disparity selectivity, but are insensitive to motion in depth (Maunsell and Van Essen 1983a). But in a recent study, Bradley et al. (1995) found an interesting specific disparity dependence of MT responses to transparent motion. Previously it was demonstrated (Snowden et al. 1991) that the response of an MT cell to the motion of random dots in the cell's preferred direction is strongly reduced when a second, transparent dot pattern moves in the opposite direction. Bradley et al. (1995) now showed that in most neurons this response reduction occurs only when the disparity difference between the two countermoving dot patterns is within a certain limited range. When both patterns are clearly separated in depth, no response reduction is observed. This property of MT neurons might serve as the basis for the increased robustness against noise when optic flow stimuli are presented stereoscopically to human subjects. For optic flow fields simulating self-motion in a static, structured environment, visible objects close to each other in space usually give rise to similar optical velocities, while objects separated in depth move at different optical velocities. Thus, a spatial averaging of the visual motion signals within a restricted disparity range might improve the representation of the optic flow field in area MT, and provide an enhanced, noise-reduced input to optic flow processing neurons in the medial superior temporal (MST) area. The structure of the optic flow representation in MT is an important parameter for the modeling of system capabilities of heading detection (Lappe and Rauschecker 1995). In the following, the consequences of the disparity dependence of the motion signal averaging in MT for the flow field analysis are explored. To this end, a simple functional model of the integration of motion and stereopsis for the task of determining heading will be used. The main concern of this model is the representation of the optic flow field in area MT, taking into account the observed disparity dependence. To evaluate its implications for the determination of self-motion, presumably taking place in area MST, a heading detection scheme developed earlier (Lappe and Rauschecker 1993) is adopted.

3 Visual Computation of Heading
The algorithm for heading detection determines the most likely direction T by minimizing a certain residual function R(T) (Heeger and Jepson 1992). The neural implementation of this scheme solves the minimization by determining R(T) for various candidates T_j, and then choosing the optimal T_j by a winner-take-all mechanism. This computation involves only two layers of neurons. The first layer forms a representation of the optic flow input. This representation shall be based on the properties of MT neurons, and will be presented in more detail later. Its basic structure consists of sets of motion-selective neurons with different preferred velocities e_m and velocity tuning functions s_m, which are assumed to form a population encoding of an optic flow vector v:

v = Σ_m s_m e_m    (3.1)
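A caricature of this two-layer computation under invented scene parameters: the residual used below is a generic least-squares measure of how far each flow vector deviates from the radial direction out of a candidate heading T, a simplified stand-in for the Heeger-Jepson residual R(T) (which also handles eye rotation), and the second layer is the winner-take-all over candidates:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic flow field: pure observer translation toward a focus of
# expansion (FOE) at `true_foe`; depths are random.  All values invented.
true_foe = np.array([2.0, -1.0])
points = rng.uniform(-20.0, 20.0, size=(300, 2))   # image positions (deg)
depths = rng.uniform(2.0, 20.0, size=300)
flow = (points - true_foe) / depths[:, None]       # radial expansion flow

def residual(T):
    """Zero when every flow vector points radially away from T
    (a simplified stand-in for R(T))."""
    r = points - T
    cross = flow[:, 0] * r[:, 1] - flow[:, 1] * r[:, 0]
    return np.sum(cross**2)

# First layer feeds a grid of candidate headings; the second layer
# picks the candidate with minimal residual (winner-take-all).
grid = np.linspace(-10.0, 10.0, 21)
candidates = np.array([(x, y) for x in grid for y in grid])
best = candidates[np.argmin([residual(T) for T in candidates])]
# best recovers true_foe, since the residual vanishes there
```

In the neural version each candidate T_j is represented by a population whose activity grows as R(T_j) shrinks, so the argmin becomes a peak of population activity rather than an explicit comparison.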
These direction-selective neurons connect to a second layer, which contains cell populations that implement the computations necessary to determine R(T_j) and become maximally excited when R(T_j) = 0. The peak of neuronal population activity in this layer signals the best matching direction of heading (Lappe and Rauschecker 1993).

4 Functional Model of the Representation of Motion and Stereopsis in Area MT
Our investigation here is concerned with the functional consequences of the specific combination of motion and disparity signals in area MT
that has been found experimentally. For this reason, the question of how this combination is generated from the inputs of visual processing stages preceding area MT (Qian 1994; Wilson and Kim 1994; Nowlan and Sejnowski 1995) is not explicitly considered. Rather, a simple functional model of the representation of the flow field in area MT, serving as the input to the heading detection stage, is introduced.

4.1 Distributed Representation of Velocity. Most MT neurons are tuned for speed and direction (Maunsell and Van Essen 1983b). In single neurons, the speed and direction tuning are independent of one another (Rodman and Albright 1987). Here, the direction tuning is assumed to follow a rectified cosine function. A neuron's direction-specific response to a movement in direction φ is

s_dir(φ) = max{0, cos(φ − φ_p)}    (4.1)

where φ_p is the neuron's preferred direction. The speed tuning is modeled as a gaussian of the logarithm of the ratio between the actual speed v and the preferred speed v_p:

s_speed(v) = exp{−[log₂(v/v_p)]²}    (4.2)

The response s of the neuron is

s(v, φ) = s_speed(v) s_dir(φ)    (4.3)
The responses of groups of neurons are used to form a distributed representation of visual motion. In the simulations, four equally spaced direction preferences, φ_k = πk/2, k = 1, ..., 4, and eight speed preferences, v_p = 2^l deg/sec, l = −1, ..., 6, are used. Preferred speeds between 0.5 and 64 deg/sec are within the range of preferred speeds in MT (Maunsell and Van Essen 1983b). A distributed representation of velocity is obtained by summating the neuronal activities weighted by the speed and direction preferences of the neurons:

v̂ = Σ_n s_n e_n    (4.4)

where e_n is the preferred velocity vector of neuron n.

4.2 Spatial Integration at Different Scales by Extended Receptive Fields. Instead of using a single flow vector as input for an individual neuron, the two-dimensional spatial integration provided by the extended receptive field of the cell is incorporated first. It is assumed that the total response for a single neuron i is obtained by averaging its responses s_i(v_j, φ_j) to all flow vectors j that fall inside its receptive field R_i:

s̄_i = (1/N_i) Σ_{j∈R_i} s_i(v_j, φ_j)    (4.5)

where N_i is the number of flow vectors in R_i.
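The tuning functions and population readout described in this subsection translate directly into code. This is a sketch: the preference grid (four directions, eight octave-spaced speeds) follows the text, while normalizing the readout by total activity is an added assumption, since a plain weighted sum would scale with overall activity:

```python
import numpy as np

# Preference grid from the text: 4 directions, 8 octave-spaced speeds.
phi_prefs = np.pi * np.arange(1, 5) / 2            # phi_k = pi*k/2
v_prefs = 2.0 ** np.arange(-1, 7)                  # 0.5 ... 64 deg/sec

def response(v, phi, v_p, phi_p):
    """Single-neuron tuning: gaussian in log speed ratio times a
    rectified cosine of direction difference."""
    return (np.exp(-np.log2(v / v_p) ** 2)
            * np.maximum(np.cos(phi - phi_p), 0.0))

def decode(v, phi):
    """Population estimate of the flow vector: activities weighted by
    each neuron's preferred velocity vector, normalized by total activity
    (the normalization is an assumption)."""
    est, total = np.zeros(2), 0.0
    for v_p in v_prefs:
        for phi_p in phi_prefs:
            a = response(v, phi, v_p, phi_p)
            est += a * v_p * np.array([np.cos(phi_p), np.sin(phi_p)])
            total += a
    return est / total

v_hat = decode(3.0, np.pi / 2)    # a 3 deg/sec upward motion
```

The decoded direction is exact for stimuli along one of the four preferred directions; the log-spaced speed lattice makes the decoded speed only approximate, which is consistent with a coarse population code.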
Such an averaging over the response distributions has several properties that approximate the spatial integration performed by MT cells. Similar to MT cells (Britten et al. 1993), the response of an individual neuron is a monotonic function of the amount of correlated motion inside its receptive field. Responses are maximal for 100% correlated motion in the preferred direction and reduce to a medium level for 0% correlated motion. Further response reduction is obtained for correlated motion in the null direction. Transparent motion in the preferred and null directions elicits a response of 50% of the maximum response, also similar to MT cells (Snowden et al. 1991). The area of spatial integration for a specific neuron is given by the size of its receptive field. In the visual motion pathway of primates, receptive field sizes increase from V1 to MT to MST. Within each area, receptive field size is a function of the retinal eccentricity ε of the receptive field center, and usually increases toward the periphery of the visual field. In MT, the average size of the receptive field and its dependence on eccentricity are empirically described by Albright and Desimone (1987):
RF size = 1.04 deg + 0.61 ε    (4.6)
The increase of the receptive field size with retinal eccentricity is also a useful property for the reduction of noise in optic flow fields that arise from self-motion. Typically, during self-motion the singular point of the optic flow field is near the center of the visual field (Lappe and Rauschecker 1995). Therefore, the center of the visual field contains many different local motion directions that are important for the analysis of the flow field. In contrast, in the periphery the flow becomes more lamellar, allowing spatial averaging over a larger scale without losing too much information about the local motion directions. 4.3 Disparity Dependence of the Spatial Integration within the Receptive Field. During self-motion, the optical velocity of a visible object depends on the distance of the object from the observer. Objects close to each other in space generate similar visual motion. Objects separated in depth result in different optical velocities. This ”motion parallax” affects only the component of the optic flow field that is due to the translation of the observer, not the component due to rotation of the observer’s eye or head. It provides a major cue for the visual system to differentiate both components and to correctly perceive the direction of self-motion (Warren and Hannon 1990). Averaging motion signals from different depths removes this very important cue. The next step therefore involves simulating the specific disparity dependence of the MT responses. To this end, the spatial averaging within the receptive field is weighted by disparity. In its simplest form, this weighting can be implemented as a cutoff at an upper disparity limit D. For each flow vector inside the receptive field, the disparity is compared to the disparity of the flow vector in the receptive field center, which
serves as a reference value or a preferred disparity of the cell. Then, if the disparity difference δ is less than D, the motion signal of this flow vector contributes to the spatial averaging; otherwise it is excluded from the calculation of the neuron's response. Thus, when adding disparity information to the representation of the flow field in MT, the response of a single neuron is given by
s̄_i = (1/N′_i) Σ_{j∈R_i, |δ_j|<D} s_i(v_j, φ_j)    (4.7)

instead of equation 4.5, where δ_j is the difference between the disparity of flow vector j and the disparity in the receptive field center, and N′_i is the number of flow vectors that satisfy both criteria. The choice of a reference value for each neuron can be effectively interpreted as a winner-take-all selection within an ensemble of neurons with identical receptive fields but different preferred disparities. In this view, sets of disparity- and velocity-tuned neurons determine estimates of average velocity within defined disparity bands. Then a selection mechanism identifies the disparity band that corresponds to the disparity in the receptive field center. The averaged velocity signal of this disparity band is transmitted to the subsequent heading detection stage. A more elaborate population encoding could use the activities in several disparity bands to determine an estimate of the disparity itself, similar to the encoding of speed and direction of motion. However, the actual value of the disparity is not explicitly used in the heading detection scheme. Thus, for the purpose of the present work the simple disparity weighting seems sufficient. Figure 1 summarizes the structure of the assumed representation of velocity in MT. The above procedure results in an improved representation of the flow field in the presence of noise (Fig. 2). The following section explores the consequences of this representation for the heading detection system.

5 Results
The network was tested with simulations of the psychophysical experiments by van den Berg and Brenner (1994b). Self-motion with respect to a three-dimensional cloud of random dots or a ground plane was simulated. The flow fields contained additional eye rotation appropriate to track a point in the environment (Lappe and Rauschecker 1995). Noise was added to the flow field in the following way: each flow vector was disturbed by a noise vector, the direction of which was taken at random from a uniform distribution over the interval [0, 2π]. The magnitude of the noise vector was proportional to the magnitude of the flow vector. The proportionality constant defined the signal-to-noise ratio SNR. When such flow fields were presented to human subjects, either stereoscopically, preserving the three-dimensional layout of the scene, or synoptically, without stereoscopic depth information, van den Berg and Brenner found that the results in the heading detection task depended on
[Figure 1 schematic: flow field → MT layer → heading detection stage (MST)]
Figure 1: Schematic representation of the flow field computations assumed to take place in areas MT and MST of monkey extrastriate cortex. The optic flow field consists of the motion vectors of visible points located at various distances from the observer. The MT layer is a retinotopic map of visual motion. Each map position contains ensembles of neurons with different preferred velocities and preferred disparities. Each neuron averages motion from within a restricted spatial receptive field and a restricted disparity band. Receptive field sizes grow with the eccentricity of the receptive field center. The averaged motion signal from the disparity band that corresponds to the disparity of the flow vector in the receptive field center is then fed into a biologically plausible heading detection scheme presumably located within area MST. Details of the heading detection stage can be found in Lappe and Rauschecker (1993).
scene geometry. For the cloud, heading errors increased with decreasing SNR, but were always much lower in the stereoscopic than in the synoptic condition. For the ground plane, little difference between the two presentation conditions was observed, and only a modest variation with SNR occurred.
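The noise construction described above can be written down directly. This is a sketch: the flow field is a random stand-in, and taking the proportionality constant to be 1/SNR is how the text is read here:

```python
import numpy as np

def add_flow_noise(flow, snr, rng):
    """Disturb each flow vector by a noise vector whose direction is
    drawn uniformly from [0, 2*pi] and whose magnitude is the flow
    vector's magnitude divided by the signal-to-noise ratio SNR."""
    mags = np.linalg.norm(flow, axis=1) / snr
    angles = rng.uniform(0.0, 2.0 * np.pi, size=len(flow))
    noise = mags[:, None] * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return flow + noise

rng = np.random.default_rng(4)
flow = rng.normal(size=(100, 2))        # stand-in flow field
noisy = add_flow_noise(flow, snr=2.0, rng=rng)
```

Because the noise magnitude scales with the local flow magnitude, fast and slow regions of the flow field are corrupted by the same relative amount, which is what makes SNR a meaningful single parameter.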
Figure 2: Robust representation of the flow field in the presence of noise by MT neurons. Top left: Optic flow field for movement through a cloud of random dots (see Section 5 for parameters). Because of the random depth distribution, there is no apparent global structure, but the flow field is noise-free. Top right: The same flow field with added noise (SNR = 2). Bottom left: Effect of spatial averaging of the flow field disregarding disparity. This heavily smoothed flow field lacks much of the original motion parallax information. Bottom right: Representation in MT when the disparity dependence is accounted for. Spatial averaging within a restricted depth range results in significant noise reduction.
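The disparity-gated averaging illustrated in Figure 2 (equations 4.6 and 4.7) reduces, for a single neuron, to the following sketch. The cutoff D = 0.4 deg follows the text; the two-surface transparent scene and all other numbers are invented for illustration:

```python
import numpy as np

def rf_radius(ecc):
    """Average MT receptive-field size vs. eccentricity (equation 4.6)."""
    return 1.04 + 0.61 * ecc

def gated_average(center, positions, flows, disparities, D=0.4):
    """Average the flow vectors inside the receptive field whose disparity
    lies within D of the reference disparity at the field center (eq. 4.7)."""
    dist = np.linalg.norm(positions - center, axis=1)
    in_rf = dist <= rf_radius(np.linalg.norm(center))
    ref = disparities[np.argmin(dist)]      # disparity at the RF center
    keep = in_rf & (np.abs(disparities - ref) < D)
    return flows[keep].mean(axis=0)

# Two transparent surfaces at different disparities, moving oppositely;
# the gate keeps only the surface whose disparity matches the center.
rng = np.random.default_rng(3)
pos = rng.uniform(-3.0, 3.0, size=(200, 2))
disp = np.where(np.arange(200) < 100, 0.0, 1.0)          # near vs. far
flow = np.where(disp[:, None] == 0.0, [1.0, 0.0], [-1.0, 0.0])
avg = gated_average(np.zeros(2), pos, flow, disp)
# avg is (+1, 0) or (-1, 0), never a washed-out mixture near (0, 0)
```

Without the disparity gate the two surfaces would cancel in the average, which is the smoothing failure shown in the bottom-left panel of Figure 2.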
The stereoscopic and synoptic conditions were recreated in model simulations. For the stereoscopic condition, the disparity range over which an individual neuron spatially averaged the motion signals within its receptive field was set to an intermediate value of D = 0.4 deg, which is the range suggested by the physiological data. For the synoptic condition, the spatial averaging was performed for all motion vectors regardless of disparity. In accordance with the psychophysical experiments, visual
Integration of Motion in Area MT of Monkey
1457
field size was set to 54 by 54 deg. Simulated translational speed was 1.5 m/sec. For the cloud stimulus, eye rotation appropriate to track a point 8 m away from the observer was simulated. Depth of the cloud ranged from 2 to 20 m. For the ground plane, eye rotation was appropriate to track a point on the plane, the depth of which depended on the simulated heading and on the simulated eye level (0.65 m). As in the experimental conditions of van den Berg and Brenner, only the horizontal component of the computed self-motion was evaluated. The input layer of the network consisted of 18,432 neurons arranged on a 24 by 24 grid. Each grid position contained 32 neurons, 8 x 4 speed and direction preferences. The output layer consisted of 14,440 neurons arranged on a 19 by 19 grid of heading directions covering the central 40 by 40 deg of the visual field. Each grid position contained 40 neurons forming a population encoding of the direction of heading (see Lappe and Rauschecker 1993 for details). The results of the simulation are shown in Figure 3. Each data point is the average of 100 simulation runs. Similar to the results of van den Berg and Brenner, the robustness of the model depends on both the simulated viewing condition and the geometry of the scene. For the ground plane, mean errors for both viewing conditions are roughly equivalent, down to an SNR of 2. Errors vary moderately as SNR decreases. The results are different for the cloud. There, the errors depend strongly on SNR, but in all cases the errors in the stereoscopic condition are much lower than in the synoptic condition. Also similar to the results of van den Berg and Brenner, the errors in the simulated stereoscopic condition are similar for both environments down to an SNR of 2. Taken together, the results show that the model draws on stereoscopic information in the case where it is most needed, namely for movement in a cluttered, noisy environment.
These results suggest that the lack of differences in the responses of the human subjects in the ground plane condition stems from the smooth depth variations in this stimulus. In the cloud condition, dots within the receptive field of any given cell could show very large disparity differences. In the ground plane condition many dots within a receptive field have similar disparities, because the distances change smoothly from one point on the ground plane to the next. In this case, limiting the spatial integration to a certain disparity range has little effect on the choice of motion signals that contribute.
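The disparity-gated pooling described above can be illustrated with a small numeric sketch. Everything here is assumed for illustration (the linear dependence of flow on disparity, the noise level, the dot counts); only the 0.4 deg disparity band is taken from the model:

```python
import random

random.seed(0)

# Illustrative flow field: each dot's motion depends on its disparity
# (an assumed linear parallax law), plus strong additive noise.
n_dots = 500
disp = [random.uniform(0.0, 2.0) for _ in range(n_dots)]    # disparities (deg)
noisy = [1.0 + d + random.gauss(0.0, 0.5) for d in disp]    # flow + noise

def pooled(center, band=0.4):
    """Average motion over dots inside the 0.4 deg disparity band."""
    vals = [f for f, d in zip(noisy, disp) if abs(d - center) < band / 2]
    return sum(vals) / len(vals)

synoptic = sum(noisy) / len(noisy)   # averaging that disregards disparity
stereo = pooled(center=1.8)          # averaging within the disparity band
```

In this toy example the band-restricted average stays near the true local flow at that depth, while the synoptic average collapses toward the global mean and so discards the motion parallax information, mirroring the behavior in Figure 2.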
6 Discussion
A simple model of the functional integration of motion and stereopsis for the task of visual heading detection was presented. This model incorporates many important features of neurons in the visual motion pathway of primates. In simulations, it reproduces the psychophysically observed dependencies of the human heading detection system. Stereo-
Figure 3: Mean heading errors as a function of the signal-to-noise ratio (SNR) for simulated stereoscopic and synoptic flow fields. For movements through a three-dimensional cloud of dots (upper panel), the model is much more robust against noise in the simulated stereoscopic condition. For movements over a ground plane (lower panel) there is little difference between the two conditions. Filled circles are simulation results for the stereoscopic condition. Filled triangles are simulation results for the synoptic condition. Human data for the corresponding situations are shown by open symbols. Open circles and open triangles in the cloud condition show the results of three individual subjects of van den Berg and Brenner (1994b) in the stereoscopic and synoptic conditions, respectively. For the plane condition, only data for two of these subjects in the synoptic case were available from an earlier paper (van den Berg and Brenner 1994a). These data are shown by open triangles.
scopic depth information is used to support the two-dimensional motion information and to reduce flow field noise. This approach results in an increased robustness for movements in a cluttered environment. Consistent with the human data, there is no advantage to using stereo in smooth environments such as the ground plane. The use of disparity information in the model is an implicit one. Depth does not directly contribute to the computations involved in determining the direction of heading. Rather, it is used only to enhance the representation of the flow field. This is in line with the psychophysical results, since the experiments by van den Berg and Brenner (1994b) provide evidence that it is not the motion in depth but the relative depth of the scene that is used by human subjects. The model is concerned with the features of the flow field representation in area MT used as input to a later heading detection stage. The simulation results show that these features can have a profound influence on the performance of the whole system. In modeling the input representation more closely, one can thus expect to gain more insights into the functioning of the system. It needs to be emphasized that the presented network is not intended as a model of how the observed features of MT neurons are generated. Within the scope of this work, only the functional properties of MT neurons have been used. The question of how the transparent motion detection and disparity tuning of MT neurons can be achieved given the sensory inputs from the retina is a different problem that has already received much consideration on its own (Qian 1994; Wilson and Kim 1994; Nowlan and Sejnowski 1995). However, the presented work shows that the functional consequences of these properties are consistent with the requirements of the human heading detection system. This complements a number of other features of area MT which are also beneficial for the representation of self-motion-induced optic flow fields.
These features include the increase of preferred speed and receptive field size with retinal eccentricity and the predominance of centrifugal direction preferences in the peripheral visual field (Lappe and Rauschecker 1995). For future research, this poses the interesting question of whether even more visual modalities, such as color, or higher level non-Fourier motion signals, could also provide additional supportive information for this task (Braddick 1995).
References

Albright, T. D., and Desimone, R. 1987. Local precision of visuotopic organization in the middle temporal area (MT) of the macaque. Exp. Brain Res. 65, 582-592.

Ballard, D. H., and Kimball, O. A. 1983. Rigid body motion from depth and optical flow. Comp. Vis. Graph. Image Proc. 22, 95-115.
Braddick, O. 1995. Visual perception: Seeing motion signals in noise. Curr. Biol. 5, 7-9.

Bradley, D., Qian, N., and Andersen, R. 1995. Integration of motion and stereopsis in middle temporal cortical area of macaques. Nature (London) 373, 609-611.

Britten, K. H., Shadlen, M. S., Newsome, W. T., and Movshon, J. A. 1993. Responses of neurons in macaque MT to stochastic motion signals. Vis. Neurosci. 10, 1157-1169.

Clark, J. J., and Yuille, A. L. 1990. Data Fusion for Sensory Information Processing Systems. Kluwer, Boston, MA.

Gibson, J. J. 1950. The Perception of the Visual World. Houghton Mifflin, Boston.

Heeger, D. J., and Jepson, A. 1992. Subspace methods for recovering rigid motion I: Algorithm and implementation. Int. J. Comput. Vision 7(2), 95-117.

Koenderink, J. J., and van Doorn, A. J. 1987. Facts on optic flow. Biol. Cybern. 56, 247-254.

Lappe, M., and Rauschecker, J. P. 1993. A neural network for the processing of optic flow from ego-motion in higher mammals. Neural Comp. 5, 374-391.

Lappe, M., and Rauschecker, J. P. 1995. Motion anisotropies and heading detection. Biol. Cybern. 72, 261-277.

Lappe, M., Bremmer, F., and Hoffmann, K.-P. 1994. How to use non-visual information for optic flow processing in monkey visual cortical area MSTd. In ICANN 94 - Proceedings of the International Conference on Artificial Neural Networks, M. Marinaro and P. G. Morasso, eds., pp. 46-49. Springer, Berlin.

Maunsell, J. H. R., and Van Essen, D. C. 1983a. Functional properties of neurons in middle temporal visual area of the macaque monkey. II. Binocular interactions and sensitivity to binocular disparity. J. Neurophysiol. 49(5), 1148-1167.

Maunsell, J. H. R., and Van Essen, D. C. 1983b. Functional properties of neurons in middle temporal visual area of the macaque monkey. I. Selectivity for stimulus direction, speed, and orientation. J. Neurophysiol. 49(5), 1127-1147.

Nowlan, S., and Sejnowski, T. 1995. A selection model for motion processing in area MT of primates. J. Neurosci. 15, 1195-1214.

Qian, N. 1994. Computing stereo disparity and motion with known binocular cell properties. Neural Comp. 6, 390-404.

Rodman, H. R., and Albright, T. D. 1987. Coding of visual stimulus velocity in area MT of the macaque. Vis. Res. 27(12), 2035-2048.

Snowden, R. J., Treue, S., Erickson, R., and Andersen, R. A. 1991. The response of area MT and V1 neurons to transparent motion. J. Neurosci. 11(9), 2768-2785.

van den Berg, A. V. 1993. Perception of heading. Nature (London) 365, 497-498.

van den Berg, A. V., and Brenner, E. 1994a. Humans combine the optic flow with static depth cues for robust perception of heading. Vis. Res. 34, 2153-2167.

van den Berg, A. V., and Brenner, E. 1994b. Why two eyes are better than one for judgements of heading. Nature (London) 371, 700-702.
Warren, W. H., Jr., and Hannon, D. J. 1990. Eye movements and optical flow. J. Opt. Soc. Am. A 7(1), 160-169.

Wilson, H., and Kim, J. 1994. A model for motion coherence and transparency. Vis. Neurosci. 11, 1205-1220.
Received September 27, 1995; accepted March 19, 1996
Communicated by Peter Foldiak
Learning Perceptually Salient Visual Parameters Using Spatiotemporal Smoothness Constraints James V. Stone* School of Cognitive and Computing Sciences, University of Sussex, Sussex BN1 9QH, UK
A model is presented for unsupervised learning of low level vision tasks, such as the extraction of surface depth. A key assumption is that perceptually salient visual parameters (e.g., surface depth) vary smoothly over time. This assumption is used to derive a learning rule that maximizes the long-term variance of each unit's outputs, whilst simultaneously minimizing its short-term variance. The length of the half-life associated with each of these variances is not critical to the success of the algorithm. The learning rule involves a linear combination of anti-Hebbian and Hebbian weight changes, over short and long time scales, respectively. This maximizes the information throughput with respect to low-frequency parameters implicit in the input sequence. The model is used to learn stereo disparity from temporal sequences of random-dot and gray-level stereograms containing synthetically generated subpixel disparities. The presence of temporal discontinuities in disparity does not prevent learning or generalization to previously unseen image sequences. The implications of this class of unsupervised methods for learning in perceptual systems are discussed. 1 Introduction
The ability to learn perceptually salient visual parameters - surface orientation, curvature, depth, texture, and motion - is a prerequisite for the more familiar tasks (e.g., object recognition, catching prey) associated with biological vision. This paper addresses the question: What strategies enable neurons to learn these parameters from a spatiotemporal sequence of images, without the aid of an external teacher? According to Gibson (1979), the problem of vision consists of obtaining invariant structure from continually changing sensations. Whereas Gibson's intuitively appealing approach has been well received by perceptual psychologists, the lack of a detailed theory has ensured that this approach has received little empirical vindication from computer vision.

*Present address: Psychology Building, Western Bank, University of Sheffield, Sheffield S10 2lT, UK.
Neural Computation 8, 1463-1492 (1996) © 1996 Massachusetts Institute of Technology
1464
James V. Stone
However, the potential of Gibson's ideas has recently begun to be realized as a series of connectionist models (Phillips et al. 1995; Becker and Hinton 1992; Becker 1992; Zemel and Hinton 1991; Foldiak 1991; Schraudolph and Sejnowski 1991; Mitchison 1991). In particular, the IMAX models (Becker and Hinton 1992; Zemel and Hinton 1991; Becker 1992, 1996) have been instrumental in drawing attention to the possibility of learning perceptually salient parameters using unsupervised learning methods. The model described in this paper is different from these models, although it shares with them a common assumption: models of perceptual processes can be derived from an analysis of the types of spatial and temporal changes immanent in the structure of the physical world. For example, a learning mechanism might discover perceptually salient visual parameters by taking advantage of quite general properties (such as spatial and temporal smoothness) of the physical world. These properties are not peculiar to any single physical environment so that such a mechanism should be able to extract a variety of perceptually salient parameters (e.g., three-dimensional orientation and shape) via different sensory modalities (vision, speech, touch), and in a range of physical environments. 2 Unsupervised Learning of Visual Parameters
Learning in artificial neural networks consists of two broad classes, supervised and unsupervised. Supervised learning requires access to a vector-valued error signal (as in backpropagation) and therefore may not be considered as biologically plausible [although reinforcement learning (Sutton 1988) via scalar-valued error signals is clearly more realistic]. Unsupervised learning methods perform a type of cluster analysis on a given set of inputs. However, almost all unsupervised methods form clusters on the basis of only the low order statistics of their inputs. Accordingly, in the absence of hand-crafted architectures (Linsker 1988), networks that perform unsupervised learning tend to "discover" parameters that are linear functions of their inputs (e.g., Oja 1982). The data compression of inputs performed by linear systems reduces the redundancy of the transformed input data. While such a process might be considered to be desirable during the early stages of perceptual processing (Barlow 1972), it is not obvious how such a mechanism could give rise to the complex response characteristics typical of neurons in visual area V2 and in the inferotemporal cortex. These "high order" neurons have outputs that respond selectively to disparity (V2) (Poggio et al. 1985), facial expression (Heywood and Cowey 1992), and even "moving light displays"2 (Perrett et al. 1990) of human walkers, which are defined

2 Typically, a light is attached to each major joint of a moving person in a darkened room, so that only the motion of the joints is visible (Johansson 1973).
Learning Visual Parameters Using Smoothness Constraints
1465
principally in terms of their spatiotemporal characteristics (Perrett et al. 1990). The response properties of such neurons cannot be modeled using linear systems, unless the input representation is hand-crafted to ensure that inputs are linearly separable. While such an approach may be fruitful for restricted domains, it is unlikely to yield solutions of general utility. Both Mitchison (1991) and Schraudolph and Sejnowski (1991) make use of an anti-Hebbian learning rule in an explicit attempt to minimize the variance of the outputs of units. The authors aim to discover invariant properties of the input data. Mitchison's model is linear, which restricts it to computing linear functions of its inputs. The network described in Schraudolph and Sejnowski (1991) is nonlinear, and appears to benefit from hierarchical learning through successive layers of units. However, the methods described in Mitchison (1991) and Schraudolph and Sejnowski (1991) both require weight normalisation. In Foldiak (1991) and Barrow and Bray (1992), a temporal trace mechanism is used with a Hebbian rule to discover temporally related inputs. Each unit becomes sensitive to lines at a particular orientation, but the precise position of these lines is ignored. However, as pointed out by Becker (1992, p. 363), the input representation used in Foldiak (1991) ensures that lines at different orientations are linearly separable sets of vectors. This is also true of the work described in Barrow and Bray (1992). In Becker and Hinton (1992) the IMAX method was introduced. IMAX models work by attempting to maximize the mutual information between different output units (Becker and Hinton 1992; Zemel and Hinton 1991), or between outputs of each of a number of units over successive time steps (Becker 1992). However, these models suffer from several drawbacks. The IMAX merit function has a high proportion of poor local optima.
In Becker and Hinton (1992) this problem was ameliorated by using a hand-crafted weight-sharing architecture. In Becker (1992) the ”tendency to become trapped in poor local optima” (p. 367) was addressed by introducing a user-defined regularization parameter to prevent weights from becoming too large. The IMAX models require that the training data are presented to the network twice per weight update, whereas a biologically plausible model should only use quantities that can be computed on-line. The computationally expensive weight update process used by IMAX requires large amounts of processing time, which, in Zemel and Hinton (1991), increases with the cube of the number of independent visual parameters (e.g., size, position, orientation) implicit in the input data. More recently, Phillips et al. (1992) emphasized how different sensory modalities, or streams within modalities, often signal aspects of the input that are correlated (e.g., the sound and sight of a word being spoken). Using an unsupervised algorithm, they demonstrate that the correlations between different streams of synthetic data can be used to discover the
features that are related across streams as well as discovering the relations between them.

Figure 1: Schematic diagram of how the surface moves over time. Surface depth varies continuously as the surface translates at a constant velocity parallel to the image plane.

3 Learning Using Spatiotemporal Constraints
The input to a perceptual system can be characterized as a vector in a high-dimensional space. In humans, there are approximately 10^6 fibers in the optic nerve, so that the retinal image can be considered as a vector with 10^6 components. As the retinal image changes over time, this image vector describes a trajectory through the high-dimensional space. However, this vector is generated by events in the physical world, and can therefore be described in terms of a small number of parameters (e.g., surface depth). It is these parameters that are useful to an organism. A large part of the problem of perception consists of extracting these physical parameters implicit in the changing image vector. Consider a sequence of images of an oriented, planar, textured surface that is moving relative to a fixed camera (see Fig. 1). Between two consecutive image frames the distance to the surface changes by a small amount. Simultaneous with this small change in surface depth, a relatively large change in the intensity of individual pixels can occur. For
example, a one-pixel shift in camera position can dramatically alter the intensity of individual image pixels, yet the corresponding change in the depth of the imaged surface is usually small. Thus, for a moving camera or surface, there is a difference between the rate of change of the intensity of individual pixels and the corresponding rate of change of parameters associated with the imaged surface. A perceptually salient parameter is therefore characterized by variability over time, but the rate of change of the parameter is usually small, relative to that of the intensity of individual image pixels. More importantly, a sequence of images defines an ordered set in which neighboring images are derived from similar physical scenarios, and the temporal proximity of images provides a temporal binding of parameter values. It is this temporal binding that permits us to legitimately cluster temporally proximal images together, and thereby discover which invariances they share. It is possible to constrain the learning process such that a sequence of outputs is forced to possess the general characteristics of perceptually salient parameters (e.g., temporal smoothness). This can be achieved without specifying the required output value for any given input. Using an unsupervised method, the value of each output is essentially arbitrary, but each output is uniquely and systematically associated with a particular input parameter value. An "economical" way for a model to generate such a set of outputs is to adapt its connection weights so that the outputs specify some perceptually salient parameter implicit in the model's inputs. That is, it is possible to place quite general constraints (e.g., temporal smoothness) on the outputs, such that the "easiest" way for a model to satisfy these constraints is to compute the value of a perceptually salient parameter.
Such constraints do not determine which particular parameter should be learned, nor the output value for any given input. Instead, these constraints specify only that particular types of relations must hold between successive outputs. Finally, although a learning method may be based on an assumption of temporal smoothness, violations of the smoothness assumption can be tolerated without compromising learning of perceptual parameters (see Experiment 2.3). 4 The Learning Method
A model that uses a type of temporal smoothness constraint can be made to learn visual parameters. The degree of smoothness of the output or state of a model unit can be measured in terms of the “temporally local,” or short-term, variance associated with a sequence of output values. A sequence of states defines a curve that is maximally smooth if the variance of this curve is minimal (the straighter the curve, the smoother the output). However, minimizing only the short-term variance has a trivial solution. This consists of setting all model weights to zero, generating a horizontal output curve. This is consistent with one characteristic
(smoothness) of perceptually salient parameters, but it is not very useful. Moreover, it does not conform to the other characteristic, variability over time. The output can be made to reflect both smoothness and variability by forcing it to have a small short-term variance, and a large long-term variance. Accordingly, the variance of the output over small periods should be small, relative to its variance over longer periods. The general strategy just described can be implemented using a multilayer model. Units in the input, hidden, and output layers are labeled i, j, and k, respectively. Input and output layers have linear units, and the hidden layer has tanh units. The state of an output unit at each time t is z_{kt} = \sum_j w_{jk} z_{jt}, where w_{jk} is the value of a weighted connection from the jth hidden unit to the kth output unit, and z_{jt} is the state of the jth hidden unit. The desired behaviour in z_k can be obtained by altering interunit connection weights such that z_k has a large long-term variance V, and a small short-term variance U. These requirements can be embodied in a merit function (the k subscripts have been omitted here):
F = \log \frac{V}{U} = \log \frac{\frac{1}{2} \sum_t (\bar{z}_t - z_t)^2}{\frac{1}{2} \sum_t (\tilde{z}_t - z_t)^2}    (4.1)

The factors of 1/2 are formally redundant, but have been introduced to simplify the derivatives of U and V. The cumulative states \tilde{z}_t and \bar{z}_t are both exponentially weighted sums of states z:

\tilde{z}_t = \lambda_S \tilde{z}_{t-1} + (1 - \lambda_S) z_{t-1},    0 \le \lambda_S \le 1    (4.2)

\bar{z}_t = \lambda_L \bar{z}_{t-1} + (1 - \lambda_L) z_{t-1},    0 \le \lambda_L \le 1    (4.3)
where \lambda_L and \lambda_S are time decay constants. The half-life h_L of \lambda_L is much longer (typically 100 times longer) than the corresponding half-life h_S of \lambda_S. Note that 4.1 is invariant with respect to the magnitude of z, and therefore with respect to the magnitude of the weights. Therefore, no weight normalization is required. In the experiments described below, the pattern of weights altered during learning, but variation in the magnitude of weights was relatively small. The derivative of F with respect to output weights results in a learning rule that is a linear combination of Hebbian and anti-Hebbian weight update, over long and short time scales, respectively.3 Given a hidden unit with output state z_{jt}, short-term mean \tilde{z}_{jt}, and long-term mean \bar{z}_{jt}, which projects to an output unit with state z_{kt}, short-term mean \tilde{z}_{kt}, and long-term mean \bar{z}_{kt}:4

\frac{\partial F}{\partial w_{jk}} = \frac{1}{V} \sum_t (\bar{z}_{jt} - z_{jt})(\bar{z}_{kt} - z_{kt}) - \frac{1}{U} \sum_t (\tilde{z}_{jt} - z_{jt})(\tilde{z}_{kt} - z_{kt})    (4.4)
3 Thanks to Harry Barrow and Alistair Bray for pointing this out.
4 For hidden unit weights, additional terms resulting from the tanh hidden unit activation function are required (see Appendix A).
The Hebbian and anti-Hebbian components are given by the first and second terms (respectively) on the right-hand side of 4.4. The pre- and postsynaptic means used in conventional Hebbian learning rules (e.g., Sejnowski 1977) have been replaced by the exponentially weighted means \bar{z}_{jt} and \bar{z}_{kt} (respectively) in the Hebbian part of 4.4, and by \tilde{z}_{jt} and \tilde{z}_{kt} in the anti-Hebbian part of 4.4. In contrast, the rule described in Bienenstock et al. (1982) uses the exponentially weighted mean of only the postsynaptic output to modulate learning, and this learning is either Hebbian or anti-Hebbian, depending on the state of the postsynaptic unit. In summary, the rule defined in 4.4 uses the exponentially weighted mean of both the pre- and postsynaptic states to modulate both the Hebbian and anti-Hebbian learning applied to every weight. The learning algorithm consists of computing the derivative of F with respect to every weight in the model to locate a maximum in F. The derivatives of F with respect to weights between the input and hidden unit layers are required. These derivatives are computed using the chain rule (but not the learning method) described in Rumelhart et al. (1986). The cumulative result of these computations is used to alter weights only after the entire sequence of inputs has been presented. However, storage requirements are minimal because all quantities can be computed incrementally (see Appendix A). This can also permit weights to be updated after the presentation of each input, as demonstrated in Stone and Bray (1995). A conjugate gradient method (Williams 1991) was used to maximize F.5 Each iteration, or line search, involves quadratic interpolation in a given conjugate search direction, so that each line search requires the derivative of F at two points along the current search direction. Note that, whereas weight changes depend on the history of inputs to a unit, a unit's state z is a function only of the current input.
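A minimal sketch of the output-weight gradient of 4.4 follows; the trace helper and the toy sequence lengths are assumptions for illustration, and hidden-unit weights would additionally need the tanh terms of Appendix A:

```python
import random

def running_mean(seq, lam):
    """Exponentially weighted mean trace of a sequence (eqs. 4.2/4.3)."""
    out = [seq[0]]
    for t in range(1, len(seq)):
        out.append(lam * out[-1] + (1 - lam) * seq[t - 1])
    return out

def grad_F(z_j, z_k, lam_s=0.5 ** (1 / 32), lam_l=0.5 ** (1 / 3200)):
    """dF/dw_jk as in eq. 4.4: a long-term Hebbian correlation minus a
    short-term anti-Hebbian one, each normalized by its variance."""
    zjs, zjl = running_mean(z_j, lam_s), running_mean(z_j, lam_l)
    zks, zkl = running_mean(z_k, lam_s), running_mean(z_k, lam_l)
    U = 0.5 * sum((a - b) ** 2 for a, b in zip(zks, z_k))
    V = 0.5 * sum((a - b) ** 2 for a, b in zip(zkl, z_k))
    hebb = sum((a - b) * (c - d) for a, b, c, d in zip(zjl, z_j, zkl, z_k))
    anti = sum((a - b) * (c - d) for a, b, c, d in zip(zjs, z_j, zks, z_k))
    return hebb / V - anti / U

# Toy pre- and postsynaptic state sequences.
random.seed(0)
z_j = [random.gauss(0.0, 1.0) for _ in range(200)]
z_k = [random.gauss(0.0, 1.0) for _ in range(200)]
g = grad_F(z_j, z_k)
```

Because both terms use mean-corrected pre- and postsynaptic states, the update is the linear combination of Hebbian and anti-Hebbian learning described in the text.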
An information-theoretic interpretation of F is given in Appendix B. 5 Experiments 5.1 Model Architecture. The model consists of three layers of units, as shown in Figure 2. Units in each layer are connected to every unit in the next layer. The first layer consists of 20 linear input units, arranged in two rows. The hidden layer consists of a fixed number of between 3 and 10 units. The state of a unit in the hidden layer is z = tanh(x), where x is the total input to a hidden layer unit from units in the input layer. The input to the jth hidden unit is x_j = \sum_i (w_{ij} z_i + \theta_j), where w_{ij} is the value of a weighted connection from the ith input unit to the jth hidden unit, and z_i is the state of the ith input unit. All and only units in the hidden layer

5 In results not reported here, an iterative weight update rule that took steps of size \eta \partial F / \partial w (where \eta was adjusted automatically) was about 10 times slower than the conjugate gradient method.
Figure 2: Network architecture. The network has 20 input units, arranged in two rows of 10, between 3 and 10 tnnh hidden units, and one linear output unit. At each time step, a pair of one-dimensional stereo images is presented at the input layer. have a bias weight 0 from a unit with constant output of 1. These bias weights are adapted in the same way as all other weights. The output layer consists of a single linear unit, as described in the previous section. 5.2 Experiment Series 1: Random Dot Stereograms. 5.2.1 lnput Data. The input data consisted of a temporal sequence of one-dimensional random dot stereograms. The sequence of stereograms was derived by simulating the motion of a planar surface which was both translating at a fixed velocity while oscillating in depth (see Fig. 1). The image dot-density was 0.167 dots/pixel throughout the sequence of stereograms, and each stereogram was presented for one time step. The amount of linear shift or dispizrity between the left and right images of each stereo pair varied between kl image pixel. The disparity values were generated by convolving a one-dimensional circular array of 1000 random numbers with a gaussian of standard deviation of 100, and then normalizing these values to lie between f l . The one-dimensional random dot "surface" from which images were derived was constructed by placing Is randomly in an array. The graylevels of two adjacent image pixels were derived from the means of two nonoverlapping surface regions, in which each region contained 10 surface elements. Image subsampling of the surface gray-levels allowed subpixel disparities to be generated. For example, if members of a stereo
pair were derived from surface regions that overlapped by one surface element (=0.1 of a pixel width) then the images had a disparity of 0.1 pixels. The gray-level profiles of a sample of typical stereo pairs are shown in Figure 3. The smoothing effects of the optics of the eye were simulated by smoothing the image gray-levels. To save computer time, this image smoothing was achieved by smoothing the surface gray-levels (using a gaussian with a standard deviation of 10 surface elements). The set of surface gray-levels was then normalized to have zero mean and unit variance. To simulate the translation of the surface, the first image I_1 of a pair was moved along the surface by 20 surface elements (=2 image pixels) at each time step, and the gray-levels of the surface were read into the image array (with 10 surface elements to each image pixel). The second image I_2 of a pair was aligned with I_1, and then shifted along the surface according to the current disparity value, which varied between ±1 pixel (equivalent to ±10 surface elements, see above). 5.2.2 Network Parameter Values. The half-life h_S of \lambda_S in 4.2 was set to 32 time steps. In this set of experiments, the long-term variance was set to the variance of z, so that \bar{z} was equal to the mean state of a unit. This produced results which are not noticeably different from those of the method described above, which was implemented for the second series of experiments (see below). As stated previously, each of the 1000 stereo pairs had a disparity that was determined by a circular array of 1000 smoothly varying disparity values. This circular array permitted disparity values to be computed for t < 1. The initial weights were set to random values between ±0.3. 5.3 Experiment 1.1: Discovering Stereo Disparity. The following results were obtained by maximizing F using a conjugate gradient method (Williams 1991). The model was tested on stereo pairs of images (see Figs. 1, 2, and 3).
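The smoothly varying disparity sequence just described can be sketched directly. The circular gaussian convolution is implemented by brute force for clarity, and the kernel truncation at ±3 sigma is an implementation assumption:

```python
import math
import random

random.seed(0)

N, SIGMA = 1000, 100.0
raw = [random.gauss(0.0, 1.0) for _ in range(N)]

# Gaussian kernel, truncated at +/-3 sigma, applied circularly.
half = int(3 * SIGMA)
kernel = [math.exp(-0.5 * (i / SIGMA) ** 2) for i in range(-half, half + 1)]

smoothed = [sum(k * raw[(t + i - half) % N] for i, k in enumerate(kernel))
            for t in range(N)]

# Normalize so disparities lie between -1 and +1 pixels.
peak = max(abs(v) for v in smoothed)
disparity = [v / peak for v in smoothed]
```

Because the array is circular, the sequence also wraps smoothly, so successive disparities differ only slightly; this is the slow temporal variation the learning rule exploits.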
5.3.1 Results. The number of hidden units was reduced between different runs to discover the minimum number required to solve the task. If between 4 and 10 hidden units were used then the model always succeeded within 100 conjugate gradient line searches, with |r| > 0.9, where r is the correlation between the output unit state z and stereo disparity. Using fewer than four hidden units did not always result in convergence within 100 line searches (see Experiments 1.5 and 2.2 for convergence results).

5.4 Experiment 1.2: Hidden Unit Weight Vectors. A network with three hidden units was used for further analysis. After 100 conjugate
James V. Stone
Figure 3: Examples of random-dot input image pairs. Each graph shows the overlaid gray-levels of one image pair. The gray-levels of the left (- - -) and right (—) images are plotted for each of the 10 pixel positions of a stereo pair. Each image was obtained by subsampling a gray-level random-dot surface (see text).
Figure 4: Graph of learning iteration number versus V/U ratio (- - -) and correlation r (—) between output unit state z and disparity during conjugate gradient learning.

gradient iterations r = -0.943 (see Fig. 4). Output curves plotted during the learning process are given in Figure 5. The output unit state after 100 iterations was plotted against the time-varying disparity in Figure 6. Results using a surface on which every element has a random gray-level (instead of the random dots used here) are similar to those reported here (see Stone 1995). Note that the correlation r between unit outputs and disparity is negative for some of the results presented here. It is not necessary, nor even desirable, that r should be positive. If we consider the output unit as part of an integrated system that computes the values of different visual parameters then it is necessary only that a unit is able to signal to other units information regarding some aspect of the input. This can be achieved equally well with either a high magnitude negative or positive value of r. The weight vector of each of the three hidden units after 100 conjugate gradient line searches is shown in Figure 7. In each graph, the weighting given to corresponding pixels of a stereo pair is shown on a common (pixel position) x-axis. Thus, the 20-dimensional weight vector
Figure 5: Graphs of output unit state z (- - -) and z̃ (—) versus time at different learning iterations. At each of 1000 time steps the network was presented with a stereo pair of images. The disparity of these input images varied over time as depicted in Figure 6.
of each unit is shown as two overlaid 10-dimensional weight vectors. The weight values between each hidden unit and the output unit are {-2.563, 2.607, -2.796} for units labeled 1, 2, and 3, respectively, in Figures 7 and 8. For each hidden unit, the disparity of every image pair from the training set was plotted against the corresponding hidden unit state (see Fig. 8). Units 1 and 3 appear to have antisymmetric response profiles. This is consistent with their weight vector profiles.
Figure 6: Graph of output unit state z (- - -), z̃ (—), and stereo disparity versus time after 100 conjugate gradient line searches. The jagged appearance of the latter is due to the fact that the smallest change in the disparity of input image pairs is 0.1 of a pixel (see section on Input Data).

5.5 Experiment 1.3: No Hidden Units. As expected for a system that attempts to discover a nonlinear input/output mapping, learning without a hidden layer of units failed to compute disparity. With the output unit connected directly to the input layer, the correlation never exceeded |r| = 0.01 over 10 different simulations with different initial random weights.

5.6 Experiment 1.4: Generalization. If the model has learned disparity (and not some spurious correlate of disparity) then it should generalize to new stereo data sets, without any learning of these new sets. Accordingly, the network described in Experiment 1.2 with a correlation r = -0.943 was tested with test data consisting of 1000 stereo pairs. These were obtained from a new surface constructed in the same manner as was used for the original data set. During testing, the disparity varied exactly as before, but now each image pair was derived from a random point on the surface. (This tested the ability of the system to estimate disparity independently of the particular gray-level profiles in each image pair.) Using these data the correlation was r = -0.937.
Figure 7: Weight vectors of the three hidden units. Weight values projecting from left (—) and right (- - -) images of a stereo pair to one hidden unit are shown in each graph.
Figure 8: Scattergrams of hidden unit state versus stereo disparity for 1000 image pairs. Each graph shows the scattergram of a single hidden unit. (The vertical striations are an artifact of the disparity values which varied in intervals of 0.1 of a pixel.)
Figure 9: Graph of number of iterations required to exceed r = 0.5 versus half-life hS of λS.
Note that the rate at which disparity varies has no effect on this test correlation. This is because, while learning depends upon current and previous states, the state z of a unit at any time depends only upon the current input. Thus, after learning, the model can detect disparities irrespective of the rate at which disparity varies in the input.
5.7 Experiment 1.5: Effect of Half-Life. The relation between the half-life hS of λS and the rate of change of disparity clearly has implications for learning. The effect of hS on learning was tested, with 10 hidden units, using the random-dot data used in Experiment 1.1. Within limits, the value of hS was found not to be critical to the success of the learning method. The magnitude of the correlation between unit state and disparity was |r| < 0.9 only for simulations with a half-life hS < 2 time steps. A graph of hS versus the number of conjugate gradient iterations required such that |r| > 0.5 (denoted by [iter(|r| > 0.5)]) is shown in Figure 9. A graph of hS versus [iter(|r| > 0.5)]^{-2} gives a straight line with a correlation of 0.986 for hS = {2, 4, 8, 16, 32}. In all cases, additional learning resulted
in a final value of |r| > 0.9. A network with three hidden units gave qualitatively similar, but less systematic, results. Setting hS too high can increase the rate at which the function F is maximized without resulting in a high correlation r. For values of hS > 32, the function F was maximized, but a graph of z versus time yielded a bell-shaped curve. In this case, each value of z corresponded to two values of disparity. As can be seen from Figure 9, hS acts somewhat like an annealing parameter, smoothing out local maxima in F at high values of hS. This is true in general, but is more apparent if (as above) hL = ∞ so that V is the variance of z. Now, as hS → ∞, so U → V, and therefore V/U → 1, for any weight values. As in Hopfield (1984), Durbin and Willshaw (1987), and Stone (1992), at high "temperatures" (high values of hS) the energy surface defined by F is convex, and there exists a single maximum. As the temperature is reduced, the energy function becomes increasingly nonconvex. A series of decreasing temperatures is associated with a corresponding series of increasingly nonconvex energy functions. At each temperature, the maximum of the current energy function can be used as the starting point to search for the maximum of the next function (associated with a new, lower temperature). This brief sketch is consistent with results given in Figure 9, which shows an inverse square relation between hS and the rate of convergence. Annealing methods are well suited to error surfaces containing many local extrema. For the tasks tested in this paper, only a slight advantage was observed when hS was annealed in a variety of tests. This suggests that the error surface defined by F has relatively few local extrema. However, it may be that annealing hS provides significant improvements in the rate of convergence and in the ability to find deep extrema for more complex tasks.
5.8 Experiment 1.6: Hierarchical Learning. If we associate a function Fj = log(Vj/Uj) with each hidden unit uj in the hidden layer then each unit can independently and simultaneously adjust its weights to maximize its Fj. Note that Fj is defined only in terms of the states of unit uj. We can then freeze the input-hidden unit weights, and maximize F (as above) with respect to the hidden-output weights. This hierarchical learning method does not require the backpropagation of an error signal between successive unit layers. Instead, each weight of each unit uj is updated according to the derivative of Fj with respect to that weight. Neither this nor the method described above requires a conjugate gradient method, though learning is about 10 times faster if conjugate gradients are used. The correlations between the outputs of individual hidden units and disparity were (0.60, 0.44, 0.45, 0.59, -0.09, 0.47, 0.46, -0.60, 0.55, -0.40).
Typically, each unit state varied monotonically over a small range of disparities, and was almost constant outside of that range. Using the hierarchical learning method, the correlation between the single output unit state z and disparity for the test data was r = -0.74, with 10 units in the hidden layer. This was achieved after 10 conjugate gradient updates of weights between the hidden and output unit layers only. The highest value of r from the hidden units was 0.6. From a statistical perspective, this hidden unit accounts for 0.36 (= 0.6^2) of the variance in disparity. This contrasts with a value r = -0.74 of the output unit, which accounts for 0.55 (= 0.74^2) of the variance. Therefore, a linear combination of outputs of hidden units (as implemented via the addition of an output unit) accounts for about 1.5 times as much of the variance in disparity as is accounted for by any single hidden unit. The network used in these experiments did not use any form of competition between hidden units. It is possible that such a mechanism might force each unit to become sensitive to a small range of parameter values in a more efficient manner than was obtained here.

5.9 Experiment Series 2: Gray-Level Images.
5.9.1 Input Data. The input data were derived from a gray-level image (see Fig. 10), using 1000 synthetically generated disparity values (see below). The image in Figure 10 was convolved with a difference of gaussians (DoG) filter to reduce the range of spatial frequencies present in the image. This procedure also simulates the action of retinal ganglion cells with center-surround receptive fields. The ratio of gaussian standard deviations was 1.6, with the smaller gaussian having a standard deviation of 2 pixels. One image I1 of a stereo pair was copied from a 10-pixel image strip of Figure 10. The position of this strip was advanced by two pixels per time step. The position of the second image I2 of a pair was the same as I1, except for a linear shift given by the current value of disparity (subpixel shifts were obtained by linear interpolation of gray-levels in Fig. 10). Each input image was normalized to have zero mean and unit variance. The disparity varied sinusoidally over time with a period of 1000 time steps between ±1 pixels. This ensured that the value of disparity at t = 0 is equal to the disparity at t = 1000. The fact that disparity was defined by a circular array of continuously varying values was used to initialize the value of z̄k, as follows. The 1000 stereo pairs were presented to the network once, then the value of z̄k at t = 0 was set to its value at t = 1000. In contrast, it would require several complete presentations of the data set to achieve a stable value for z̄k because it has a long half-life (3200 time steps). To a first approximation, this stable value is equal to the mean of z. Accordingly, the initial value of z̄k was
Figure 10: Parrot image used for learning.

set to the mean value of z after one complete presentation of the set of 1000 stereo pairs.6
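The DoG filtering and subpixel interpolation described in 5.9.1 can be sketched as follows. This is a 1-D illustration under stated assumptions: the kernel radius and the names `dog_filter` and `subpixel_strip` are not from the paper.

```python
import numpy as np

def gaussian_kernel(std, radius=12):
    # Truncated, normalized gaussian kernel (radius is an assumption).
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2 * std**2))
    return g / g.sum()

def dog_filter(signal, std_small=2.0, ratio=1.6):
    """Difference-of-gaussians band-pass: narrow blur minus wide blur
    (std ratio 1.6, smaller std = 2 pixels, as stated in the text)."""
    narrow = np.convolve(signal, gaussian_kernel(std_small), mode="same")
    wide = np.convolve(signal, gaussian_kernel(std_small * ratio), mode="same")
    return narrow - wide

def subpixel_strip(image_row, start, width=10, disparity=0.0):
    """Extract a 10-pixel strip shifted by a (sub)pixel disparity,
    using linear interpolation between neighboring gray-levels."""
    pos = start + disparity + np.arange(width)
    lo = np.floor(pos).astype(int)
    frac = pos - lo
    return (1 - frac) * image_row[lo] + frac * image_row[lo + 1]

row = dog_filter(np.random.default_rng(1).random(200))
left = subpixel_strip(row, start=50)                  # image I1
right = subpixel_strip(row, start=50, disparity=0.3)  # I2, shifted 0.3 px
```

A constant signal is removed entirely by the DoG filter (both kernels are normalized), which is what makes it a band-pass operation.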
5.9.2 Parameter Values. The values of the half-lives hS and hL were set to 32 and 3200 time steps, respectively. The network had 10 hidden units. Unless stated otherwise, parameter values were the same as in previous experiments.

5.10 Experiment 2.1: Gray-Level Images. After 10 conjugate gradient iterations the correlation was r = -0.546, and by 60 iterations it was r = -0.971. The set of weights learned by this network at 60 iterations was used to test its performance (without further learning) on data derived from the test image shown in Figure 11. This data was generated using the same temporal sequence of disparities as was used during learning. For data derived from this test image, r = -0.959.

5.11 Experiment 2.2: Convergence and Local Optima. This experiment was designed to test the reliability and speed of convergence of the

6 For optimization techniques (as in Williams 1991) that take large steps on the error surface defined by F, z̃k and z̄k should be reinitialized at the start of each iteration.
Figure 11: Coral image used for testing.
learning method. Experiment 2.1 was repeated 100 times, with each run being terminated when |r| > 0.9. All 100 runs converged with |r| > 0.9. The median number of iterations required such that |r| > 0.9 was 69, with a minimum of 31 and a maximum of 132 iterations. Overall, these results indicate that, at least for the task demonstrated here, the function F has a low proportion of poor local maxima. This, in turn, suggests that the "energy landscape" defined by F is relatively smooth, allowing it to be traversed by simple search techniques. More importantly, it suggests that maxima are reliably associated with model weight values that enable the detection of perceptually salient parameters.

5.12 Experiment 2.3: Discontinuous Stereo Disparity. A critical assumption of the method is that the perceptual parameter implicit in the input data varies smoothly over time. Obviously smoothness is defined relative to a particular temporal scale, parameterized by the half-life hS in the model. However, given a particular temporal scale, how do discontinuities affect learning? The effect of discontinuities in stereo disparity on learning is shown in Figure 12. The sequence of 1000 disparities was obtained by taking the 1000 sinusoidally varying disparity values used
Figure 12: Graph of state z (- - -) and disparity (—) versus time after learning with temporal discontinuities in disparity.
so far in this series of experiments, and swapping the positions of four subsequences of 250 disparity values with each other. After 100 conjugate gradient learning iterations with 10 hidden units, r = -0.916 for the training set derived from Figure 10, and r = -0.901 on a test set derived from Figure 11. Clearly, a small proportion of discontinuities in an otherwise smoothly changing parameter does not seriously compromise the learning method.

6 Discussion
The sequences of synthetic stereograms used in this paper differ from stereograms obtained from a natural scene in the following ways: (1) the disparity is constant across the input array, (2) no discontinuities in disparity occur within any stereo pair, and (3) stereo disparity is the only quantity that varies smoothly over time. Given these limitations, the main contribution of this paper is to demonstrate that a learning rule, based on a generic assumption about temporal smoothness, can be used to reliably learn a nontrivial function (stereo disparity). The stereo disparity task learned is a hyperacuity task. That is, the amount of disparity is less than the width of a receptor (pixel). This is consistent with psychophysical studies that demonstrate that subjects can discriminate disparities as small as 2 sec of arc, less than one-tenth of the
width (30 sec of arc) of a retinal receptor (Westheimer 1994). Members of a stereo pair that have a subpixel disparity differ in terms of the local slope and curvature of their intensity profiles, and not necessarily in terms of the positions of the peaks and troughs in these profiles. Thus, detecting subpixel disparities cannot be achieved by a pixel-to-pixel correspondence between images in a stereo pair. It requires comparisons of relations-between-pixels in one image with relations-between-pixels in the other image. The resultant meta-relation, which specifies the amount of disparity in each pair, is of a high order. The only means available to the model to discover this parameter was the assumption of its temporal smoothness.

6.1 More Than One Perceptually Salient Parameter. In the experiments described here, disparity is the only parameter that varies smoothly over time. It may not therefore seem surprising that the network learned to detect disparity. Given a sequence of stereo images obtained from a moving pair of cameras, a system based on this type of smoothness assumption is likely to discover simple parameters in preference to disparity. For example, like disparity, luminance tends to vary smoothly across time and space. However, with regard to the human visual system, simple parameters such as luminance are not a major determinant of neuronal activity in area V1, because only a relatively small amount of luminance information is preserved (Nicolls et al. 1992, p. 630). So, although such simple parameters could present problems for an artificial neural network, these problems may not be encountered in the central nervous system. If more than one "nonsimple" perceptually salient parameter (e.g., depth, color, motion) is implicit in the input sequence then these may be kept separate in a network architecture, which is analogous to the separate processing streams in the human visual system.
For example, information regarding parameters such as color and motion is thought to be transmitted by distinct retinal ganglion cells, and this information remains segregated within the LGN and V1 (Livingstone and Hubel 1988). Thus, discovering a perceptual parameter derived from one (or more) of these processing streams is not necessarily precluded by the presence of many perceptual parameters in the retinal image.
6.2 Effect of Half-Life. The assumption of temporal smoothness was implemented via a decay constant λS with half-life hS. Consider a model in which each output unit has a different value of hS. Choosing a value for hS implicitly specifies a temporal "grain size," and therefore restricts learning to parameters that change at a particular rate. Perceptually salient events (such as motion) occur within a relatively small range of temporal windows. At rates of change that are either too high or too low, events cannot be detected by a given unit. Between these two extremes, different units have temporal tuning characteristics (defined by their respective half-lives hS) that ensure that a range of rates of perceptual change can be detected by some subpopulation of units. Just as an array of differently tuned spatial receptive fields can be used to recognize spatial patterns characteristic of certain parameter values [e.g., surface orientation (Stone and Isard 1995)], so an array of differently tuned temporal receptive fields may be used to recognize temporal patterns characteristic of human walkers [e.g., Johansson figures (Johansson 1973)]. The value chosen for hS was important (though not critical) for learning to succeed. However, for any "reasonable" rate of change of disparity, some units in a population tuned to different temporal frequencies would learn disparity.
6.3 Violating the Temporal Smoothness Assumption. Even if we accept that temporal smoothness is defined relative to a particular value of hS, it might be argued that perceptually salient parameters do not always vary smoothly over time. In the perceptual world, violations of the smoothness assumption are not hard to find. However, such violations do not necessarily undermine the model's performance. This is because, in practice, the learning method requires only that discontinuities in parameter values are rare, relative to gradual changes over time. In the example presented earlier, four discontinuities every 1000 time steps did not disrupt the learning process, because most of the input data remained consistent with the smoothness assumption. Thus, the model does not require that perceptually salient parameters change smoothly at all times, but (more realistically) that such parameters change smoothly most of the time.
6.4 Detecting Perceptual Parameters with Different Half-Lives. In Hinton (1989a, p. 208) it was suggested that if the hidden units of an autoencoder network were constrained to vary their states slowly over time then they might encode invariant input parameters such as object identity, and, in Hinton (1989b), that conventional hidden units could be used to encode transient quantities such as object position. However, it may be unrealistic to expect the physical world to partition neatly into invariant (e.g., object identity) and transient (e.g., object position) parameter types. One alternative would be to encode parameters with varying degrees of temporal transience using units with different half-lives. It might then be expected that units with long half-lives would reflect invariant input parameters such as object identity, units with shorter half-lives might encode color (for instance, if a single type of object were presented in many colors), and those with even shorter half-lives might encode transient parameters such as the position of that object.
6.5 Interpreting z. The output z can be interpreted as a noisy version of the temporal average z̃ (see Appendix B). As can be seen, z̃ provides a more accurate measure of disparity than z. Although z is labeled as the output of a unit, the precise correspondence between units and neurons has not been specified. If such a correspondence were to be established then it is likely that z̃ would be equated with the neuron membrane potential, because it is commonly assumed that a neuron behaves like a leaky integrator. This results in a membrane potential that, like z̃, tends to decay exponentially with time.
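The leaky-integrator analogy can be made concrete with the half-life formula from Appendix A, λ = 2^(-1/h): with its input clamped at zero, a running mean with this decay constant loses half its value every hS time steps, just as a passively decaying membrane potential would. A toy check (illustrative only, not the author's code):

```python
# With zero input, the update z̃(t) = λ z̃(t-1) + (1-λ) z(t-1) reduces to
# pure exponential decay of the stored value, so after h_S steps it has
# halved (half-life formula λ = 2**(-1/h) from Appendix A).
h_s = 32
lam = 2.0 ** (-1.0 / h_s)
z_tilde = 1.0
for _ in range(h_s):
    z_tilde = lam * z_tilde   # input clamped at zero
```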
6.6 A Spatial Analogue. It is noteworthy that there exists a spatial analogue to the temporal method presented here. Rather than defining V and U over time, these can be defined over a spatial array of units, so that z̃ and z̄ are short-range and long-range spatial means, respectively (Eglen et al. 1996). In this case, maximizing F forces neighboring units to have similar outputs (low U), but units separated by large distances are forced to have dissimilar outputs (high V). It may be possible to use this method to model the topology-preserving mappings of V1 and V2, in which the value of a perceptual parameter varies smoothly across an array of units. This paper is not intended to suggest that all human perceptual development can be modeled using a generic neocortical learning strategy. However, it is proposed that, in evolutionary terms, a small class of generic learning strategies could have acted as an economical foundation for building neocortical systems to analyse perceptual inputs.
7 Conclusion
Marr's observation that "the world is continuous almost everywhere" (Marr 1982) has been applied to the temporal domain, in which it is assumed that (to paraphrase Marr) the world is continuous almost everywhen. This assumption was used to derive a learning method that, when presented with a sequence of stereo images of a moving surface, discovered precisely that parameter (stereo disparity) that described the behavior of the imaged surface through time. The model may lend itself to the self-organized construction of hierarchical systems, in which successive layers compute increasingly higher order parameters, with the highest layers performing object recognition. Whatever the limitations of the particular model presented in this paper, any learning system that does not take advantage of the temporal continuity of perceptual parameters implicit in its input discards a powerful and general heuristic for discovering perceptually salient properties of the physical world.
Appendix A: The Learning Algorithm

The learning algorithm relies upon batch update of a weight vector w, which contains all weights in the network. At each time step t, a stereo pair is presented at the input layer, and the derivative of F with respect to every weight in the network is computed and added to a cumulative weight derivative vector ∇F. This derivative vector is used to update w only after all T = 1000 stereo pairs have been presented at the input units. The same set of stereo pairs is repeatedly presented in the same order during learning. Storage requirements are minimal because all quantities required for learning can be computed incrementally.
Notation. Units in the input, hidden, and output layers are indexed by subscripts i, j, and k, respectively. For example, a weight that connects hidden unit u_j to output unit u_k is denoted w_{jk}. The state z_{kt} of u_k at time t is z_{kt} = \sum_j w_{jk} z_{jt}, where z_{jt} is the state of u_j. Input and output layer units have linear activation functions, whereas hidden units have a nonlinear (tanh) activation function. Weights connecting input to hidden units, and hidden to output units, are referred to as lower and upper weights, respectively. The function to be maximized is F = log(V/U), where

U = \frac{1}{2} \sum_{t=1}^{T} (\tilde{z}_{kt} - z_{kt})^2, \qquad V = \frac{1}{2} \sum_{t=1}^{T} (\bar{z}_{kt} - z_{kt})^2

V is the long-term variance of z_k, U is the short-term variance of z_k, and T is the period over which they are defined (numerically equal to the number of inputs here). Both V and U are defined in terms of exponentially weighted means of z_k. The weighted means \bar{z}_k and \tilde{z}_k differ only in terms of their respective exponential rates of decay:

\tilde{z}_{kt} = \lambda_S \tilde{z}_{k(t-1)} + (1 - \lambda_S) z_{k(t-1)}, \qquad 0 \le \lambda_S \le 1   (A.1)
\bar{z}_{kt} = \lambda_L \bar{z}_{k(t-1)} + (1 - \lambda_L) z_{k(t-1)}, \qquad 0 \le \lambda_L \le 1   (A.2)

where \lambda_S and \lambda_L are decay constants with half-lives h_S and h_L, respectively, with h_L \gg h_S. The formula for obtaining a value of \lambda for a given half-life h is \lambda = 2^{-1/h}. (Note that z_{k(t-1)} contributes to \tilde{z}_{kt} and \bar{z}_{kt}, but not to \tilde{z}_{k(t-1)} and \bar{z}_{k(t-1)}.)
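A minimal sketch of this computation in code, assuming the running means are initialized to the first state (the initialization and all names are assumptions, not the author's implementation). It also illustrates why F favors slowly varying unit states: a slow sinusoid scores a much higher F than white noise.

```python
import numpy as np

def decay_from_half_life(h):
    # λ = 2**(-1/h): the running mean halves every h steps with zero input.
    return 2.0 ** (-1.0 / h)

def smoothness_objective(z, h_short=32, h_long=3200):
    """Incrementally accumulate U (short-term variance) and V (long-term
    variance) and return F = log(V/U), following A.1 and A.2."""
    lam_s = decay_from_half_life(h_short)
    lam_l = decay_from_half_life(h_long)
    z_tilde = z_bar = z[0]          # assumed initialization
    U = V = 0.0
    for t in range(1, len(z)):
        # A.1/A.2: the means are updated from z[t-1], then compared to z[t]
        z_tilde = lam_s * z_tilde + (1 - lam_s) * z[t - 1]
        z_bar = lam_l * z_bar + (1 - lam_l) * z[t - 1]
        U += 0.5 * (z_tilde - z[t]) ** 2
        V += 0.5 * (z_bar - z[t]) ** 2
    return np.log(V / U)

t = np.arange(1000)
f_smooth = smoothness_objective(np.sin(2 * np.pi * t / 1000))
f_noise = smoothness_objective(np.random.default_rng(0).standard_normal(1000))
```

For white noise both means track the signal equally badly, so V/U is near 1 and F is near 0; for the slow sinusoid the short-term mean tracks closely (small U) while the long-term mean does not (large V), giving F well above 0.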
Evaluating ∂F/∂w. The derivative of F with respect to any weight w is

\frac{\partial F}{\partial w} = \frac{1}{V} \frac{\partial V}{\partial w} - \frac{1}{U} \frac{\partial U}{\partial w}   (A.3)
The identical form of U and V permits us to derive ∂U/∂w for lower and upper weights, from which corresponding equations for V can be obtained by substitution. The incremental computation of U up to time t is

U(t) = U(t-1) + \frac{1}{2} (\tilde{z}_{kt} - z_{kt})^2   (A.4)

Therefore the derivative of U(t) with respect to any weight w is

\frac{\partial U(t)}{\partial w} = \frac{\partial U(t-1)}{\partial w} + (\tilde{z}_{kt} - z_{kt}) \left( \frac{\partial \tilde{z}_{kt}}{\partial w} - \frac{\partial z_{kt}}{\partial w} \right)   (A.5)

Thus, the incremental computation of A.5 depends upon evaluation of ∂z_{kt}/∂w in A.5 and, from A.1,

\frac{\partial \tilde{z}_{kt}}{\partial w} = \lambda_S \frac{\partial \tilde{z}_{k(t-1)}}{\partial w} + (1 - \lambda_S) \frac{\partial z_{k(t-1)}}{\partial w}   (A.6)

for both upper and lower weights. Equations for ∂z_{kt}/∂w (for upper and lower weights) will be derived next, from which equations for ∂z_{k(t-1)}/∂w can be obtained by substitution.

Evaluating ∂z_{kt}/∂w_{jk}. In the case of an upper weight w_{jk} projecting to an output unit u_k,

z_{kt} = \sum_j w_{jk} z_{jt}   (A.7)

where z_{jt} is the state of the jth hidden unit at time t, so that

\frac{\partial z_{kt}}{\partial w_{jk}} = z_{jt}

Evaluating ∂z_{kt}/∂w_{ij}. Equation A.7 can be rewritten in terms of a lower weight w_{ij}, which projects to a hidden unit u_j:

z_{kt} = \sum_j w_{jk} \tanh \left( \sum_i w_{ij} z_{it} \right)

where z_{it} is the state of the ith input unit at time t. The derivative of z_{kt} with respect to w_{ij} is

\frac{\partial z_{kt}}{\partial w_{ij}} = w_{jk} (1 - z_{jt}^2) z_{it}

For an upper weight, this yields

\frac{\partial \tilde{z}_{kt}}{\partial w_{jk}} = \lambda_S \frac{\partial \tilde{z}_{k(t-1)}}{\partial w_{jk}} + (1 - \lambda_S) z_{j(t-1)}

where ∂\tilde{z}_{k(t-1)}/∂w is recursively defined as in A.6. The corresponding derivative for a lower weight is

\frac{\partial \tilde{z}_{kt}}{\partial w_{ij}} = \lambda_S \frac{\partial \tilde{z}_{k(t-1)}}{\partial w_{ij}} + (1 - \lambda_S) w_{jk} (1 - z_{j(t-1)}^2) z_{i(t-1)}
Thus A.5 can be incrementally evaluated up to any time t. Corresponding equations for ∂V/∂w can be obtained by substitution in the derivation of ∂U/∂w. The quantities U and V are by definition simple to compute incrementally. Therefore, A.3 can be computed on-line. For results presented in this paper, weights were adapted only after weight derivatives obtained with 1000 inputs had been accumulated. On-line weight update is possible if, at each time step t, the cumulative values of U(t) and V(t) are good estimates of U(T) and V(T), respectively. This was achieved (for a different learning task) in Stone and Bray (1995) by defining U(t) and V(t) as exponentially weighted moving averages:

U(t) = \lambda_U U(t-1) + \frac{1}{2} (\tilde{z}_{kt} - z_{kt})^2   (A.10)
V(t) = \lambda_V V(t-1) + \frac{1}{2} (\bar{z}_{kt} - z_{kt})^2   (A.11)

where \lambda_V \gg \lambda_U.

Appendix B: An Information-Theoretic Interpretation

Mutual Information. Consider a gaussian signal X = S, and a noisy version Y = (S + N) of S that has been corrupted by additive gaussian noise N. It can be shown (Reza 1961) that the mutual information I(X; Y) between X = S and Y = (S + N) is the log of the ratio of two variances: that of Y = (S + N) and of (Y - X) = N:

I(X; Y) = \frac{1}{2} \log \frac{Var(Y)}{Var(Y - X)} = \frac{1}{2} \log \frac{Var(S + N)}{Var(N)}   (B.1)

where Var denotes variance.
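B.1 is easy to verify numerically; the sketch below (sample size and names are illustrative) estimates the two variances from gaussian samples and compares the result with the closed form:

```python
import numpy as np

# Numerical check of B.1: I(X;Y) = 1/2 log(Var(S+N)/Var(N)) for a
# gaussian signal S corrupted by independent additive gaussian noise N.
rng = np.random.default_rng(0)
n = 200_000
var_s, var_n = 4.0, 1.0

s = rng.normal(0.0, np.sqrt(var_s), n)
noise = rng.normal(0.0, np.sqrt(var_n), n)
y = s + noise

mi_est = 0.5 * np.log(y.var() / (y - s).var())       # sample estimate (nats)
mi_true = 0.5 * np.log((var_s + var_n) / var_n)      # closed form
```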
Relation to the Learning Algorithm.7 The network can be considered in terms of maximizing information with respect to a low frequency (nonlinear) function of the input sequence. The exponentially weighted quantity z̃ is a low-pass version of z, and we can therefore view z̃ as the result of passing z through a filter G with an exponentially decaying impulse response. The effective cutoff frequency f_c of this filter is defined by the time constant λ_S of its exponential decay.

7 This analysis was suggested by Mark Plumbley of Kings College, London.
If G is an ideal low-pass filter,⁸ then the information capacity I_c of G is given by the rate of information transmitted about components of z below f_c. The quantity I_c is given by the difference between the rate of information transmission I_z about z and the rate of information transmission I_hi about components of z that are above f_c:

    I_c = I_z - I_hi    (B.2)

If z has bandwidth B_z, then the variance of the noise added to z as it passes through G is 1/B_z. If the variance of z is V_z then, from B.1:

    I_z = 1/2 log (V_z B_z)    (B.3)

If z_hi has bandwidth B_hi, then the variance of the noise added to z_hi as it passes through G is 1/B_hi. If the variance of z_hi is V_hi, then the high-pass signal z_hi can be transmitted through G using capacity

    I_hi = 1/2 log (V_hi B_hi)    (B.4)

Substituting B.3 and B.4 in B.2, and rearranging:

    I_c = 1/2 log (V_z / V_hi) - K    (B.5)

where K = 1/2 log (B_hi / B_z), which is a constant defined by the filter decay constant λ_s. The quantity V_hi = Var(z_hi) = Var(z - z̄), where Var denotes variance. If we assume that z_hi has zero mean then V_hi = 2U, as defined in A.1, and B.5 can be rewritten as

    I_c = 1/2 log (V/U) - K    (B.6)
        = 1/2 F - K            (B.7)

Thus, in maximizing F, the learning algorithm adjusts network weights so as to maximize information transmission regarding a low-frequency, nonlinear function of the sequence of input vectors.
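The algebra leading from B.2 to B.5 can be checked numerically; the bandwidth and variance values below are arbitrary stand-ins chosen only to exercise the identity:

```python
import math

# Hypothetical bandwidths and variances, solely to verify the B.2-B.5 algebra.
B_z, B_hi = 10.0, 2.0
V_z, V_hi = 3.0, 0.5

I_z = 0.5 * math.log(V_z * B_z)      # B.3
I_hi = 0.5 * math.log(V_hi * B_hi)   # B.4
I_c = I_z - I_hi                     # B.2

K = 0.5 * math.log(B_hi / B_z)       # constant set by the filter
# B.5: the capacity depends only on the variance ratio, up to the constant K.
assert abs(I_c - (0.5 * math.log(V_z / V_hi) - K)) < 1e-12
```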
Acknowledgments

Thanks to R. Lister, S. Isard, T. Collett, J. Budd, N. Hunkin, C. North, and G. Hinton for comments on drafts of this paper, and to D. Willshaw, A. Bray, and H. Barrow for useful discussions. Thanks also to S. Becker, M. Plumbley, and P. Williams for useful discussions on the information-theoretic interpretation given in Appendix B. Thanks to the anonymous referees, one of whom suggested the idea of an information-theoretic interpretation. This research was supported by a Joint Council Initiative grant awarded to J. Stone, T. Collett, and D. Willshaw.

⁸An ideal low-pass filter passes only those frequency components below its cut-off frequency f_c.
Learning Visual Parameters Using Smoothness Constraints
1491
References
Barlow, H. 1972. Single units and sensation: A neuron doctrine for perceptual psychology? Perception 1, 371-394.
Barrow, H. G., and Bray, A. J. 1992. A model of adaptive development of complex cortical cells. In Artificial Neural Networks II: Proceedings of the International Conference on Artificial Neural Networks, I. Aleksander and J. Taylor, eds. Elsevier, Amsterdam.
Becker, S. 1992. Learning to categorize objects using temporal coherence. In Neural Information Processing Systems 4, J. E. Moody and R. Lippmann, eds., pp. 361-368. Morgan Kaufmann, San Mateo, CA.
Becker, S. 1996. Mutual information maximization: Models of cortical self-organisation. Network: Computation in Neural Systems 7(1), 7-31.
Becker, S., and Hinton, G. 1992. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature (London) 355, 161-163.
Bienenstock, E., Cooper, L., and Munro, P. 1982. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2, 32-48.
Durbin, R., and Willshaw, D. 1987. An analogue approach to the travelling salesman problem using an elastic net method. Nature (London) 326(6114), 689-691.
Eglen, S., Stone, J., and Barrow, H. 1996. Learning Perceptual Invariances: A Spatial Model. Tech. Rep. 404, Cognitive and Computing Sciences, University of Sussex.
Foldiak, P. 1991. Learning invariance from transformation sequences. Neural Comp. 3(2), 194-200.
Gibson, J. 1979. The Ecological Approach to Visual Perception. Houghton Mifflin, Boston.
Heywood, C., and Cowey, A. 1992. The role of the 'face-cell' area in the discrimination and recognition of faces by monkeys. Phil. Transact. Royal Soc. London (B) 335, 31-38.
Hinton, G. 1989a. Connectionist learning procedures. Artif. Intell. 40(1-3), 185-234.
Hinton, G. 1989b. Unsupervised learning procedures. First Sun Annual Lecture, Manchester University, audio tape recording.
Hopfield, J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Johansson, G. 1973. Visual perception of biological motion and a model for its analysis. Percept. Psychophys. 14, 201-211.
Linsker, R. 1988. Self-organization in a perceptual network. Computer 21, 105-117.
Livingstone, M., and Hubel, D. 1988. Segregation of form, color, movement, and depth: Anatomy, physiology and perception. Science 240, 740-749.
Marr, D. 1982. Vision. Freeman, New York.
Mitchison, G. 1991. Removing time variation with the anti-Hebbian differential synapse. Neural Comp. 3, 312-320.
Nicholls, J., Martin, A., and Wallace, B. 1992. From Neuron to Brain. Sinauer Associates, Sunderland, MA.
Oja, E. 1982. A simplified neuron model as a principal component analyzer. J. Math. Biol. 15(3), 267-273.
Perrett, D., Harries, M., Mistlin, A., and Chitty, A. 1990. Three stages in the classification of body movements by visual neurons. In Images and Understanding: Thoughts about Images, Ideas about Understanding, H. Barlow, C. Blakemore, and M. Weston-Smith, eds. Cambridge University Press, Cambridge.
Phillips, W., Kay, J., and Smyth, D. 1995. The discovery of structure by multi-stream networks of local processors with contextual guidance. Network 6, 225-246.
Poggio, G., Motter, B., Squatrito, S., and Trotter, Y. 1985. Responses of neurons in visual cortex (V1 and V2) of the alert macaque to dynamic random-dot stereograms. Vis. Res. 25(3), 397-406.
Reza, F. 1961. Information Theory. McGraw-Hill, New York.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536.
Schraudolph, N., and Sejnowski, T. 1991. Competitive anti-Hebbian learning of invariants. NIPS 4, 1017-1024.
Sejnowski, T. 1977. Storing covariance with nonlinearly interacting neurons. J. Math. Biol. 4(4), 303-321.
Stone, J. V. 1992. The optimal elastic net: Finding solutions to the travelling salesman problem. Proc. Int. Conf. Artif. Neural Networks, 170-174.
Stone, J. V. 1995. Learning spatio-temporal invariances. In Neural Computation and Psychology Proceedings, L. S. Smith and P. J. B. Hancock, eds., pp. 75-85. Springer-Verlag, Berlin.
Stone, J. V., and Bray, A. 1995. A learning rule for extracting spatio-temporal invariances. Network 6(3), 1-8.
Stone, J. V., and Isard, S. 1995. Adaptive scale filtering: A general method for obtaining shape from texture. IEEE PAMI 17(7), 713-718.
Sutton, R. 1988. Learning to predict by the methods of temporal differences. Machine Learn. 3, 9-44.
Westheimer, G. 1994. The Ferrier Lecture, 1992. Seeing depth with two eyes: Stereopsis. Proc. Royal Soc. London B 257, 205-214.
Williams, P. 1991. A Marquardt algorithm for choosing the step-size in backpropagation learning with conjugate gradients. Cognitive Science Research Paper CSRP 229, University of Sussex.
Zemel, R., and Hinton, G. 1991. Discovering Viewpoint-Invariant Relationships That Characterize Objects. Tech. Rep., Department of Computer Science, University of Toronto, Toronto.
Received November 17, 1994; accepted March 26, 1996
Communicated by Peter König
Using Visual Latencies to Improve Image Segmentation

Ralf Opara
Florentin Wörgötter
Department of Neurophysiology, Ruhr-Universität, 44780 Bochum, Germany

An artificial neural network model is proposed that combines several aspects taken from physiological observations (oscillations, synchronizations) with a visual latency mechanism in order to achieve an improved analysis of visual scenes. The network consists of two parts. In the lower layers, which contain no lateral connections, the propagation velocity of the activity of the units depends on the contrast of the individual objects in the scene. In the upper layers lateral connections are used to achieve synchronization between corresponding image parts. This architecture assures that the activity that arises in response to a scene containing objects with different contrast is spread out over several layers in the network. Thereby adjacent objects with different contrast will be separated, and synchronization occurs in the upper layers without mutual disturbance between different objects. A comparison with a one-layer network shows that synchronization in the latency-dependent multilayer net is indeed achieved much faster as soon as more than five objects have to be recognized. In addition, it is shown that the network is highly robust against noise in the stimuli or variations in the propagation delays (latencies). For a consistent analysis of a visual scene the different features of an individual object have to be recognized as belonging together and separated from other objects. This study shows that temporal differences, naturally introduced by stimulus latencies in every biological sensory system, can strongly improve the performance and allow for an analysis of more complex scenes.

1 Introduction
The segmentation of a visual scene is a fundamental process of early vision, where elementary features like orientation, color, motion, texture, and disparity must be grouped together to form discrete objects that have to be separated from each other and from the background. In the brain of the higher vertebrates this could be achieved by utilizing the temporal structure of the neuronal signals (McClurkin et al. 1991; Hopfield 1995). In particular, it has been suggested that synchronization between cells could subserve this purpose (Eckhorn et al. 1988; Gray et al. 1989; v.d. Malsburg

Neural Computation 8, 1493-1520 (1996) © 1996 Massachusetts Institute of Technology
Ralf Opara and Florentin Wörgötter
1981). According to this suggestion, different objects can be distinguished by the firing phases and the phase-relations of cells associated with them. As a consequence of such a synchronization process, cells that are coding one object will be forced to fire at the same time, while cells coding a different object will fire at different times. Thus ideally no phase difference exists between cells belonging to the same assembly, while the phase differences to cells of another assembly should be much larger. The temporal separation of two objects achieved by such a mechanism, however, cannot exceed one half-cycle of the underlying oscillation period, which is about 10-30 msec in the visual cortex (Eckhorn et al. 1993). If we assume that the cell assemblies oscillate with 40 Hz, and if the temporal jitter for synchronously firing cells is about 5 msec, then a system that uses phase information to distinguish different objects is only able to separate fewer than three objects (at 40 Hz a half-cycle lasts 12.5 msec, which accommodates at most two to three 5-msec phase slots). To improve the performance of oscillatory networks used for complex visual scene analysis some authors employ the so-called focus of attention (Milanese 1993; Niebur et al. 1993; Olshausen et al. 1993; Fellenz 1994; Niebur and Koch 1996; for a recent review see Posner and Dehaene 1994; Desimone and Duncan 1995). The idea is to label a certain salient part of a visual scene and to restrict the segmentation to objects included in this area. This way only a few objects are present in the labeled part. The time differences (phases) between cells coding different objects are, therefore, sufficient to analyze such a restricted area. After a certain time, or after recognizing all objects, the attention is shifted to other parts of the visual scene and the synchronization process starts again.
Although this is a very intuitive way to analyze complex visual scenes, it should be realized that in this case the segmentation is processed serially and the processing time is strongly dependent on the number of objects and the complexity of the scene. To avoid such serial processing we suggest an alternative approach. The basic idea presented in this study relies on a contrast-dependent propagation delay in a multilayer network. The physiological background we take into consideration is based on the fact that visual latencies depend on the stimulus contrast. They can span large ranges of 30-130 msec in the cortex (Levick 1973; Bolz et al. 1982; Sestokas et al. 1987), measured after a significant proportion of the response has built up (e.g., fourth spike, Bolz et al. 1982). The visual latencies in the cortex (Ikeda and Wright 1975; Raiguel et al. 1989) could, therefore, be used for a contrast-dependent grouping of the neuronal activity. Similar to the visual system, in our model high contrast stimuli need less time to reach higher network levels than low contrast stimuli. This leads to a large temporal separation of the activity of different cell assemblies, which strongly improves undisturbed internal synchronization. A different approach that also utilizes latencies for image segmentation has been suggested by Burgi and Pun (1994). Their model is based on filter functions that include luminance-dependent delays to separate
a scene. The results from the different spatiotemporal filters are temporally integrated, recombining some aspects of the scene. During the whole process only the most relevant primitives (e.g., edges) are used to recognize an object. In contrast to the model of Burgi and Pun (1994), our model is based on spiking neurons and its architecture bears a certain similarity to that of the visual system of vertebrates. In addition, our model does not rely on the detection of primitives; instead, similar to the retina, all aspects of a scene are taken into account and the latency-induced separation is maintained throughout all layers. The final binding of objects is achieved in our network by a conventional synchronization mechanism, and the conceptually new latency algorithm is used to enhance the speed of synchronization for cell assemblies representing different objects. We will first give an overview of the network by describing the dynamics of the neurons and the connection structure; then we will show some typical simulations and give a quantitative estimate of the performance of the network. Finally we will discuss the limitations of the approach and also the question of possible physiological relevance.

2 Description of the Network
2.1 The Dynamics of the Neurons. The behavior of the network is given by the connectivity pattern (Fig. 1) and the dynamic behavior of the neurons, which can be characterized as integrate-and-fire units. In our algorithm synchronization of neurons is achieved by applying a concave membrane potential time-function U(t) [U''(t) < 0] (Mirollo and Strogatz 1990; Ernst et al. 1994; for more details about the synchronization see Appendix A). The dynamics of each individual neuron without connections can be described by

    U(t_{n+1}) = F(t_n + Δt) = F{F^{-1}[U(t_n)] + Δt}    if U(t_n) < Θ
    U(t_{n+1}) = 0                                       else          (2.1)

with:

    F:      a concave function that describes the membrane potential; in our simulations we use F(t) = 1 - exp(-t/τ),
    F^{-1}: the inverse function of F.

The notation that uses the inverse function will be needed as soon as connections between neurons are considered. F describes a low-pass difference equation with a time constant τ that in the first layer depends on the image contrast. Neurons that are stimulated by a high contrast, defined with respect to the background, are firing with a higher frequency (small time constant) than neurons that are stimulated by a low contrast (large time constant).
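The update rule 2.1 can be sketched in a few lines of Python (the function and parameter names below are ours; this is a single free-running unit, not the coupled network):

```python
import math

def simulate_neuron(tau, theta=0.9999, dt=1.0, steps=100):
    """Free-running unit following equation 2.1:
    U(t_{n+1}) = F(F^{-1}(U(t_n)) + dt), with F(t) = 1 - exp(-t/tau),
    and a reset to 0 once U crosses the threshold theta."""
    F = lambda t: 1.0 - math.exp(-t / tau)
    F_inv = lambda u: -tau * math.log(1.0 - u)
    U, spikes = 0.0, []
    for n in range(steps):
        if U >= theta:
            U = 0.0            # reset after a spike
            spikes.append(n)
        else:
            U = F(F_inv(U) + dt)
    return spikes

# A smaller time constant (higher contrast in layer 1) means faster firing.
fast = simulate_neuron(tau=3.0)
slow = simulate_neuron(tau=6.0)
print(len(fast), len(slow))
```

Because F is concave, U rises steeply just after reset and flattens near threshold, which is the property the synchronization argument of Mirollo and Strogatz relies on.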
Θ is the firing threshold. If the membrane potential U reaches the threshold Θ, a spike is generated and propagated through the network. In the next time step U is reset to zero. We used the following parameters for the network dynamics:

    Θ = 0.9999
    τ = 0.15 / contrast    in layer 1
    τ = 1                  in layers 2-9
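With these parameters the interspike interval of an unconnected layer-1 unit follows directly from F: the period is the time for F(t) = 1 - exp(-t/τ) to reach Θ, i.e. -τ ln(1 - Θ). A small sketch (function name ours) shows how the period shrinks with contrast:

```python
import math

THETA = 0.9999

def firing_period(contrast):
    """Interspike interval of an unconnected layer-1 unit: the time for
    F(t) = 1 - exp(-t/tau) to reach THETA, with tau = 0.15 / contrast."""
    tau = 0.15 / contrast
    return -tau * math.log(1.0 - THETA)

# Higher contrast -> smaller tau -> shorter period -> higher firing frequency.
for c in (0.2, 0.5, 1.0):
    print(c, firing_period(c))
```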
Spikes that are propagated will lead to either an excitatory or an inhibitory action at the target cell(s), shifting the membrane potential U(t) by a total of V_A. Thus, the complete dynamics of a neuron is described by

    U(t_{n+1}) = F{F^{-1}[U(t_n)] + Δt} + V_A    if U(t_n) < Θ
    U(t_{n+1}) = 0                               else          (2.2)

The total contribution of the connections from all neurons that spike at time t_n is given as

    V_A = Σ_k v_k    (2.3)

Thus, summation is performed over all neurons k that spike at time t_n. The individual values of v_k are given in Appendix B. The connection strength that determines the absolute value of V_A mainly depends on the distance between source and target cell and the type of connection made (see below). Because U describes a concave function, synchronization occurs necessarily as soon as V_A is unequal to zero (Mirollo and Strogatz 1990; Ernst et al. 1994). The coupling V_A within and between layers is symmetrical and is only a function of distance. It leads to an excitatory interaction within layers over short distances, while inhibitory coupling is predominant at long distances. The coupling strength decays exponentially with increasing distance. For details see Appendix B. Since the firing frequency is not constant in the lower layers, we cannot define the phase in a conventional way. To arrive at a measure suited for our purposes, we define the "phase" Φ of a neuron i with respect to the actual time t as the minimum time difference between its own firing of two subsequent spikes and the temporal average of the firing of all neurons, divided by the firing period. Thus, the phase Φ of a neuron i is
    Φ_i = Φ_i(t) = 360° · (t^mean - min[t - t_i^f, T_i - (t - t_i^f)]) / T_i    (2.4)

where t_i^f is the time when neuron i has fired the last time, T_i is its firing period, and t is the actual time. The temporal average of the firing of all active neurons is t^mean = (1/N) Σ_j min[t - t_j^f, T_j - (t - t_j^f)]. This yields a consistent phase measurement in all layers and is defined for all values t on the time axis.
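The phase measure of equation 2.4 is easy to compute from the last spike times and periods of the active neurons; the following sketch (function names ours) assumes all listed neurons are active:

```python
def phase(t, t_last, T, t_mean):
    """'Phase' of equation 2.4: deviation of a neuron's firing offset from
    the population average t_mean, expressed in degrees of its own period T."""
    offset = min(t - t_last, T - (t - t_last))
    return 360.0 * (t_mean - offset) / T

def population_phases(t, last_spikes, periods):
    # t_mean: temporal average of the firing offsets of all active neurons.
    offsets = [min(t - tl, T - (t - tl)) for tl, T in zip(last_spikes, periods)]
    t_mean = sum(offsets) / len(offsets)
    return [phase(t, tl, T, t_mean) for tl, T in zip(last_spikes, periods)]

# Two neurons that fired simultaneously with equal periods get identical
# phases; the third, offset neuron gets a different one.
print(population_phases(t=100.0, last_spikes=[90.0, 90.0, 95.0],
                        periods=[20.0, 20.0, 20.0]))
```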
2.2 The Network. In the network two functionally different areas can be distinguished. The first area is mainly responsible for the contrast-dependent latency differences. In Figure 1A it is labeled as retina and LGN. The second part, labeled visual cortex, causes fast synchronization and feature binding, which is achieved by lateral connections. The network consists of nine layers. Neurons of each layer are part of a three-dimensional grid. For simplicity, only a two-dimensional frontal view is shown in Figure 1A. The central column of Figure 1A shows the connectivity pattern of the multilayer neural network. Other columns are wired up in the same way, and the same connectivity also applies for the third dimension, which is not shown in Figure 1A. The input of the network consists of a two-dimensional gray-level image that is fed into layer one. The process of image segmentation consists of different processing stages that are arranged in a hierarchical way. In the first processing stage the activity of cells coding certain objects needs different times to activate cells in the higher layers, due to their different latencies. The next layers (above 3) include local interactions between adjacent cells. Because of the locality of the interactions the activity of several objects can coexist in these layers. As the activity is propagated to higher layers, larger cell populations are involved in the synchronization/desynchronization process as a consequence of increasingly wider lateral connections. At the readout layer (in Fig. 1A, layer 9) a representation of global context is achieved. Note, however, that we are not explicitly concerned with the readout itself.
2.2.1 Latency Layers. The first three layers in the network produce the latencies that depend on the stimulus contrast with respect to the background. Thus, these layers can be characterized as a contrast-dependent delay line (Fig. 1B). To introduce delays in our system we use frequency coding at the input stage that depends on feature contrast. To excite neurons in each following layer, long integration times (2-3 cycles) are necessary (Appendix B). The resulting latencies in computer time units are shown in Figure 2 as a function of stimulus contrast, with the integration time of the neurons as parameter. The latencies in Figure 2 are given in arbitrary computer time steps, because the network layout provides no constraints on the physiological parameter ranges that could be used to normalize the time axis in, e.g., milliseconds. Neurons with a high firing frequency (high feature contrast) at the input stage need less time (short latency) to excite cells in layers two and three than neurons with a low frequency (low feature contrast). Propagation through the layers leads to a further amplification of the latency differences until the activity reaches layer 4, which is the first layer where the lateral connectivity sets in. Both effects (initial frequency differences and delay amplification through the layers) effectively mimic the different latencies of a real retinal signal.
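The delay-line idea can be approximated with a back-of-the-envelope sketch: if each layer needs a fixed number of input cycles before it fires, the accumulated delay is roughly the number of layers times the integration cycles times the contrast-dependent firing period. This is a deliberate simplification of the simulated network (function name and the use of the layer-1 period throughout are our assumptions), intended only to show how latency differences grow with depth:

```python
import math

THETA = 0.9999

def layer_latency(contrast, cycles=3, n_layers=3):
    """Rough latency estimate for the three delay-line layers: each layer
    adds roughly cycles * (firing period) of delay, with the layer-1
    period set by tau = 0.15 / contrast (illustration only)."""
    period = -(0.15 / contrast) * math.log(1.0 - THETA)
    return n_layers * cycles * period

# The latency gap between high- and low-contrast stimuli widens with depth.
print(layer_latency(1.0), layer_latency(0.2))
```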
Figure 1: (A) Schematic diagram of the network in a frontal view (third dimension omitted). Lateral connections are shown only for the central column. At this processing stage only feedforward connectivity is used, to ensure a linear information flow to the visual cortex without interference between different cell assemblies.
2.2.2 Synchronization Layers. The next stage synchronizes the neurons, which leads to feature binding. The oscillator characteristics of the neurons assure that almost constant firing frequencies are achieved in the upper layers. In the visual cortex four different types of connections can be distinguished, which will be described by subdividing them into connections within a layer and those between layers.

Connections within a layer:

1. Lateral excitatory connections between neighboring neurons. In the lower layers connections are made only between directly adjacent neurons. The locally restricted connections prevent objects from disturbing each other during the process of forming a cell assembly. This leads to a fast synchronization of cells belonging to a cell assembly coding a certain object, even if more than one object is represented. The radius of lateral excitation increases in the upper
layers. This makes it possible that increasingly larger regions are involved in the synchronization process.

Figure 1: (B) The functional components of the network. The input is piped into a delay line. The duration of the delay depends on the stimulus contrast. In the next processing stage cell assemblies are formed, and cell assemblies coding objects with different contrasts are spatially (i.e., temporally) separated.

2. Lateral inhibitory connections over larger distances. The inhibitory connections within a layer have two effects. First, they force cell assemblies that are active at the same layer at a certain time to fire with a phase shift within the oscillator period. The second effect is a competition of cell assemblies to reach the next layer if they are coding different objects. Due to the competitive character of this connectivity pattern some objects will reach the next layer faster than other objects. The competition continues up to the readout layer, thereby reducing the number of objects represented at a
certain layer and, as a consequence, the complexity of the visual scene that is processed at a certain time. The magnitude of the inhibitory connections is much smaller compared to the excitatory connections between adjacent neurons within a layer, so the synchronization of an already formed cell assembly remains nearly unaffected.

Figure 2: Latency of activity distributions to reach layer 4 as a function of stimulus contrast. Several curves are plotted according to the integration time (in cycles) necessary to fire the next layer.

Connections between layers:

1. Direct feedback inhibition between layers. If the activity distribution of cells coding one object reaches a certain layer l, the activity of the corresponding neurons at layer l' (with l' < l) is inhibited, to prevent the continuous arrival of input. For example, a continuous arrival of input activity from a high contrast stimulus in the lower layers would disturb adjacent low contrast stimuli that just reached the same layer. Feedback inhibition leads to the restriction of the propagating signal for each object to one or two layers [in V1 (Fig. 3)] at a certain time.

2. Lateral excitatory feedback connections. These connections are introduced to speed up the propagation of lagging activity, which is caused by noise in the feature intensity or the latency. The weights of these connections are smaller than the weights of the lateral connectivity pattern within a layer. This leads to the effect that locally
formed cell assemblies are influenced little, while lagging activity of a few individual cells is pulled forward to be included into its cell assembly.

3 Results

3.1 Example. As an example, Figure 3C shows the activity and phase distribution of the network response, with the parameters shown in Appendix B, to the stimulus shown in Figure 3A. The stimulus consists of two square objects (α, β) with a high contrast and a third low-contrast object (γ). The squares α and γ are adjacent, while the square β is spatially separated from them. In Figure 3C (middle) each pixel represents the firing phase of one neuron. The phases are defined for active neurons that fire more than once and are calculated according to equation 2.4. White pixels represent inactive neurons. Columns show the phases of all active neurons at selected times for all layers of the network, while rows represent different times but a fixed layer. In the first three layers (rows 1, 2, and 3 in Fig. 3C, middle) the lack of lateral interactions leads to a random phase relation Φ between all neurons. Due to the latency mechanism, features with a high contrast (α and β in Fig. 3A) need less time (≈ 500 iterations) to reach layer 4 (row 4, columns 2 and 3), while features with a low contrast (γ in Fig. 3A) need longer times (≈ 1150 iterations) (row 4, columns 7 and 8). It should be noted that we assume low background luminance for all simulations presented in this study, which prevents "background neurons" from firing. If higher luminance levels were used, the background would be regarded as an additional object and background neurons would synchronize. The activity of cells coding the same object is initially spread out over several layers. For example, at column 4 in Figure 3C the activity of square α (Fig. 3A) is distributed from layer 1 up to layer 6. This occurs because the input images are noisy (25% noise).
The processing in the grouping and synchronization layers will concentrate the activity distribution of cells coding a certain object to one layer at a certain time. In Figure 3C the activity of square α is grouped at the same layer after passing layer 6, and the cell assemblies that are coding one object are firing synchronously, coded by the same gray level in Figure 3C. Features with a high contrast difference are separated across the model layers after a few iterations (column 2 in Fig. 3C). This spatial (viz. temporal) separation of the activity permits an independent processing of features with different contrast, minimizing the interference between different features after the first processing stages. As a consequence, the speed of the synchronization between neurons coding the same feature is strongly enhanced, and in the higher layers individual features are bound together to form a consistent representation of the objects in the visual scene.

Figure 3: (A) The stimulus that was given to the network consists of two square objects (α, β) with a high contrast and a third low-contrast object (γ). The squares α and γ are adjacent, while the square β is spatially separated from them. (B) Gray-level coding of the phases. (C) Left: Horizontal bars indicate the standard deviation of the firing phase Φ of the neurons that belong to one object (top = α, middle = β, bottom = γ), averaged over the complete run. The standard deviation drops to zero in the higher layers. Middle: Time course of the activity distribution at selected iterations, which are 31, 451, 711, 932, 951, 1029, 1132, 1161, 1352, 1549, 1584, 1773, 1978, 2162 for columns 1 to 14. Pixel gray levels represent the firing phase. Above layer three, two objects (α, β) are separated in time from the third (γ), which has a longer latency. Neurons that represent one object acquire synchrony (i.e., constant phase, indicated by constant gray levels) in the upper layers, and neurons coding objects with the same contrast (α and β) are firing with different phases. Right: Average firing frequency for all cells. The frequency is higher in lower layers, with two peaks representing the three objects (neurons coding α and β are firing with the same frequency), and constant at a single lower value in the model cortex.

Table 1: Comparison of the Performance of Two Networks with (9-Layer) and without (1-Layer) Propagation Delays.

                       1 layer                       9 layer
    No. of objects   Time/time units  CPU time/sec   Time/time units  CPU time/sec
     2                     98              39             2179            570
     3                    104              42             2193            579
     4                    218              83             2230            611
     5                    562             209             2281            652
     6                   2627            1010             2310            678
     9                      -               -             2606            899
    12                      -               -             2861           1096
    15                      -               -             3148           1305

    The task is to recognize 2 up to 15 objects in a visual scene. For each model the iterations and the CPU time on a SUN Sparc 20 for recognizing all objects is shown. A total of 900 neurons per layer have been simulated.

3.2 Comparison to Networks without Latency. In this section we will compare the multilayer network to a one-layer network without latency. The one-layer network has the same dynamics and the same connectivity pattern as layer 9 in our network. The task of both networks is to recognize 2 up to 15 objects in a two-dimensional input image. For each network the time is calculated until all cells coding a certain object are firing synchronously. The time for recognizing all objects is summarized in Table 1. If only a few (2-5) objects are presented to the networks, the one-layer model is faster, because of the delays in the multilayer network. The opposite effect is observed for many (>5) objects. Here the interference between the different cell assemblies in the one-layer network leads to a long time until synchronization occurs. For more than 6 objects synchronization is prevented in the one-layer network. On the other hand, in these cases the synchronization of cell assemblies in the multilayer network is still very fast, because of minimal interference between different objects. In addition, it should be noted that the time for recognition remains almost constant in the 9-layer network, which is a desirable feature, whereas a steep nonlinear increase is observed in the 1-layer net.
4 Performance Tests
The performance of the network is determined in a series of simulations in which some basic parameters (contrast, noise, shading, latency parameter, and the number of adjacent objects) are varied. The performance is measured by the quality of object segmentation in layer 9. We avoid using classical correlation methods like auto- and cross-correlation, which are not well suited to analyze the coherence in our simulations, because synchronous activity always occurs in short bursts in the readout layer neurons, which are otherwise silent. In addition, cross-correlation would have to be performed mutually between many (>10) neurons, which is computationally very expensive. Alternatively we propose a quantification that is achieved by the internal coherence rate within a cell assembly (c_int) coding one object and the external coherence rate between cell assemblies coding different objects (c_ext). This way we get two similarly derived values for each simulation, which can be plotted simultaneously in a three-dimensional diagram (see Figs. 4-7). Coherence rates are calculated for layer 9 only. Thus the following description refers to this layer. The internal coherence rate measures the relative part of synchronously active cells as compared to the maximally possible synchronous cell activity for each individual object in the scene. For one object a it is calculated in the following way. Because our stimuli are rather simple we know how many cells should represent object a, and in the first step all spikes from all those neurons are collapsed into one trace. Multiple spikes occurring at the same time step are considered. From this trace an event-histogram is compiled for each iteration step using a sliding window with a width of 10 iterations. The time window (10 iterations) defines the maximally allowed temporal jitter for synchronously firing cells. This window is less than 20% of an average oscillation cycle, so that each cell can fire only once in the window. Thus, the local maxima of the histogram always represent combined activity across cells.
Accordingly, we call each local maximum an assembly activation. We can now determine the occurrence times t_i of these assembly activations. Note that the assembly activations can sometimes consist of a single spike if the spike train is highly dispersed in the case of nonsynchronous firing, where no true cell assemblies exist. In the next step we determine how many spikes are involved in generating each assembly activation. The number of spikes n_i^a included in an assembly activation is determined by counting all spikes in the original collapsed trace that occur close to the assembly activation, i.e., within the allowed distance of t_i ± 5 iterations. Because of the known dynamics we also know how often the neurons of object a should fire synchronously for each assembly activation (N^a). Thus, we can compute the internal coherence rate by summing the minima¹ of (N^a − n_i^a) and n_i^a.

¹The minimum is taken because temporally breaking up an expected activity group into two groups of equal size (i.e., n_i^a = N^a/2) represents the worst case for a coherence estimation, as compared to a situation where only one or two spikes are missing in a group. Accordingly: N^a − 1 > N^a − (N^a/2) if N^a > 3, but min(N^a − 1, 1) < min[N^a − (N^a/2), (N^a/2)] if N^a > 3.
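As an illustration only (hypothetical helper, assuming spike times are given as integer iteration indices), the collapsing and peak-detection procedure described above can be sketched as:

```python
from collections import Counter

def assembly_activations(spike_times, window=10):
    """Collapse the spikes of one object into an event histogram using a
    sliding window, and return the local maxima (assembly activations)
    as (time, spike count) pairs."""
    counts = Counter(spike_times)
    t_max = max(spike_times)
    # histogram value at t: number of spikes falling into [t, t + window)
    hist = [sum(counts[t + d] for d in range(window))
            for t in range(t_max + 1)]
    peaks = []
    for t in range(1, t_max):
        # local maximum (taking the last index of a plateau)
        if hist[t] >= hist[t - 1] and hist[t] > hist[t + 1]:
            peaks.append((t, hist[t]))
    return peaks

# Two bursts of nearly synchronous spikes yield two assembly activations.
spikes = [5, 5, 6, 50, 51, 51, 52]
print(assembly_activations(spikes))   # -> [(5, 3), (50, 4)]
```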
The minimum of N^a − n_i^a and n_i^a is summed for each assembly activation over the whole spike train in layer 9, eliminating the time dependency. If all cells of a cell assembly are synchronized, the sum over the minimum of N^a − n_i^a and n_i^a is zero, while in the case that no cells are synchronized it is k × N^a = N^a_max (if each neuron fires k times). The normalized internal coherence rate c_int is thus given by

    c_int := 1 − (1/N^a_max) · Σ_i min(n_i^a, N^a − n_i^a)        (4.1)

with

    N^a_max: all spikes in the collapsed spike train of object a,
    N^a: total possible number of spikes in each assembly activation, and
    n_i^a: actual number of spikes in the ith assembly activation.
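The quantity of equation 4.1 can be computed in a few lines (hypothetical function name; the activation counts n_i^a are assumed to have been extracted already):

```python
def internal_coherence(n_act, n_expected, n_max_spikes):
    """Internal coherence rate c_int (equation 4.1).

    n_act:        actual spike counts n_i^a per assembly activation
    n_expected:   N^a, expected synchronous spikes per activation
    n_max_spikes: N^a_max, all spikes in the collapsed train of object a
    """
    penalty = sum(min(n_i, n_expected - n_i) for n_i in n_act)
    return 1.0 - penalty / n_max_spikes

# Fully synchronized: every activation carries all N^a = 10 spikes.
print(internal_coherence([10, 10, 10], 10, 30))        # -> 1.0
# Worst case: each expected group splits into two halves of 5.
print(internal_coherence([5, 5, 5, 5, 5, 5], 10, 30))  # -> 0.0
```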
A high coherence rate indicates synchronicity of cells in the assembly over the whole spike train, while a low coherence rate indicates uncorrelated activity. The other measure we use is the coherence between two cell assemblies (a and b) coding different objects, defined by

    c_ext := (1/N^ab_max) · Σ_{i,j} (n_i^a + n_j^b)  for  |t_i − t_j| < Δt        (4.2)

with

    N^ab_max: all spikes of objects a and b,
    n_i^a: number of firing cells in the ith assembly activation of cell assembly a,
    n_j^b: number of firing cells in the jth assembly activation of cell assembly b,
    t_i, t_j: occurrence times of the assembly activations, and
    Δt: maximal time difference for synchronously firing cells (10 iteration steps).

This sum grows only if assembly activations for the objects a and b occur simultaneously, i.e., within 10 iteration steps. Thus, the external coherence rate indicates if different cell assemblies belong together (c_ext = 1) or not (c_ext = 0). The coherence rate is calculated for several simulations (mostly 12) and the results are averaged considering their errors. The central advantage of these two measures is that they are derived similarly and can, thus, be plotted in one diagram.
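A minimal sketch of this measure (hypothetical function name), assuming activations of the two assemblies are paired whenever their occurrence times differ by less than Δt and normalizing by the combined spike count N^ab_max:

```python
def external_coherence(acts_a, acts_b, n_max_ab, dt=10):
    """External coherence rate c_ext between two cell assemblies.

    acts_a, acts_b: lists of (t_i, n_i) pairs -- occurrence time and
                    number of firing cells of each assembly activation.
    n_max_ab:       all spikes of objects a and b combined.
    dt:             maximal time difference for synchronous firing.
    """
    s = sum(n_i + n_j
            for t_i, n_i in acts_a
            for t_j, n_j in acts_b
            if abs(t_i - t_j) < dt)
    return s / n_max_ab

# Two assemblies of 5 cells firing in the same cycles belong together.
a = [(0, 5), (50, 5)]
b = [(3, 5), (52, 5)]
print(external_coherence(a, b, 20))   # -> 1.0
```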
Figure 4: (A) The stimulus consists of two adjacent squares. The contrast between the squares is varied from 3 up to 43%. (B) The external (surface) and internal (gray-level distribution) coherence rates of two cell assemblies representing two adjacent objects as a function of contrast and latency parameter.

4.1 Coherence Rate as a Function of Contrast and Latency. To use the network in an image segmentation task it is necessary to know the network response to different kinds of stimuli as a function of some basic parameters. The most important parameter is the latency. It determines the extent of temporal separation at a certain layer or, equivalently, the spatial separation of activity at a certain time as a function of contrast. In several simulations a stimulus was presented to the system, consisting of two adjacent squares (Fig. 4A1-A3) with a different contrast, varying from 3 to 43%, between them. The latency parameter indicates the time (not normalized) between stimulus onset and the occurrence of the first response in layer 4 (Appendix B). To examine the behavior of the system the internal and external coherence rates are calculated according to equations 4.1 and 4.2. The external coherence rate is shown in Figure 4B as a surface, while the internal coherence rate is presented as a gray-level distribution on the surface. The internal coherence rate is proportional to the amount of white in the gray-level distribution. In Figure 4B region 1 indicates little interference between the cell assemblies (low external coherence rate c_ext) and a high coherence rate within each cell assembly (high internal coherence rate c_int). Two assemblies are formed with an activity that is independently propagating through the network. Region 2 is characterized by an intermediate external c_ext and internal c_int coherence rate, where the
relation between the cells is ambiguous. The network does not form distinct cell assemblies; instead, individual cells switch between different states. In this region the system cannot clearly decide if the stimulus consists of one or two objects. In region 3 the external c_ext and internal c_int coherence rates are high and all cells are firing synchronously, forming a single assembly. At short latencies the ambiguity range (region 2) is large because the transition from region 1 to region 3 is shallow. At longer latencies the transition is steep and the ambiguity range is small.

4.2 Robustness against Noise. A system utilizing the contrast differences of objects must be tolerant of any unintentional disturbance caused by noise. The stimulus that was given to the network is shown in Figure 5A1-A3. It consists of two adjacent squares with a contrast of 20% between them. The luminance I of the individual pixels varies according to

    I = I_0 · (1 + [rand(noise_max) − 0.5 · noise_max] / 100)        (4.3)

with

    I_0: luminance without noise,
    rand(noise_max): positive random number smaller than noise_max, and
    noise_max: maximal noise in % of I_0.
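A sketch of this noise model (hypothetical function name, assuming rand draws uniformly from [0, noise_max)):

```python
import random

def noisy_luminance(i0, noise_max):
    """Pixel luminance with zero-mean multiplicative noise (equation 4.3).

    i0:        luminance without noise
    noise_max: maximal noise in percent of i0
    """
    # rand(noise_max) is a positive random number smaller than noise_max,
    # so the bracket is zero-mean in (-noise_max/2, +noise_max/2).
    r = random.uniform(0.0, noise_max)
    return i0 * (1.0 + (r - 0.5 * noise_max) / 100.0)

# With 20% noise the luminance stays within +/-10% of i0.
vals = [noisy_luminance(100.0, 20.0) for _ in range(1000)]
print(min(vals) >= 90.0 and max(vals) <= 110.0)   # -> True
```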
Internal and external coherence rates are computed and averaged as before (cf. Fig. 4B). In Figure 5B four different areas can be distinguished. The first region is characterized by a low internal coherence rate and a low coherence rate between the cell assemblies. This is the case at very long latencies (latency parameter = 140) and very noisy stimuli (50% noise). If the latencies are too large, synchronization is difficult to achieve, because noisy image parts are artificially separated over long distances. The second region is characterized by a good internal synchronization and a low coherence rate between the cell assemblies. This behavior occurs at an intermediate range of latencies (latency parameter > 120 at 50% noise and latency parameter > 110 at 10% noise). In this parameter range the network groups the activity of the cell assemblies internally, and cell assemblies coding different objects are separated from each other (low coherence rate between cell assemblies). In region three the internal coherence rate is low, while the external coherence rate is at an intermediate level. The activity of different cells interferes and the cell assemblies are not able to group their activity internally over the whole spike train. The cells switch between the states "one object" and "two objects." In such a region the system cannot make a clear decision, but this is only a small range.
Figure 5: (A) The stimulus that was given to the network consists of two adjacent squares with a contrast of 20% between the squares. The noise-induced variation of luminance is calculated according to equation 4.3. (B) The coherence rate c_ext between two cell assemblies is shown as a surface and the coherence rate c_int within a cell assembly is coded as a gray-level distribution on the surface. In this representation the latency parameter of the network response and the noise of the stimulus are varied.

The last region shows a high internal and external coherence rate. The cells are internally grouped together and the two cell assemblies are synchronized. This indicates a decision: one object is recognized.

4.3 Shaded Surfaces. In a natural environment it is quite common that the luminance changes continuously along the object surface. An important feature of a contrast-based system must be the tolerance against such shaded surfaces. The connectivity pattern of the network, as mentioned above, leads to a synchronization and grouping of activity of cells representing similar contrast and spatial neighborhood. Shaded surfaces are characterized by smooth contrast changes within the boundary of the object. As a consequence of our model, shaded surfaces with a small shading gradient are much more strongly grouped together than surfaces with a steep gradient. In a series of simulations the response of the network has been calculated as a function of the gray-level gradient and latency. The internal (equation 4.1) and external (equation 4.2) coherence rates are calculated and shown in Figure 6B. The stimulus is shown in Figure 6A1-A3 and consists of two squares with a contrast of 25% between them, which are
connected by a gray-level ramp. The contrast of the squares is fixed, while the gradient of the gray-level ramp is varied from 2.7 to 10 pixel⁻¹.

Figure 6: (A) The stimulus consists of two adjacent squares with a smooth transition. The contrast between the left and right edge is 25%. (B) The coherence rate c_ext between two cell assemblies is shown as a grid and the coherence rate c_int within the cell assemblies is coded as a gray-level distribution on the grid, as a function of latency and the gradient of the transition between the squares.

In Figure 6B three different areas can be distinguished. The first region is characterized by a high internal synchronization but a low coherence rate between the cell assemblies. Here the gradient of the shading is so steep that the activity is grouped into two different cell assemblies. This is the case for shadings with a large gradient at latency parameters greater than 100 and for smaller gradients at long latencies (latency parameter = 125). In region two the internal coherence rate is low, while the external coherence rate is at an intermediate level. Cell assemblies are not able to group their activity internally and the cells are unpredictably switching between different phase relations. The last region (3) shows a high internal and external coherence rate. The cells are internally grouped together and the two cell assemblies are synchronized, thus coding a single object.

4.4 Influence of Adjacent Objects. The simulations above have been performed for two adjacent objects that are represented by two cell assemblies of the same size (Figs. 4B, 5B, and 6B). This means that only one contact surface exists. In Figure 7B the influence of many contact surfaces
is shown. The stimulus consists of a high contrast square in the middle of the image and several rectangles with a lower contrast that are placed around the high contrast square (Fig. 7A1-A3).

Figure 7: (A) The stimulus consists of two objects: one bright square in the middle of the image and one surrounding object with one up to four contact surfaces to the square in the middle. The contrast between the objects is 20%. (B) Coherence rate (c_ext as grid and c_int as gray-level distribution on the surface) of cell assembly 1 (high contrast) and cell assembly 2 (low contrast) as a function of the number of contact surfaces and the latency.

As above, we can distinguish three different areas in Figure 7B. The first region is characterized by a high internal synchronization and a low coherence rate between the cell assemblies. This occurs at long latencies (latency parameter greater than 105 with only one contact surface and latency parameter greater than 110 with four contact surfaces). For four contact surfaces this region is shifted only a little bit to higher latencies as compared to the network response for one contact surface. This indicates that many contact surfaces have only little effect on the network response. In region two the internal coherence rate is low and the external coherence rate is at an intermediate level, and as in the examples above (Figs. 4B, 5B, and 6B) no grouping is achieved. The last region shows a high internal and external coherence rate. The cells are internally grouped together and the two cell assemblies are synchronized to represent one object. The effect of many contact surfaces is characterized by the tendency to bind the cell assemblies together a little earlier compared to only one contact surface. This effect, however, is small compared to other effects (Figs. 4B, 5B, and 6B).
The simulations (Figs. 4, 5, 6, and 7) show that the network can simultaneously treat different combinations of individual stimulus parameters. A significant overlap of valid decision surfaces exists over large ranges for all four kinds of stimulus parameters used in these figures. This is most easily demonstrated in an example of only two stimulus combinations, e.g., Figure 4 (contrast) and Figure 5 (noise). For a combination of these parameters in the ranges 15-25% contrast and 10-30% noise the network will still bind the two cell assemblies if the latency parameter is lower than 85, while for a latency parameter greater than 115 it will separate them. This example can be extended to more than two stimulus parameters.

5 Discussion
In this study we combined several physiologically inspired aspects of real neural networks (i.e., oscillation, synchronization) with a latency mechanism into a network that simultaneously achieves efficient object binding and image segmentation. The following discussion shall be devoted to two questions: (1) What are the advantages and the inherent performance limitations of the network? and (2) Is this mechanism physiologically feasible?

5.1 Advantages and Limitations. The central advantage of our network relies on the fact that the most general feature of each image, its contrast distribution, is directly used to control the synchronization process. Features with a high contrast are thereby favored and reach the last network layer faster than objects that do not "jump to the eye." This mechanism thereby mimics human perception (Burgi and Pun 1991), but, more importantly, it efficiently limits the information flow that needs to be evaluated in the network at any point in space (viz. time). In other approaches a small focus of attention is used (Milanese 1993; Niebur et al. 1993; Olshausen et al. 1993; Fellenz 1994; Niebur and Koch 1996; for a recent review see Posner and Dehaene 1994; Desimone and Duncan 1995) to also impose spatial restrictions on the information flow. In most cases the focus of attention is shifted through the network either arbitrarily or by means of a precomputed saliency map. The term "saliency," however, is not well-defined, and what shall be regarded as salient mostly relies on the taste of the network designer. In our design we circumvent the problem of how to define and shift a focus of attention and instead rely entirely on the inherent saliency differences of features that have a different contrast. As a result, our network is rather simple, and almost no restrictions are imposed on the design as long as it contains a delay-line and is able to synchronize its units.
In our model synchronization occurs only between spatially adjacent units over increasing lateral distances. This apparent disadvantage could
be overcome by a more complicated lateral connectivity, which can lead to a synchronization within almost any kind of desired geometry. Another restriction of the network is even more prominent: in most cases objects with a graded shading of a high contrast will be cut by the network into two or more pieces traveling separately through the layers. There seems to be no immediate way to overcome this restriction without changing the network architecture to a large degree (e.g., by introducing feedback loops and/or higher order feature detector neurons). In general our network will fail as soon as higher level "cognitive decisions" have to be made. As a solution to this problem, in the brain of vertebrates feedback loops, higher order feature detector neurons, and more complex receptive fields are introduced along the hierarchy of the visual pathway in order to finally lead to our own advanced abilities for image analysis. Our current knowledge about the network of the visual pathway, however, is still so strongly limited that it remains unknown how to assemble these aspects to create a generic image analysis network (for physiologically oriented models see, e.g., Wörgötter and Koch 1991; Somers et al. 1995). Therefore, so far all artificial synchronization models introduce a certain degree of arbitrariness in the design as soon as higher level image segmentation and binding are to be achieved (Sporns et al. 1989; Tononi et al. 1992). The two restrictions of our network discussed above (spatial neighborhood restriction and graded shading restriction) can probably be lifted by rather simple additions to the design. The goal of this study, however, was not to try to achieve complex network performance by an arbitrary connection structure but rather to promote a novel idea of how to improve image segmentation by a very simple, generic, and physiologically plausible (see below) mechanism.
It should be noted that even some of the existing connections in the network (e.g., feedback excitation, feedback inhibition) could be removed without altering its basic behavior. These connections have been introduced to make the system more robust against noise in the stimuli, and they are not responsible for generating latencies or leading to synchronization. As an additional problem it should be realized that the network architecture cannot directly be transferred onto the brain, because a strict layering over more than three stages does not exist there. Abeles et al. (1993) have, however, introduced the idea of a synfire chain, which essentially represents information processing by subsequently excited groups of neurons. This idea supports a more serially arranged processing structure as demanded by our model. The latencies in our network are defined by different contrasts. It would, however, also be possible to employ more complicated aspects of an image or image sequence (e.g., the distributions of orientation, depth, and motion) to define the latencies. In fact, Niebur and Koch (1996) proposed using a combination of different image features with a predominance of motion to create a saliency map that defines how the
focus of attention is shifted across the image in their model. This could very well be regarded as a clever pipelining of several spotlights that would be processed subsequently.
5.2 Physiological Background. The model we presented in this study is far from any biological realism; rather it is meant to demonstrate the applicability of a combination of several physiologically inspired mechanisms. There are nevertheless several aspects that bear a certain degree of similarity to the visual system. From psychophysical experiments it has long been known that latencies can have a distinct influence on visual perception. Most famous is probably the Pulfrich effect (1922; see Enright 1985), where luminance differences between both eyes result in a depth percept of a stimulus that is actually moving in an equidistant plane in front of the observer. More recently evidence was provided that the Pulfrich effect is actually reflected at the level of single cell behavior in the visual cortex (Carney et al. 1989). Visual latencies do not depend only on the contrast but also on color and spatial frequency of the stimulus. In addition, they depend on the state of light adaptation. All these effects do not interfere with the general idea of our model, because for us contrast represents only the simplest way of creating a functional dependence for the latencies, which could also be replaced by more complicated combined dependencies. While the physiological findings point to the general importance of latencies for visual perception, it is still unclear if a delay-line/synchronization model such as ours could reflect reality. As a basic requirement our model demands that synchronous oscillations directly follow the onset of a visual stimulus. Whittaker and Siegfried (1983), in a study of visually evoked potentials, provide evidence for such a stimulus-locked oscillation onset, which we could also reconfirm (Wörgötter et al. 1996). Over the last years a heated discussion has been going on about the physiological relevance of models that use synchronization for feature binding (see Singer and Gray 1995 for a recent review).
It does not seem to be appropriate to enter this discussion here because in the context of this study the major focus lies on the latency mechanism. In fact, Burgi and Pun (1994) demonstrated that a delay mechanism can be used to improve image analysis without employing synchronization in an oscillator network. The difference between the approach of Burgi and Pun and our model lies mainly in the fact that they used spatiotemporal filters and a convolution algorithm for image analysis whereas our approach is an artificial neural network. There are, however, several questions associated with the latency mechanism, five of which shall be discussed in the following.
1. Are the latencies in the visual system long enough?
Several studies have shown that the visual delays in V1 (or area 17, cat) of the cortex lie approximately within the range of 30 to 130 msec depending on the contrast.² Given an oscillation period of between 15 and 30 msec, the latency range would cover about 3-6 oscillation cycles, which could be enough to subserve our purposes.
2. Are the latencies narrowly distributed?
The answer to this question is plainly no. Statistical fluctuations at the level of the retina will influence the individual latencies of cortical cells. The X- and Y-subsystems have different propagation delays (Bolz et al. 1982; Sestokas et al. 1987). The time to response of an individual cell depends on its size and the distribution of the input synapses. These and other effects lead to a significant broadening of the latency distribution at the level of the cortex (Dinse and Krüger 1994). The strong noise robustness of our network (see Fig. 5B), however, points to the fact that a rather large degree of broadening in the latency distribution is easily tolerated, so that this problem seems to be of minor relevance.

3. What happens if objects have only a slight contrast difference? After all, humans are able to distinguish between those, too. A slight contrast difference will certainly not lead to truly visible latency differences; it could, however, be sufficient to induce small initial temporal differences (phase differences) that could also facilitate a mutually undisturbed synchronization. It is conceivable that networks with feedback loops could amplify the small initial differences, which would result in a good separation of the objects.
4. What happens when the network changes to a new scene? For example, a bright object in the new scene could "overtake" the dark objects from the old scene that are still processed in the network, and this would cause a disturbance of the synchronization process. The simplest solution that would avoid such a problem would be to introduce a reset mechanism that erases the network activity entirely before a new scene can enter. In the visual system the saccadic eye movements could indeed reflect such a reset mechanism by the saccadic suppression that is in most cases associated with saccades, in particular in the magnocellular pathway (Burr et al. 1994).
5. What happens if a bright object enters the scene?

²This range was estimated from several studies (Levick 1973; Bolz et al. 1982; Sestokas et al. 1987) and own observations (Wörgötter et al. 1996). The major problem of a reliable latency estimate is the question of what to actually measure: the first, second, third, etc. spike? The time until a significant proportion of the response has built up? Or other measures? Most of the used measures suffer from inaccuracies, so that the values can only be regarded as estimates.
If this happens the network performance could be reduced, but only in the unlikely case that the activity of the bright object would get close enough (spatially and temporally) to other active neurons so as to disturb their internal synchronization. A moving stimulus represents a very salient feature (Niebur and Koch 1996) and it is conceivable that such cases are strongly governed by attentional mechanisms that are not included in our model.

5.3 Conclusion. The intention of this study was not to design a biologically realistic model but rather to outline an artificial neural network that represents an abstraction from the real visual pathway. It was attempted to demonstrate the applicability and the limitations of the approach, and to achieve this we have restricted ourselves to a very simple architecture that focused only on the algorithmic principles that were to be tested. The enhancement of assembly formation achieved as the result of the latency mechanism can be used as a first but highly efficient step in image analysis problems. It is very likely that these originally formed cell assemblies provide no more than the basis for an immediately following regrouping by other mechanisms to analyze more complicated visual problems. While we did not attempt this, it is clear that the model provides no restrictions for a possible rearrangement of the activity that could, for instance, be achieved by feedback from higher visual areas (Sporns et al. 1989; Tononi et al. 1992). We were able to show that visual latencies could be used in a rather robust way to enhance image segmentation. If future work should show that the proposed mechanism to utilize latencies does not reflect the computational reality in the visual pathway, then the intriguing question remains how the brain would analyze a visual scene that is indeed stretched in time by the unequivocally existing latencies.
Appendix A: The Synchronization Process

The process of synchronization of two neurons A and B (Fig. 8A) with excitatory connections is shown in Figure 8B. Both cells are driven by the same stimulus and the initial phase between the oscillators is 180° (distance between 1_A, 1_B). If neuron A fires at time 1_A the membrane potential of neuron B (1_B) is increased by an amount ε, which in this example induces a shift in phase of about Δφ_B = 77° (1'_B). Because of the small gradient of U(t) at 1_B the small shift by ε leads to a large shift of the relative phase between neurons A and B (Fig. 8B). After firing, the membrane potential of neuron A is reset to zero (1'_A). At time 2_B neuron B reaches the threshold Θ and also fires. The membrane potential of neuron A (2_A) is increased by ε (2'_A). Although the membrane potential is again shifted by ε, the effect of the phase shift Δφ_A = 13°
Figure 8: (A) Diagram of two excitatorily connected neurons A and B.
is smaller than at time step 1, because the gradient of the membrane potential U(t) at time 2_A is much steeper than at time 1_B. The resulting phase difference after both neurons have fired, Δφ_new = 180° − Δφ_eff, is decreased by Δφ_eff = Δφ_B − Δφ_A = 77° − 13° = 64°. This procedure is repeated several times, which yields synchronously firing cells already after 3 firing cycles. Note that the results presented in this paper are little affected by the special choice of neural dynamics, as long as the dynamics lead to a synchronization between the neurons.
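The qualitative behavior described above (a kick shifts the phase strongly where U(t) is shallow and weakly where it is steep, so that the phase difference shrinks every cycle) can be reproduced with any generic concave membrane potential; the function and parameters below are illustrative stand-ins, not the dynamics actually used in the simulations:

```python
import math

B = 3.0              # curvature of the membrane potential (assumed)
C = math.expm1(B)    # normalization so that U(1) = threshold = 1

def phase_after_kick(phi, eps):
    """Concave potential U(phi) = log(1 + C*phi)/B; an excitatory kick
    eps is added to U and the new phase returned (0.0 if it fired)."""
    u = math.log1p(C * phi) / B + eps
    if u >= 1.0:
        return 0.0                    # pushed over threshold: fire, reset
    return math.expm1(B * u) / C      # invert U to get the new phase

def cycles_to_sync(phi0=0.5, eps=0.05, max_cycles=50):
    """Alternate A-kicks-B and B-kicks-A until both fire together."""
    phi = phi0                        # phase of B at the moment A fires
    for cycle in range(1, max_cycles + 1):
        phi = phase_after_kick(phi, eps)              # A's spike shifts B
        if phi == 0.0:
            return cycle
        phi = 1.0 - phase_after_kick(1.0 - phi, eps)  # B's spike shifts A
        if phi == 1.0:
            return cycle
    return None

print(cycles_to_sync())   # -> 9 (with these assumed parameters)
```

Starting from antiphase, the asymmetry of the two phase shifts grows each cycle until one neuron's spike pushes the other over threshold and both fire together.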
Appendix B: Connectivity Parameters
This appendix shows the parameters used in the simulation shown in Figure 3. In some of the other simulations small modifications of these parameter tables have been made. The following tables show the connection strengths between neurons. A common multiplication factor is given with which all values have to be multiplied. The leftmost column is the source cell column. All connections are made from those source cells to the target cells with the connection strength as given at the location of the target cell in the table. All tables are symmetric with respect to the source cell column and only one side is plotted. Layers (rows) that do not contain connections are not plotted. The actual direction (lateral, feedback, feedforward) of a connection type is described in the header of each table and the reader is referred to Figure 1. Inhibitory connections take negative values.
Figure 8: (B) Membrane potential of neurons A and B demonstrating the process of synchronization of the two connected cells.

Connection type 1: all basic excitatory feedforward connections from layer l to layer l+1 have a weight of 0.55X, with X = (200 − latency parameter)/200 in layers 1-3 and X = 1 in layers 4-9. In layers 1-3, X is varied between 0.25 (latency parameter = 150) and 0.9 (latency parameter = 15) (Figs. 4-7).

Connection type 2: excitatory and inhibitory connections within a layer: common multiplication factor: 0.04. Example: Source cell S9 connects to B9 with weight 1.0 times 0.04.
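The latency scaling of the connection type 1 weights can be written as a small helper function (a sketch; the function name is ours, and the assignment of the X factors to layers 1-3 vs. 4-9 follows our reading of the description above; the quoted 0.9 appears to be a rounded value of (200 − 15)/200 = 0.925):

```python
def feedforward_weight(latency, layer, base=0.55):
    """Connection type 1: weight of the excitatory feedforward
    connection from layer `layer` to layer `layer + 1`.

    In layers 1-3, X = (200 - latency parameter)/200; in layers 4-9, X = 1.
    """
    x = (200.0 - latency) / 200.0 if layer <= 3 else 1.0
    return base * x

# Latency parameter 150 gives X = 0.25; latency parameter 15 gives X = 0.925.
w_slow = feedforward_weight(150, layer=1)   # 0.55 * 0.25
w_deep = feedforward_weight(150, layer=5)   # 0.55 * 1
```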
Ralf Opara and Florentin Wörgötter
[Table: connection type 2 weights from source cells S_i to target cells A-G across layers 1-9; the individual entries are not recoverable from the scan.]
Connection type 3: excitatory connections from layer l to layer l−1: common multiplication factor: 0.10. Example: Source cell S connects to B3 with weight 0.83 times 0.10.

[Table: connection type 3 weights from source cells to target cells A-G across layers; the individual entries are not recoverable from the scan.]

Acknowledgments
The authors are grateful to E. Nelle for critical comments on the manuscript. F.W. acknowledges the support of the Deutsche Forschungsgemeinschaft WO-388/4-2.
April 1, 1996
Communicated by Scott Fahlman
Learning and Generalization in Cascade Network Architectures

Enno Littmann
Abt. Neuroinformatik, Fakultät für Informatik, Universität Ulm, D-89069 Ulm, Germany
Helge Ritter
AG Neuroinformatik, Technische Fakultät, Universität Bielefeld, D-33615 Bielefeld, Germany

Incrementally constructed cascade architectures are a promising alternative to networks of predefined size. This paper compares the direct cascade architecture (DCA) proposed in Littmann and Ritter (1992) to the cascade-correlation approach of Fahlman and Lebiere (1990) and to related approaches and discusses the properties on the basis of various benchmark results. One important virtue of DCA is that it allows the cascading of entire subnetworks, even if these admit no error backpropagation. Exploiting this flexibility and using LLM networks as cascaded elements, we show that the performance of the resulting network cascades can be greatly enhanced compared to the performance of a single network. Our results for the Mackey-Glass time series prediction task indicate that such deeply cascaded network architectures achieve good generalization even on small data sets, when shallow, broad architectures of comparable size suffer from overfitting. We conclude that the DCA approach offers a powerful and flexible alternative to existing schemes such as, e.g., the mixtures of experts approach, for the construction of modular systems from a wide range of subnetwork types.

1 Introduction
There are several good reasons to develop alternatives to the popular multilayer perceptron trained by the error backpropagation learning rule (Werbos 1974; Rumelhart et al. 1986). The major drawbacks are that you have to predefine your network architecture, that backpropagation is very slow, especially if the structure comprises more than one or two layers, and that there are network architectures that cannot be cascaded because there is no suitable learning algorithm available (e.g., vector quantization networks).

Neural Computation 8, 1521-1539 (1996) © 1996 Massachusetts Institute of Technology
In the last years, many approaches have been developed that modify the internal network structure without having direct evaluation information for the quality of the internal units. These are the pruning methods (e.g., LeCun et al. 1990), as well as incremental approaches that build internal structures (Mezard and Nadal 1989; Frean 1990). The work of Nabhan and Zomaya (1994) is a recent approach combining growing and pruning of a MLP for function approximation. Fahlman and Lebiere proposed networks with a narrow, deeply cascaded structure trained by the cascade-correlation algorithm (Fahlman and Lebiere 1990). Internally, nonlinear units are added incrementally and trained to maximize the covariance with the residual error. Thus, the success of the cascade units can be controlled directly without the need of the backpropagation rule. In the next section we present a new direct cascade architecture derived from the cascade-correlation approach by inversion of the construction process and change of the learning rule. We discuss similarities, differences, and advantages of the approaches. For many real-world applications, a major constraint for successful learning from examples is the limited number of examples available. Thus, methods are required that can learn from small data sets. However, if the number of adjustable parameters in a network approaches the number of training examples, the problem of overfitting occurs and generalization can become very poor (Baum and Haussler 1989). This severely limits the size of networks applicable to a learning task with a small data set. To achieve good generalization even in these cases, the architecture must be matched carefully to the structure of the problem at hand. This is a particular domain of incrementally built networks.
In Section 3.3 we show for the Mackey-Glass time series prediction that the direct cascade architecture can exploit the potential of large networks to bear on the problem of extracting information from small data sets without running the risk of overfitting.

2 Cascade Architectures
2.1 Cascade Correlation. In the original cascade-correlation architecture CASCOR (Fahlman and Lebiere 1990) the nonlinearity is achieved by incrementally adding sigmoid units trained to maximize the covariance with the residual error. The algorithm starts with the training of the output unit y to approximate a target function t(ξ) using gradient descent. After stagnation of the convergence process the weights become "frozen" temporarily. The current output of level 0, y_0(ξ), produces a residual error E_0(ξ) = t(ξ) − y_0(ξ). An independent unit c_0 is now trained to maximize the covariance S between the residual error E_0(ξ), normalized with its average over all patterns, and its own normalized activity c_0(ξ):

    S = | Σ_ξ [c_0(ξ) − c̄_0] [E_0(ξ) − Ē_0] |    (2.1)

where c̄_0 and Ē_0 denote the averages over all training patterns.
For scalar output and using the tanh activation function, this yields the adaptation rule

    Δw_j^(c_0)(ξ) = η [E_0(ξ) − Ē_0] [1 − c_0²(ξ)] x_j(ξ)    (2.3)

where 1 − c_0²(ξ) is the partial derivative of c_0(ξ) for the hyperbolic tangent. After stagnation, the weights w_j^(c_0) become frozen and the output c_0(ξ) of the new cascade unit is added to the set of input values of the original network. The output unit is then retrained, yielding a new output y_1(ξ) based on the extended input. This two-step method can be iterated arbitrarily, e.g., until an output y_n after n cascading steps yields a sufficiently accurate approximation of the target function t(ξ). The process leads to a network architecture as shown in Figure 1.

2.2 Influence of the Optimization Scheme. To apply error minimization as optimization scheme for the cascade units, two major changes to the original CASCOR algorithm are required. We discard the cascade output normalization c̄_0 in order to apply incremental learning.¹ For compatibility reasons, the residual error must not be normalized, because the cascade output should approximate the error exactly if we apply error minimization. The adaptation rule (2.3) then reads

    Δw_j^(c_0)(ξ) = η E_0(ξ) [1 − c_0²(ξ)] x_j(ξ)    (2.4)
This equation implements a cascade-correlation variant with an incremental learning scheme, referred to as CC-INC. If we replace the correlation maximization for the training of the cascade units by error minimization, the adaptation rule for the first cascade neuron c_0 is given by

    Δw_j^(c_0)(ξ) = η [E_0(ξ) − c_0(ξ)] [1 − c_0²(ξ)] x_j(ξ)    (2.5)
Networks using this update rule will be referred to as CAS-EM (cascades with error minimization). The main difference between the resulting adaptation rules 2.4 and 2.5 is the additional term −c_0(ξ) in the error signal of the error minimization rule. In both cases the term [1 − c_0²(ξ)] controls the adaptation gain. If

¹In batch learning, the derivative of the normalization term c̄_0 cancels out if summed over a full epoch. Learning incrementally, the derivative can be approximated only by a moving average.
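The difference between rules 2.4 and 2.5 is easiest to see in code. A minimal sketch (variable and function names are ours; scalar output and a tanh cascade unit, as above):

```python
import numpy as np

def cascade_unit_update(w, x, residual_error, lr=0.01, scheme="CAS-EM"):
    """One incremental weight update for a tanh cascade unit.

    CC-INC (rule 2.4): delta_w = lr * E0        * (1 - c0^2) * x
    CAS-EM (rule 2.5): delta_w = lr * (E0 - c0) * (1 - c0^2) * x
    """
    c0 = np.tanh(w @ x)          # cascade unit output
    gain = 1.0 - c0 ** 2         # tanh derivative, controls the adaptation gain
    if scheme == "CC-INC":
        error_signal = residual_error        # track the error's magnitude
    else:
        error_signal = residual_error - c0   # approximate the error exactly
    return w + lr * error_signal * gain * x

w = np.zeros(3)
x = np.array([1.0, 0.5, -0.5])
w = cascade_unit_update(w, x, residual_error=0.8)
```

For CAS-EM the adaptation stops once c_0 matches the residual error, whereas CC-INC keeps growing the weights toward saturation, which reflects the qualitative discussion below.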
Figure 1: Basic architecture constructed by the CASCOR algorithm. At the beginning and after insertion of each cascade neuron C_i, the output neuron Y is trained to approximate the target function. The cascade neurons C_i are each trained once to maximize the covariance of their output with the current error of the output neuron.
the absolute value of the cascade output approaches 1, the adaptation is negligible. This may happen quite often because the residual error t(ξ) − y_0(ξ) can reach absolute values larger than 1, so that the exact value cannot be reproduced.² Due to the nature of covariance, equation 2.3 leads to an increase of the weights until the output c_0(ξ) runs into saturation, unless the training is stopped earlier, e.g., because the performance improvement is controlled. The residual error term guarantees that examples with small residual errors have less impact on the weight adaptation than those with large error. Adaptation rule 2.5 stops adapting the weights when the output either equals the residual error or approaches an absolute value of 1. Thus, the adaptation speed is additionally controlled by the difference between the output and the residual error term. Cascade-correlation units rather care about the magnitude of their correlation with the residual error and tend to yield binary output. In contrast, units trained to minimize the residual error are suited to approximate the continuous-valued residual error function. This implies that correlation might be the better scheme for binary-valued problems, whereas error minimization should be favored in the continuous case.

2.3 Direct Cascading. Like CASCOR, the CAS-EM approach has only one output layer while incrementally building a cascaded preprocessing structure inside. These internal units are trained to compensate the residual error of the current output. The approximation has to be learned on the basis of the original input and the activity of previous cascade units. In the case of linear networks there is no difference between subtracting the current output from the target value or adding it to the current input.³ The latter method has the advantage that the construction scheme can be inverted.
Instead of all cascade units learning different target functions (i.e., different residual errors), all units learn the same target function on the basis of different input vectors of growing dimension and complexity. Thus we arrive at an architecture that differs significantly from the cascade-correlation approach. The algorithm starts with the training of a sigmoid unit with output y_0 to approximate a target function t(ξ) according to

    y_0(ξ) = tanh [ Σ_{i=0}^{N} w_i^(y_0) x_i(ξ) ]    (2.6)
After an arbitrary number of training epochs, the weight vector w^(y_0) becomes "frozen." Now we add the output y_0 of this unit as a virtual

²Usually, the target values for units with the tanh activation function range over [−1, 1].
³Note, however, that the equivalence of the two approaches in the linear case does not hold for nonlinear activation functions. These are necessary, because linear cascades can be reduced to one single layer. This issue is discussed in Littman and Ritter (1994) in more detail.
input unit and train another perceptron as new output unit

    y_{l+1}(ξ) = tanh [ Σ_{i=0}^{N+1} w_i^(y_{l+1}) x_i(ξ) ]    (2.8)

with
the input vector extended by the element x_{N+1}(ξ) = y_l(ξ). This procedure can be iterated arbitrarily and generates a network structure as depicted in Figure 2. At each stage of the construction process all units provide an approximation of the target function t(ξ), albeit corresponding to different states of convergence (Fig. 2). We call this network type direct cascade architecture ("DCA").

2.4 Cascade Modules. Besides the nonlinear activation function (Fig. 2a) there is another way to achieve nonlinearity. We can add nonlinear functions f(y_i) together with y_i in the ith cascade step to the input vector, as indicated in Figure 2b. By this means we provide units yielding more complex decision surfaces than hyperplanes. For example, if we provide the squared output of some sigmoid unit as additional input, the network can form "hyperstripes." Thus, at each cascade step the network is offered a whole set of new different nonlinearities from which the adaptation process can "choose," albeit at the expense of some increase in training time per layer. In Littman and Ritter (1994) we describe experiments where we use powers of the output y_i. The direct cascade algorithm is not restricted to perceptron-like units. It can be applied to any arbitrary nonlinear module and does not rely on the availability of a procedure for error backpropagation. Therefore, in addition to pure feedforward approaches like simple perceptrons, it is also applicable to units performing vector quantization or "local linear maps" ("LLM," Fig. 2c).

2.5 Local Linear Maps. LLM networks have been introduced earlier (Fig. 3; for details, cf. Ritter 1991; Ritter et al. 1992). They are related to the self-organizing maps (Kohonen 1982; Kohonen 1990; Ritter 1991). The basic idea is to perform a vector quantization of the input space combined with an adaptive, locally valid, piecewise linear approximation of the output values.
Without this linear interpolation term, using a weighted superposition of the output of all nodes, the network structure is equivalent to GRBF networks (Poggio and Girosi 1990). There have been a number of other approaches building on similar ideas of learning a piecewise linear approximation. Some base the approximation on all previously seen data (Shepard 1968; Farmer and Sidorowich 1987); others use the k-d tree method (Friedman et al. 1977) to determine the cluster centers (Cleveland et al. 1988; Stokbro et al. 1990; Omohundro 1991; Atkeson 1992).
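Reduced to a few lines, the direct cascade construction of Section 2.3 is just the following loop: train a unit on the full target, freeze it, append its output to the input vector, and repeat. A sketch with our own names, where a plain online delta rule stands in for the training procedure:

```python
import numpy as np

def train_unit(X, t, epochs=200, lr=0.05):
    """Train one tanh unit on the full target t (delta rule, cf. 2.6/2.8)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            y = np.tanh(w @ x)
            w += lr * (target - y) * (1.0 - y ** 2) * x
    return w

def build_dca(X, t, n_layers=3):
    """Direct cascade architecture: every unit learns the SAME target,
    each on an input vector extended by the previous unit's output."""
    frozen = []                          # "frozen" weight vectors, one per layer
    for _ in range(n_layers):
        w = train_unit(X, t)
        frozen.append(w)
        y = np.tanh(X @ w)               # output of the newly frozen unit
        X = np.hstack([X, y[:, None]])   # output becomes a virtual input unit
    return frozen

# Toy data: a bias input plus two binary features, targets in [-1, 1]
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([-1.0, 1.0, 1.0, -1.0])
cascade = build_dca(X, t)
```

Note how the weight vectors grow by one element per cascade step, reflecting the input vector of growing dimension.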
Figure 2: Basic network structure constructed by the direct cascade architecture DCA. Starting with neuron Y1, it is first used as output neuron and thus trained to approximate the target function. After stagnation of the learning process, it is used as cascade neuron and a new neuron Y2 is inserted as output neuron, receiving the output approximation of neuron Y1 as additional input. The sigmoid neurons (a) can be enhanced by adding nonlinear functions of their output (b) or even be replaced by more powerful units like LLM networks (c).

The influence of the linear approximation is depicted in Figure 3. LLM networks consist of N units r = 1, …, N, with an input weight vector w_r^(in) ∈ R^L, an output weight vector w_r^(out) ∈ R^M, and an M × L matrix A_r for each unit r. The matrix A_r performs the linear interpolation within the tessellation cell of node r. The output of a single LLM network for an input feature vector x ∈ R^L is

    y^(net)(x) = y_s(x) = w_s^(out) + A_s (x − w_s^(in))    (2.9)
Figure 3: LLM network in one dimension. The desired output value y^(out) is approximated by the output y_s of the nearest reference node s. The reference nodes s − 1 and s + 1 are more distant in the input space.

with s the "winner" node, determined by the minimality condition

    d_s = ||x − w_s^(in)|| = min_r ||x − w_r^(in)||    (2.10)
This leads to the learning steps for a training sample (x^(t), y^(t)):

    Δw_s^(in) = ε_1 (x − w_s^(in))    (2.11)

    Δw_s^(out) = ε_2 (y − y^(net)(x)) + A_s Δw_s^(in)    (2.12)

and

    ΔA_s = ε_3 d_s^{−2} (y − y^(net)(x)) (x − w_s^(in))^T    (2.13)
applied for T samples (x^(t), y^(t)), t = 1, 2, …, T, where ε_i, i = 1, 2, 3 denote learning step sizes. Their value usually decays exponentially from a large initial value (typically ε^initial = 0.9) to a small final value (typically ε^final = 0.01). Learning rule 2.11 is very similar to the on-line k-means update rule. The update of w_s^(in) according to 2.11 leads to a change of the network output y_s via the linear interpolation term. This must be compensated in the update of w_s^(out) (2.12) by the additional term +A_s Δw_s^(in). The performance of this algorithm can be further enhanced by implementing a softmax method to calculate the network output as a weighted superposition of all node outputs, thus yielding continuous output values. The activity a_r of the single node r decreases exponentially with the
Euclidean distance to the input vector. It is defined as

    a_r(x) = exp[−||x − w_r^(in)||² / (β ρ_r)²] / Σ_{r'=1}^{N} exp[−||x − w_{r'}^(in)||² / (β ρ_{r'})²]    (2.14)

where the degree of overlap is controlled by an additional parameter β and a node-specific measure ρ_r that depends on the average distance of the node r to its neighbors. The softmax method is usually applied during the final training phase, where the update steps 2.11-2.13 for node r are weighted with its activity a_r. The softmax output is obtained by

    y^(net)(x) = Σ_{r=1}^{N} a_r(x) y_r(x)    (2.15)
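Equations 2.9, 2.10, 2.14, and 2.15 combine into the following forward pass (a sketch; array shapes follow the definitions above, and the toy numbers are ours):

```python
import numpy as np

def llm_output(x, w_in, w_out, A, beta=1.0, rho=None, softmax=False):
    """Output of an LLM network.

    w_in : (N, L) input reference vectors
    w_out: (N, M) output weight vectors
    A    : (N, M, L) local interpolation matrices
    """
    d = np.linalg.norm(x - w_in, axis=1)                  # distances (2.10)
    local = w_out + np.einsum('nml,nl->nm', A, x - w_in)  # y_r(x) per node (2.9)
    if not softmax:
        return local[np.argmin(d)]                        # winner-take-all
    rho = np.ones(len(d)) if rho is None else rho
    a = np.exp(-(d / (beta * rho)) ** 2)                  # node activities (2.14)
    a /= a.sum()
    return a @ local                                      # superposition (2.15)

# Two nodes in one dimension that together represent y = x exactly
w_in = np.array([[0.0], [1.0]])
w_out = np.array([[0.0], [1.0]])
A = np.array([[[1.0]], [[1.0]]])     # local slopes of 1
y = llm_output(np.array([0.2]), w_in, w_out, A)   # winner node 0: 0 + 1*0.2
```

Because both local maps here realize the same linear function, the winner-take-all and softmax outputs coincide; with differing A_r the softmax blends the local approximations smoothly.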
3 Experiments
For the experimental evaluation of the direct cascade approach, a variety of well-known benchmark problems have been considered. A more thorough investigation of the performance of DCA networks built of perceptron units on the XOR problem and other problems is given in Littman and Ritter (1994).

3.1 Classification Tasks. We use the two-class benchmark problems known as Easy Task and Hard Task (Kohonen 1988) with 2 and 6 dimensions (cf. Appendix). The classes consist of multivariate normal distributions with variable dimension. For both tasks, a lower and an upper bound can be computed. Figure 4a-d compares the performance of DCA networks with different cascade modules to results achieved with the original CASCOR net,⁴ with our cascade-correlation implementation CC-INC described in Section 2.2, and the theoretical bounds. For both 2D tasks, using sigmoid neurons, the network types with incremental learning scheme, DCA and CC-INC, begin with the performance of a linear classifier. All cascade methods lead to a significant performance improvement. The DCA performance increases faster and converges against a larger classification rate. Different results are achieved for the 6D tasks. The performance of the original CASCOR algorithm is equivalent or even slightly better (hard task) compared to the DCA. The CC-INC implementation performs clearly worse. For both tasks, the addition of nonlinear functions of the output (cf. Section 2.4) yields a further performance improvement described in Littman and Ritter (1994). This improvement is big for the 2D cases but negligible in the more complex 6D cases. For both 6D tasks, the classification rates of the cascade algorithms based on sigmoid neurons are not satisfying compared to the rate of the

⁴The CASCOR implementation was obtained from ftp.cs.cmu.edu:/afs/cs.cmu.edu/project/connect/code/supported/cascor-c
[Figure 4 plots: classification rate vs. number of layers for panels (a) Easy Task (R²), (b) Easy Task (R⁶), (c) Hard Task (R²), (d) Hard Task (R⁶).]

Figure 4: Classification performance on Easy Task and Hard Task in R² and R⁶. The performance of DCA-LLM networks with different LLM size is compared to a sigmoid DCA, the original CASCOR implementation, and to our implementation of cascade-correlation with incremental learning (CC-INC).
Bayes classifier. After some 15 layers the classification performance of the cascade network stagnates. Adding more cascade layers yields no further improvement. If we want to achieve better results, we have to use more powerful components for the cascade. As the DCA has no limitations concerning the type of units, we use LLMs (Section 2.5) as network modules. The performance of networks using LLMs with 2, 3, 5, and 10 nodes per layer ("DCA-LLM") is also shown in Figure 4. Already single LLM nets achieve a considerable classification rate and the cascading improves
the performance to the vicinity of the optimal rate.⁵ Thus, the DCA-LLM approach clearly outperforms the other architectures, especially for the 6D easy task. The performance of the DCA-LLM networks is closely related to the number of nodes per LLM layer. This can be seen in Figure 4c for the 2D hard task. Applied to these relatively simple tasks, even cascades using very few nodes per LLM net converge against the optimum value, albeit employing a larger number of layers.

3.2 Time Series Prediction. Another well-known benchmark is the prediction of the Mackey-Glass time series (cf. Appendix). This task is very hard for both types of cascade networks when perceptron units are used as cascade components. We were not able to achieve a reasonable approximation with a NRMSE below 0.65. Similar experiences have been reported by Fahlman (1992) and Crowder (1990). This is no general problem with prediction, as experiments with simpler functions showed. The advantages of cascades can be exploited, however, if we employ more sophisticated units as components. Crowder investigated the performance of CASCOR using gaussian units, achieving a minimal NRMSE of 0.32 (Crowder 1990). In our studies, we use complete LLM subnetworks (cf. Section 2.5). Figure 5 demonstrates the influence of the size of the cascaded LLMs for 10⁵ training steps and data sets L_500, consisting of 500, and L_5000, consisting of 5000 samples. In both cases the performance can be improved significantly. On average, the remaining error of a single layer can be reduced to 50% by cascading. The building of cascade layers yields an improvement of ever decreasing gain and converges against a saturation level determined by the size of the LLM subnetworks and the size of the available training set. This confirms the hypothesis that the preprocessing power of the single cascade unit is an important determinant for the potential of the cascade approach. Note, however, that there is one caveat.
The outlier in Figure 5a, a DCA-LLM with 160 nodes per layer applied to the small data set L_500, shows worse performance than a cascade with LLMs consisting of only 20 nodes per layer. This effect is due to overfitting of the small data set. This phenomenon will be discussed in the next section. Hartmann and Keeler have compared the performances of a variety of different network types on the Mackey-Glass prediction benchmark using a training set consisting of 500 samples and a total of 300 training epochs (Hartmann and Keeler 1991). For the same constellation, our best DCA-LLM with 3 layers of 70 nodes each (2310 parameters) achieves a minimal NRMSE of 0.033 (average 0.037) on the training set and 0.043 on the test set (average 0.050). The training consisted of 60 epochs (winner-

⁵The performance measured on the basis of a finite number of data points is an approximation fluctuating around the asymptotic performance. It therefore can reach and even exceed the theoretical optimum value.
[Figure 5 plots: (a) NRMSE vs. number of layers for 500 training samples, (b) for 5000 training samples; (c) NRMSE vs. nodes/LLM layer (iso-layer dependence), (d) iso-nodes dependence.]
Figure 5: Performance for the Mackey-Glass benchmark: DCA-LLM networks with different LLM sizes are trained with (a) 500 and (b) 5000 training samples. (c) Iso-layer dependence: the error depends on the number of nodes/LLM layer. (d) Iso-nodes dependence: if a fixed number of nodes is given, the performance depends on the number of layers among which the nodes are distributed.
take-all) and 20 epochs with softmax per layer, a total of 240 training epochs (Table 1). The training time T for a DCA-LLM network depends on the number of layers L, the number of nodes per layer N, and the number of adaptation steps S. As an example, on an IBM RS6k-350 computer the training of a DCA-LLM with LLM layers consisting of 50 nodes each, and each layer trained with 50,000 patterns (100 epochs for the L_500 data set), takes 14 sec for the first layer, 80 sec for five layers, and 164 sec for 10 layers. Softmax training requires approximately three times as much computing time per learning step.
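The underlying series (whose exact generation settings are given in the paper's appendix, not reproduced in this excerpt) follows the Mackey-Glass delay differential equation dx/dt = a·x(t−τ)/(1 + x(t−τ)^10) − b·x(t). A generation sketch by simple Euler integration, assuming the commonly used values a = 0.2, b = 0.1, τ = 17 rather than the authors' stated settings; in practice a smaller step or a higher-order integrator is preferable:

```python
def mackey_glass(n_samples, a=0.2, b=0.1, tau=17, dt=1.0, x0=1.2):
    """Generate a Mackey-Glass time series by Euler integration."""
    delay = int(tau / dt)
    x = [x0] * (delay + 1)        # constant history for t <= 0
    for _ in range(n_samples):
        x_tau = x[-delay - 1]     # delayed value x(t - tau)
        dx = a * x_tau / (1.0 + x_tau ** 10) - b * x[-1]
        x.append(x[-1] + dt * dx)
    return x[delay + 1:]

series = mackey_glass(500)        # 500 samples, as in the benchmark above
```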
Table 1: Performance of DCA-LLM Cascades for the Mackey-Glass Time Series on Training and Test Set Compared to Results from Hartmann and Keeler (1991)

Network architecture    Total parameters   NRMSE (training)   NRMSE (test)   Training epochs
LMS                     5                  0.54               0.59           400
Gaussian bars (300)     4500               0.06               0.08           400
Sigmoids (10)           171                0.06               0.08           200,000
RBFs (300, wide)        1801               0.06               0.08           100/300
LLM cascade 2 x 50      1000               0.054              0.062          2 x 150/50
LLM cascade 5 x 40      2600               0.052              0.062          5 x 60/20
LLM cascade 3 x 70      2310               0.037              0.050          3 x 60/20
3.3 Generalization Abilities. To evaluate the generalization performance of the direct cascade architecture, we consider the problem of time series prediction based on the Mackey-Glass differential equation, with a training set consisting of 500 samples. The first results were achieved with cascade networks consisting of LLM nets after a total of 300 training epochs on a learning set of 500 samples. The training epochs were distributed equally among the layers, i.e., for a cascade of 5 layers, each layer was trained 300/5 = 60 epochs. Figure 5c shows the performance of such LLM cascade networks on the independent test set for different numbers of cascaded layers as a function of the number of nodes per layer ("iso-layer-curves"). The graph indicates that for the given task there is an optimal number of about 40 nodes for which the performance of the single-layer network has a best value P*_1 (NRMSE ≈ 0.16). Networks with a single LLM layer consisting of more nodes perform increasingly worse due to overfitting. However, if we add more nodes in form of a second layer, Figure 5c shows that we can increase the network performance significantly beyond P*_1. Similarly, the performance of the resulting two-layer cascade network cannot be improved beyond an optimal value P*_2 by arbitrarily increasing the number of nodes in the two-layer system. Adding a third cascade layer again allows us to make use of more nodes to improve the performance further, although this time the relative gain is smaller than for the first cascade step. The same situation repeats for larger numbers of cascaded layers. This suggests that the cascade architecture is very suitable to exploit the computational capabilities of large numbers of nodes for the task of building networks that generalize well from small data sets, without running into the problem of overfitting when many nodes are used in a single layer. Now we can ask how a given fixed number of nodes should be optimally arranged in a DCA-LLM network. Is it better to arrange N nodes in a
single layer network, in two layers with N/2 nodes per layer, or even in four layers with N/4 nodes? Figure 5d shows the NRMSE results if N = 30, 60, 120, 240 nodes are distributed equally among L layers of N/L nodes each, L ranging from 1 to 10 layers ("iso-nodes-curves"). The results show that (1) the optimal number of layers increases monotonically with, and is roughly proportional to, the number of nodes to be used; (2) if for each number of nodes the optimal number of layers is used, performance increases monotonically with the number of available nodes, and thus, as a consequence of (1), with the number of cascaded layers.
4 Discussion
4.1 Direct Cascading and Related Approaches. The application of both incremental cascade algorithms to the benchmark problems discussed above yielded significant performance improvements. For the range of tasks studied in our paper, the DCA results were consistently and significantly better than those using our cascade-correlation implementation CC-INC, which uses neither quick-prop nor candidate pools nor batch training. This modification facilitated the comparison of the different optimization schemes. The comparison with the original CASCOR scheme of Fahlman and Lebiere (1990) yields two different results. For the 6D tasks the CASCOR scheme performs better than our cascade-correlation implementation, and at least as good as the DCA, for the hard task even better. For the 2D classification tasks, however, it achieves results equivalent to our implementation of the cascade-correlation scheme. These results show that the algorithm can be implemented using incremental learning, but with the risk of possible performance losses. It is remarkable that CASCOR always performs clearly better on the training data than on the test set, while this difference is less distinct for the algorithms with incremental learning. The DCA proves to be a good alternative to CASCOR, with similar efficiency and advantages due to the inversion of the cascade procedure that results in the proposed one-step algorithm. Another related algorithm is the extentron algorithm suggested by Baffes and Zelle (1992) to solve classification problems with binary and with two-bit output. Their network is incrementally constructed by adding perceptrons that are trained to split at least one of the training examples from the data set. Thus, problems that are not linearly separable can be solved in a very straightforward manner. The resulting network consists of a cascade of perceptrons identical to our perceptron-based DCA networks. They investigate various classification problems,
confirming our results for the XOR problem and yielding results for other benchmarks like the two-spiral problem or the soybean data set. Their restriction to binary and two-bit output problems and perceptron units as cascade components results in a limited applicability of their algorithm. In CASCOR and DCA, specific training and evaluation information is available during the training of each unit or subnetwork. Thus no method to backpropagate the error is necessary, and the success can be measured immediately, not implicitly by the success of the complete net. Training all cascade units simultaneously in a cascade of predefined size yields less favorable results. This might be due to the large fluctuations that occur in this case and has not been pursued any further. A unique feature of DCA is that the activity of any single unit represents a full mapping of the input function. Thus, we can regard a DCA net as a single net with a deeply layered structure, if we count all the intermediate units as hidden or preprocessing units. On the other hand, we have a cascade of multiple nets all trying to map the same input, but on the basis of differently refined preprocessing provided by the predecessors in the cascade. From this point of view, the DCA is a method to construct a modular hierarchy of complete networks. This view is especially appropriate if we use complex networks like LLM nets as cascade units.
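The forward pass of such a direct cascade, where every unit maps the original input augmented by the outputs of its predecessors, can be sketched as below. This is a minimal sketch: the helper `cascade_predict` and the toy callables used as layers are hypothetical stand-ins, not the LLM units or training procedure of the paper.

```python
import numpy as np

def cascade_predict(x, layers):
    """Forward pass through a direct cascade: each layer maps the
    original input vector extended by the outputs of all preceding
    layers, so the input dimension grows along the cascade.  The last
    layer's output is taken as the cascade's answer."""
    x = np.asarray(x, dtype=float)
    extra = []                      # outputs of earlier cascade units
    output = None
    for layer in layers:
        features = np.concatenate([x] + extra) if extra else x
        output = np.atleast_1d(layer(features))
        extra.append(output)        # becomes additional input downstream
    return output

# toy layers: each "unit" just sums its (growing) input vector
layers = [lambda f: f.sum(), lambda f: f.sum()]
out = cascade_predict([1.0, 2.0], layers)   # second layer sees [1, 2, 3]
```

Note how the second unit sees the first unit's answer as an extra input, which is the "differently refined preprocessing" described above.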
4.2 Preprocessing. The cascade architectures can be regarded as methods to incrementally build a powerful preprocessing device. The preprocessing capability of the single component seems to delimit the utmost performance improvement possible by means of the cascade algorithm (Section 3.1). Our results indicate that sigmoid units are well suited to cope with certain classification problems. For approximation tasks, much more efficient use of the potential of the method is made by cascading more powerful mapping units like LLM networks. In this view, the cascading approach becomes a general tool for building modular systems. Our results for the case of LLM networks as cascade components illustrate two important benefits of such an approach: (1) Several small subnetworks arranged in a direct cascade obtain better generalization abilities than a single, large network of comparable size (Section 3.3). (2) The method can be applied even for types of modules for which no error-backpropagation is possible. This greatly enhances the scope of the approach and also offers extremely interesting perspectives for the integration of neural networks with, e.g., standard approximation schemes, into hybrid systems. Future work should aim at comparing this strategy for building modular systems with existing alternative methods such as the mixture of experts algorithm (Jacobs et al. 1991; Nowlan and Hinton 1991). The use of LLM networks as cascaded components can actually be viewed as an initial step toward the integration of mixtures-of-experts with deeply cascaded architectures.
5 Conclusions
Neural networks offer a wide range of architectural possibilities. The present paper builds on the cascade-correlation approach by Fahlman and Lebiere (1990) and presents a related, direct cascade architecture that differs in several important aspects. The error-minimization scheme facilitates the approximation of continuous-valued problems. The strategy for the network construction is straightforward, as each unit (cascade layer) learns the same target function, albeit on the basis of an input vector of increasing dimension. We calculated nonlinear functions of the output activity of the cascaded units to generate a set of additional nonlinear inputs. Furthermore, complete LLM networks were employed as cascade components to achieve a more powerful preprocessing. Our results show that DCA networks are very powerful and flexible to be adapted to the problem at hand. The cascade components must be matched to the complexity of the task and to the amount of data available. In the case of classification tasks the DCA can exploit the incremental splitting of the data set by the cascade units. For problems with continuous-valued output we can usually expect at least some performance improvement, even if the single cascade components are already properly adapted to the specific task. Another important feature of the cascade architecture is its ability to make extremely efficient use of the available data. For a limited number of training samples, our results with DCA-LLM networks indicate that it is more favorable to use cascaded networks than shallow broad architectures, provided the same number of nodes is used in both cases. The DCA-LLM allows us to use the benefits of large numbers of nodes even for small training sets and still bypasses the problem of overfitting. The "width" of each individual layer must be matched to the size of the training data set. The cascade "depth" is then determined by the total number of nodes available.
6.1 Kohonen Problems. The classes (Kohonen 1988) consist of multivariate normal distributions: C1 (mean 0, sigma = 1); C2 (mean (2.32, 0, ..., 0), sigma = 2); C3 (mean 0, sigma = 2). The data sets are generated with 2 and 6 dimensions. Learning and test sets are independent, consisting of 1000 samples each. The Easy Task is to distinguish between members of C1 and C2. To classify samples of C1 and C3 is called the Hard Task. For these classification tasks, the optimal Bayes classifier can be computed analytically. We find Bayes classification rates for the Easy Task of 83.3% (2D) and 92.0% (6D), and for the Hard Task of 72.8% (2D) and 86.3% (6D). A lower limit is given by the application of a linear classifier in the
first dimension. This separation yields a classification rate of 79.2% (easy task) and approximately 57% (hard task).7 The results of the algorithms DCA and CC-INC were measured after 10 training epochs (10,000 adaptation steps), with samples randomly drawn from the training sets. For the CASCOR training, the patience parameter was set to 20 epochs, yielding an average training of the output neuron of about 40 epochs and 80-150 epochs for the cascade neuron. Only one candidate unit was used. All results are averages over 100 runs. Network answers with |y_net(x) - t(x)| < 0.1 were regarded as correct. For the CASCOR net, this is equivalent to a score threshold of 0.45.

6.2 Time Series Prediction. The benchmark problem of predicting the Mackey-Glass time series (Lapedes and Farber 1987) is based on the differential equation (Mackey and Glass 1977)

$$\dot{x}(t) = -b\,x(t) + \frac{a\,x(t-\tau)}{1 + x^{10}(t-\tau)}$$

With the parameters a = 0.2, b = 0.1, and tau = 17, this equation produces a chaotic time series with a strange attractor of fractal dimension d = 2.1. The input data are vectors x(t) = (x(t), x(t - Delta), x(t - 2 Delta), x(t - 3 Delta))^T. The learning task is to predict the value x(t + P). To facilitate comparison, we adopt the standard choice Delta = 6 and P = 85. Results with these parameters have been reported, e.g., in Hartmann and Keeler (1991), Lapedes and Farber (1987), and Moody and Darken (1988). A dataset of 11,000 points was generated by Runge-Kutta integration with 30 steps per time unit. The training set C500 consists of data points T = 1000...1500, C5000 of data points from T = 1000...6000. We used different numbers of training epochs with samples randomly chosen from the training set. The performance was measured on an independent test set of 5000 samples reaching from T = 6000...11,000. All results are averages over ten runs. The error measure is the normalized root mean square error (NRMSE), i.e., predicting the average target value of the dataset yields an error value of 1.
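The series generation and the error measure can be sketched as below. This is a minimal sketch: the paper integrates with fourth-order Runge-Kutta, whereas a simple Euler scheme (30 steps per time unit) is used here for brevity, and the constant initial history x0 = 1.2 is an assumed, hypothetical choice.

```python
import numpy as np

def mackey_glass(n_points, a=0.2, b=0.1, tau=17, dt=1.0 / 30.0, x0=1.2):
    """Euler integration of dx/dt = -b*x(t) + a*x(t-tau)/(1 + x(t-tau)**10),
    sampled once per time unit (dt = 1/30 gives 30 steps per unit)."""
    steps_per_unit = int(round(1.0 / dt))
    delay = int(round(tau / dt))
    hist = [x0] * (delay + 1)          # constant history before t = 0
    series = []
    for step in range(n_points * steps_per_unit):
        x_t, x_lag = hist[-1], hist[-1 - delay]
        hist.append(x_t + dt * (-b * x_t + a * x_lag / (1.0 + x_lag ** 10)))
        if (step + 1) % steps_per_unit == 0:
            series.append(hist[-1])
    return np.array(series)

def nrmse(pred, target):
    """Normalized RMSE: predicting the mean of the targets gives 1.0."""
    pred, target = np.asarray(pred), np.asarray(target)
    return np.sqrt(np.mean((pred - target) ** 2)) / np.std(target)
```

Dividing the RMSE by the standard deviation of the targets is what makes the trivial mean predictor score exactly 1, as stated above.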
Acknowledgments. We gratefully acknowledge the insightful comments of the reviewers. This work was supported by the German Ministry of Research and Technology (BMFT), Grant ITN9104AO. Any responsibility for the contents of this publication is with the authors.

7For both tasks, this limit is independent of the dimensionality of the data sets, due to the mutual independence of the superposed distributions.
References
Atkeson, C. 1992. Memory-based approaches to approximating continuous functions. In Nonlinear Modeling and Forecasting, M. Casdagli and S. Eubank, eds., Vol. XII of SFI Studies in the Sciences of Complexity, pp. 503-521. Addison-Wesley, New York.
Baffes, P., and Zelle, J. 1992. Growing layers of perceptrons: Introducing the extentron algorithm. Proc. Int. Joint Conf. Neural Networks, Vol. II, pp. 392-397.
Baum, E., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Cleveland, W., Devlin, S., and Grosse, E. 1988. Regression by local fitting: Methods, properties, and computational algorithms. J. Econometrics 37, 87-114.
Crowder, R. S. 1990. Predicting the Mackey-Glass timeseries with cascade-correlation learning. In Connectionist Models: Proceedings of the 1990 Summer School, D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton, eds. Morgan Kaufmann, San Mateo, CA.
Fahlman, S. E. 1992. Personal communication.
Fahlman, S. E., and Lebiere, C. 1990. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 524-532. Morgan Kaufmann, San Mateo, CA.
Farmer, J., and Sidorowich, J. 1987. Predicting chaotic time series. Phys. Rev. Lett. 59, 845-848.
Frean, M. 1990. The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Comp. 2, 198-209.
Friedman, J. H., Bentley, J. L., and Finkel, R. A. 1977. An algorithm for finding best matches in logarithmic expected time. ACM Transact. Math. Software 3, 209-226.
Hartmann, E., and Keeler, J. D. 1991. Predicting the future: Advantages of semilocal units. Neural Comp. 3, 566-578.
Jacobs, R., Jordan, M., Nowlan, S., and Hinton, G. 1991. Adaptive mixtures of local experts. Neural Comp. 3, 79-87.
Kohonen, T. 1982. Self-Organization and Associative Memory. Springer Series in Information Sciences 8. Springer-Verlag, Berlin.
Kohonen, T. 1988. Statistical pattern recognition with neural networks: Benchmarking studies. Proc. ICNN. IEEE Neural Network Council.
Kohonen, T. 1990. The self-organizing map. Proceedings IEEE 78, 1464-1480.
Lapedes, A., and Farber, R. 1987. Nonlinear Signal Processing Using Neural Networks: Prediction and System Modeling. Tech. Rep. TR LA-UR-87-2662, Los Alamos National Laboratory, Los Alamos, NM.
LeCun, Y., Denker, J. S., and Solla, S. A. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 598-605. Morgan Kaufmann, San Mateo, CA.
Littmann, E., and Ritter, H. 1992. Cascade network architectures. Proc. Int. Joint Conf. Neural Networks, Vol. II, pp. 398-404.
Littmann, E., and Ritter, H. 1994. Analysis and Applications of the Direct Cascade Architecture. Tech. Rep. TR 94-2, Department of Computer Science, Bielefeld University, Bielefeld, Germany.
Mackey, M., and Glass, L. 1977. Oscillations and chaos in physiological control systems. Science 197, 287-289.
Mezard, M., and Nadal, J.-P. 1989. Learning in feedforward layered networks: The tiling algorithm. J. Phys. A 22, 2191-2204.
Moody, J., and Darken, C. 1988. Learning with localized receptive fields. In Connectionist Models: Proceedings of the 1988 Summer School, pp. 133-143. Morgan Kaufmann, San Mateo, CA.
Nabhan, T., and Zomaya, A. 1994. Toward generating neural network structures for function approximation. Neural Networks 7, 89-99.
Nowlan, S. J., and Hinton, G. E. 1991. Evaluation of adaptive mixtures of competing experts. In Advances in Neural Information Processing Systems 3, D. S. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Omohundro, S. 1991. Bumptrees for efficient function, constraint, and classification learning. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 693-699. Morgan Kaufmann, San Mateo, CA.
Poggio, T., and Girosi, F. 1990. Regularization algorithms for learning that are equivalent to multilayer networks. Science 247.
Ritter, H. 1991. Learning with the self-organizing map. In Artificial Neural Networks 1, T. Kohonen, K. Makisara, O. Simula, and J. Kangas, eds., pp. 357-364. Elsevier Science Publishers B.V., North-Holland.
Ritter, H., Martinetz, T., and Schulten, K. 1992. Neural Computation and Self-Organizing Maps: An Introduction. Addison-Wesley, New York. (English and German.)
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536.
Shepard, D. 1968. A two-dimensional interpolation function for irregularly spaced data. Proc. 23rd Natl. Conf. ACM, 517-523.
Stokbro, K., Umberger, D., and Hertz, J. 1990. Exploiting neurons with localized receptive fields to learn chaos. Complex Syst. 4, 603-622.
Werbos, P. 1974. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Harvard University Committee of Applied Mathematics, Cambridge, MA.
Received April 5, 1994, accepted February 13, 1996
Communicated by Terrence Sejnowski
Hybrid Modeling, HMM/NN Architectures, and Protein Applications
Pierre Baldi, Division of Biology, California Institute of Technology, Pasadena, CA 91125 USA
Yves Chauvin, Net-ID, Inc., San Francisco, CA 94107 USA
We describe a hybrid modeling approach where the parameters of a model are calculated and modulated by another model, typically a neural network (NN), to avoid both overfitting and underfitting. We develop the approach for the case of Hidden Markov Models (HMMs), by deriving a class of hybrid HMM/NN architectures. These architectures can be trained with unified algorithms that blend HMM dynamic programming with NN backpropagation. In the case of complex data, mixtures of HMMs or modulated HMMs must be used. NNs can then be applied both to the parameters of each single HMM, and to the switching or modulation of the models, as a function of input or context. Hybrid HMM/NN architectures provide a flexible NN parameterization for the control of model structure and complexity. At the same time, they can capture distributions that, in practice, are inaccessible to single HMMs. The HMM/NN hybrid approach is tested, in its simplest form, by constructing a model of the immunoglobulin protein family. A hybrid model is trained, and a multiple alignment derived, with less than a fourth of the number of parameters used with previous single HMMs.
1 Introduction: Hybrid Modeling
One fundamental step in scientific reasoning is the inference of parameterized probabilistic models to account for a given data set D. If we identify a model M(theta), in a given class, with its parameter vector theta, then the goal is to approximate the distribution P(theta|D), and often to find its mode max_theta P(theta|D). Problems, however, arise whenever there is a mismatch between the complexity of the model and the data. Too complex models result in overfitting; too simple models result in underfitting. The hybrid modeling approach attempts to finesse both problems. When the model is too complex, it is reparameterized as a function of
Neural Computation 8, 1541-1565 (1996) (c) 1996 Massachusetts Institute of Technology
a simpler parameter vector w, so that theta = f(w).1 When the data are too complex, short of resorting to a different model class, the only solution is to model the data with several M(theta)s, with theta varying discretely or continuously across different regions of data space. Thus the parameters must be modulated as a function of input, or context, in the form theta = f(I). In the general case, both may be desirable, so that theta = f(w, I). This approach is hybrid, in the sense that the function f can belong to a different model class. Since neural networks (NNs) have well-known universal approximation properties, a natural approach is to compute f with an NN, but other representations are possible. This approach is also hierarchical, because model reparameterizations can easily be nested at several levels. Here, for simplicity, we confine ourselves to a single level of reparameterization. For concreteness, we focus on a particular class of probabilistic models, namely Hidden Markov Models (HMMs), and their application in molecular biology. To overcome the limitations of simple HMMs, we propose to use hybrid HMM/NN architectures2 that combine the expressive power of artificial NNs with the sequential time series aspect of HMMs. It is, of course, not the first time HMMs and NNs are combined. Hybrid architectures have been used both in speech and cursive handwriting recognition (Bourlard and Morgan 1994; Cho and Kim 1995). In many of these applications, however, NNs are used as front end processors to extract features, such as strokes, characters, and phonemes. HMMs are then used in higher processing stages for word and language modeling.3 The HMM and NN components are often trained separately, although there are some exceptions (Bengio et al. 1995). A different type of hybrid architecture is also described in Cho and Kim (1995), where the NN component is used to classify the pattern of likelihoods produced by several HMMs.

1Classical Bayesian hierarchical modeling relies on the description of a parameterized prior P(w|alpha), where alpha are the hyperparameters. This is related to the present situation theta = f(w), provided a prior P(w) is defined on the new parameters. 2HMM/NN architectures were first described at a NIPS94 workshop (Vail, CO) and at the International Symposium on Fifth Generation Computer Systems (Tokyo, Japan), in December 1994. Preliminary versions were published in the Proceedings of the Symposium, and in the Proceedings of the ISMB95 Conference. 3In the molecular biology applications to be considered, NNs could conceivably be used to interpret the analog output of various sequencing machines, but this is definitely not the focus here.
Here, in contrast, the HMM and NN components are inseparable. This yields, among other things, unified training algorithms where the HMM dynamic programming and the NN backpropagation blend together. In what follows, we first briefly review HMMs, how they can be used to model protein families, and their limitations. In Section 3, we develop HMM/NN hybrid architectures for single models, to address the problem of parameter complexity and control of overfitting. Simulation results are presented in Section 4 for a simple HMM/NN hybrid architecture used
to model a particular protein family (immunoglobulins). In Section 5, we discuss HMM/NN hybrid architectures for multiple models, to address the problem of long-range dependencies or underfitting.

2 HMMs of Protein Families

Many problems in computational molecular biology can be cast in terms of statistical pattern recognition and formal languages (Searls 1992). The increasing abundance of sequence data creates a favorable situation for machine learning approaches, where grammars are learned from the data. In particular, HMMs are equivalent to stochastic regular grammars and have been extensively used to model protein families and DNA coding regions (Baldi et al. 1994a,b; Krogh et al. 1994a; Baldi and Chauvin 1994a; Krogh et al. 1994b). Proteins consist of polymer chains of amino acids. There are 20 important amino acids, so that proteins can be viewed as strings of letters, over a 20-letter alphabet. Protein sequences with a common ancestor share functional and structural properties, and can be grouped into families. Aligning sequences in a family is important, for instance to detect highly conserved regions, or motifs, with particular significance. Multiple alignment of highly divergent families where, as a result of evolutionary insertions and deletions, pairs of sequences often share less than 20% amino acids, is a highly nontrivial task (Meyers 1994). A first-order discrete HMM can be viewed as a stochastic generative model defined by a set of states S, an alphabet A of M symbols, a probability transition matrix T = (t_ij), and a probability emission matrix E = (e_iX). The system randomly evolves from state to state, while emitting symbols from the alphabet. When the system is in a given state i, it has a probability t_ij of moving to state j, and a probability e_iX of emitting symbol X. As in the application of HMMs to speech recognition, a family of proteins can be seen as a set of different utterances of the same word, generated by a common underlying HMM.
One of the standard HMM architectures for protein applications (Krogh et al. 1994a) is the left-right architecture depicted in Figure 1. The alphabet has M = 20 symbols, one for each amino acid (M = 4 for DNA or RNA models, one symbol per nucleotide). In addition to the start and end state, there are three classes of states: the main states, the delete states, and the insert states, with S = {start, m1, ..., mN, i1, ..., iN+1, d1, ..., dN, end}. N is the length of the model, typically equal to the average length of the sequences in the family. The main and insert states always emit an amino-acid symbol, whereas the delete states are mute. The linear sequence of state transitions start -> m1 -> m2 -> ... -> mN -> end is the backbone of the model. For each main state, corresponding insert and delete states are needed to model insertions and deletions. The self-loop on the insert states allows for multiple insertions at a given site.
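The generative behavior described above can be sketched as a random walk that emits a symbol in each state and then draws a successor from the transition matrix. This is a minimal sketch: the three-state model, its matrices, and the two-letter alphabet are hypothetical toy values, not the left-right protein architecture of Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hmm(T, E, start, end, alphabet, rng, max_len=100):
    """Generate one sequence from a discrete HMM.  T[i, j] is the
    probability of moving from state i to state j; E[i, x] is the
    probability of emitting alphabet[x] while in state i."""
    n_states = T.shape[0]
    state, out = start, []
    while state != end and len(out) < max_len:
        out.append(alphabet[rng.choice(len(alphabet), p=E[state])])
        state = rng.choice(n_states, p=T[state])
    return "".join(out)

# toy 3-state model over a 2-letter alphabet (state 2 is the end state)
T = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
E = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
seq = sample_hmm(T, E, start=0, end=2, alphabet="AB", rng=rng)
```

A family of sequences drawn this way plays the role of the "different utterances of the same word" in the speech analogy.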
Figure 1: Example of HMM architecture used in protein modeling. S is the start state, E the end state. d_i, m_i, and i_i denote delete, main, and insert states, respectively.

2.1 Learning Algorithms. Given a sample of K training sequences O_1, ..., O_K, the parameters of an HMM can be iteratively modified, in an unsupervised way, to optimize the data fit according to some measure, usually based on the likelihood of the data. Since the sequences can be considered as independent, the overall likelihood is equal to the product of the individual likelihoods. Two target functions, commonly used for training, are the negative log-likelihood:

$$Q = \sum_{k=1}^{K} Q_k = -\sum_{k=1}^{K} \ln P(O_k) \qquad (2.1)$$

and the negative log-likelihood based on the optimal paths:

$$Q = \sum_{k=1}^{K} Q_k = -\sum_{k=1}^{K} \ln P[\pi(O_k)] \qquad (2.2)$$

where pi(O) is the most likely HMM production path for sequence O. pi(O) can be computed efficiently by dynamic programming (Viterbi algorithm). Depending on the situation, the Viterbi path approach can be considered as a fast approximation to the full maximum likelihood, or as an algorithm in its own right. This can be the case in protein modeling where, as described below, the optimal paths play an important role.
When priors on the parameters are included, one can also add regularizer terms to the objective functions for maximum a posteriori (MAP) estimation. Different algorithms are available for HMM training, including the Baum-Welch or expectation-maximization (EM) algorithm, and different forms of gradient descent and other generalized EM (GEM) (Dempster et al. 1977; Rabiner 1989; Baldi and Chauvin 1994a) algorithms. In the Baum-Welch algorithm, the parameters are updated according to

$$t_{ij} = \frac{n_{ij}}{n_i}, \qquad e_{iX} = \frac{m_{iX}}{m_i} \qquad (2.3)$$

where m_i = sum_Y m_iY (respectively n_i = sum_j n_ij) and m_iX (respectively n_ij) are the normalized4 expected emission (respectively transition) counts, induced by the data, that can be calculated using the forward-backward dynamic programming procedure (Rabiner 1989), or the Viterbi paths in Viterbi learning. As for gradient descent, and other GEM algorithms, a useful reparameterization (Baldi and Chauvin 1994b), in terms of normalized exponentials, consists of

$$t_{ij} = \frac{e^{w_{ij}}}{\sum_l e^{w_{il}}}, \qquad e_{iX} = \frac{e^{v_{iX}}}{\sum_Y e^{v_{iY}}} \qquad (2.4)$$

with w_ij and v_iX as the new variables. This reparameterization has two advantages: (1) modification of the ws and vs automatically preserves normalization constraints on emission and transition distributions; and (2) transition and emission probabilities can never reach the absorbing value 0. The on-line gradient descent equations on the negative log-likelihood are then

$$\Delta w_{ij} = \eta\,(n_{ij} - n_i\,t_{ij}), \qquad \Delta v_{iX} = \eta\,(m_{iX} - m_i\,e_{iX})$$

where eta is the learning rate. The variables n_ij, n_i, m_iX, m_i are again the expected counts derived by the forward-backward procedure, for each single sequence if the algorithm is to be used on-line. Similarly, in Viterbi learning, at each step along a Viterbi path, and for any state i on the path, the parameters of the model are updated according to

$$\Delta w_{ij} = \eta\,(T_{ij} - t_{ij}), \qquad \Delta v_{iX} = \eta\,(E_{iX} - e_{iX})$$

where T_ij = 1 (respectively E_iX = 1) if the i -> j transition (respectively emission of X from i) is used, and 0 otherwise. The new parameters are therefore updated incrementally, using the discrepancy between the frequencies induced by the training data and the probability parameters of the model.

4Unlike in Baldi and Chauvin (1994b), throughout this paper we use the more classical notation of Rabiner (1989) where the counts, for a given sequence, automatically incorporate a normalization by the probability P(O) of the sequence itself.
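The normalized-exponential reparameterization of equation 2.4 can be sketched as a row-wise softmax over unconstrained weights; the raw weight values below are hypothetical.

```python
import numpy as np

def softmax_rows(W):
    """Equation 2.4: map each row of raw weights W to a probability
    distribution.  Any unconstrained additive update to W (such as the
    gradient steps above) leaves the resulting rows normalized and
    strictly positive."""
    Z = np.exp(W - W.max(axis=1, keepdims=True))   # shift for stability
    return Z / Z.sum(axis=1, keepdims=True)

# hypothetical raw transition weights for a 3-state model
W = np.array([[0.0, 1.0, -1.0],
              [2.0, 0.0,  0.0],
              [0.5, 0.5,  0.5]])
Tmat = softmax_rows(W)   # a valid stochastic matrix, rows sum to 1
```

This is exactly why the gradient updates on w and v need no projection step: normalization is built into the parameterization, and no probability can collapse to the absorbing value 0.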
Regardless of the training method, once an HMM has been successfully trained on a family of sequences, it can be used in a number of tasks. For instance, for any given sequence, one can compute its most likely path, as well as its likelihood. A multiple alignment results immediately from aligning all the optimal paths. The likelihoods can be used for discrimination tests and data base searches (Krogh et al. 1994a; Baldi and Chauvin 1994b). In the case of proteins, HMMs have been successfully applied to several families such as globins, immunoglobulins, kinases, and G-protein-coupled receptors. In most cases, HMMs have performed well on all tasks, yielding, for instance, multiple alignments that are comparable to those derived by human experts.

2.2 Limitations of HMMs. In spite of their success in various applications, HMMs can suffer from two weaknesses. First, they often have a large number of unstructured parameters. In the case of protein models, the architecture of Figure 1 has a total of approximately 49N parameters (40N emission parameters and 9N transition parameters). For a typical protein family, N is of the order of a few hundreds, resulting immediately in models with over 10,000 free parameters. This can lead to overfitting when only a few sequences are available,5 not an uncommon situation in early stages of genome projects. Second, first-order HMMs are limited with respect to dependencies between hidden states, found in most interesting problems. Proteins, for instance, fold into complex 3D shapes, essential to their function. Subtle long-range correlations in their polypeptide chains may exist that are not accessible to a single HMM. For instance, assume that whenever X is found at position i, it is generally followed by Y at position j; and whenever X' is found at position i, it tends to be followed by Y' at j. A single HMM typically has two fixed emission vectors associated with the i and j positions. Therefore it cannot capture such correlations. Related problems are also the nonstationarity of complex time series, as well as the variability often encountered in "speaker-independent" recognition problems. Only a small fraction of distributions over the space of possible sequences, essentially the factorial distributions, can be represented by a reasonably constrained HMM.6

3 HMM/NN Hybrid Architectures: Single Model Case

3.1 Basic Idea. In a general HMM, an emission or transition vector theta is a function of the state i only: theta = f(i). The first basic idea is to have
5It should be noted, however, that a typical sequence provides on the order of 2N constraints, and 25 sequences or so provide a number of examples in the same range as the number of HMM parameters. 6Any distribution can be represented by a single exponential size HMM, with a start state connected to different sequences of deterministic states, one for each possible alphabet sequence, with a transition probability equal to the probability of the sequence itself.
an NN on top of the HMM, for the computation of the HMM parameters, that is, for the computation of the function f. NNs are universal approximators and, therefore, can represent any f. More importantly perhaps, NN representations enable the flexible introduction of many possible constraints. For simplicity, we discuss emission parameters only, but the approach extends immediately to transition parameters as well. In the reparameterization of 2.4, we can consider that each one of the HMM emission vectors is calculated by a small NN, with one input set to one (bias), no hidden layers, and 20 softmax output units (Fig. 2a). The connections between the input and the outputs are the v_iX. This can be generalized immediately by having arbitrarily complex NNs for the computation of the HMM parameters. The NNs associated with different states can also be linked with one or several common hidden layers, the overall architecture being dictated by the problem at hand. In the case of a discrete alphabet, however, such as for proteins, the emission of each state is a multinomial distribution, and, therefore, the output of the corresponding network should consist of M softmax units. As a simple example, consider the hybrid HMM/NN architecture of Figure 2b consisting of the following:
1. Input layer: one unit for each state i. At each time, all units are set to 0, except one which is set to 1. If unit i is set to 1, the network computes $e_{iX}$, the emission distribution of state i.
2. Hidden layer: H hidden units indexed by h, each with transfer function $f_h$ (logistic by default) and bias $b_h$ (H < M).
3. Output layer: M softmax units or weighted exponentials, indexed by X, with bias $b_X$.
4. Connections: $\alpha = (\alpha_{ih})$ connects input position i to hidden unit h; $\beta = (\beta_{Xh})$ connects hidden unit h to output unit X.
For input i, the activity of the hth unit in the hidden layer is given by

$$h_{ih} = f_h(\alpha_{ih} + b_h) \qquad (3.1)$$

The corresponding activity in the output layer is

$$e_{iX} = \frac{\exp(\sum_h \beta_{Xh} h_{ih} + b_X)}{\sum_Y \exp(\sum_h \beta_{Yh} h_{ih} + b_Y)} \qquad (3.2)$$
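As an illustration (not part of the original simulations), equations 3.1 and 3.2 above can be sketched in a few lines of Python, assuming logistic hidden units (the default) and a numerically stabilized softmax; the function name and the plain-list parameterization are arbitrary:

```python
import math

def emission_distribution(i, alpha, beta, b_h, b_out):
    """Sketch of equations 3.1-3.2: the emission vector of HMM state i,
    computed by the network of Figure 2b (one-hot input over the states,
    logistic hidden layer, softmax output layer)."""
    H = len(b_h)    # number of hidden units
    M = len(b_out)  # alphabet size (M = 20 for proteins)
    # 3.1: for a one-hot input selecting state i, only alpha[i][h] contributes
    hidden = [1.0 / (1.0 + math.exp(-(alpha[i][h] + b_h[h]))) for h in range(H)]
    # 3.2: softmax over the M output units
    net = [sum(beta[X][h] * hidden[h] for h in range(H)) + b_out[X]
           for X in range(M)]
    z = max(net)  # subtract the max to stabilize the exponentials
    exps = [math.exp(n - z) for n in net]
    total = sum(exps)
    return [e / total for e in exps]
```

For any state i, the returned vector is positive and sums to one, as required of an emission distribution.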
For hybrid HMM/NN architectures, a number of points are worth noticing:

- The HMM states can be partitioned into different groups, with different networks for different groups. In protein applications, for instance, one can use different NNs for insert states and for main states, or for different groups of states along the protein sequence corresponding, for instance, to different regions (hydrophobic, hydrophilic, alpha-helical, etc.).
Pierre Baldi and Yves Chauvin
1548
Figure 2: (a) Schematic representation of the simple HMM/NN hybrid architecture used in Baldi et al. (1994b). Each HMM state has its own NN. Here, the NNs are extremely simple, with no hidden layer, and an output layer of softmax units computing the state emission, or transition, parameters. Only output emissions are represented for simplicity. (b) Schematic representation of an HMM/NN architecture where the NNs associated with different states (or different groups of states) are connected via one or several hidden layers.
- HMM parameter reduction can easily be achieved using small hidden layers with H hidden units, H small compared to N or M. In the example of Figure 2b, with H hidden units and considering only main states, the number of parameters is H(N + M) in the HMM/NN architecture, versus NM in the corresponding simple HMM. For protein models, this yields roughly HN parameters for the HMM/NN architecture, versus 20N for the simple HMM. H = M is equivalent to 2.4.

- The number of parameters can be adaptively adjusted to variable training set sizes, merely by changing the number of hidden units. This is useful in environments with large variations in data base sizes, as in current molecular biology applications.

- The entire bag of connectionist tricks can be brought to bear on these architectures: radial basis functions, multiple hidden layers, sparse connectivity, weight sharing, gaussian priors, and hyperparameters.

- Several initializations and structures can be implemented in a flexible way. For instance, by allocating different numbers of hidden units to different subsets of emissions or transitions, it is easy to favor certain classes of paths in the models, when needed. In the HMM of Figure 1, for instance, one must introduce a bias favoring main states over insert states, prior to any learning. It is also easy to tie different regions of a protein that may have similar properties by weight sharing, and to implement other types of long-range correlations. By setting the output bias to the proper values, the model can be initialized to the average composition of the training sequences, or any other useful distribution.

- Classical prior information, in the form of substitution matrices for instance, is easily incorporated. Substitution matrices (Altschul 1991) can be computed from data bases, and essentially produce a background probability matrix $P = (p_{XY})$, where $p_{XY}$ is the probability that X is changed into Y over a certain evolutionary time. P can be implemented as a linear transformation in the emission NN.

- HMMs with continuous emission distributions are also easy to incorporate in the HMM/NN framework. The output emission distributions can be represented, for instance, in the form of samples, moments, and/or mixture coefficients. In the classical mixture of gaussians case, means, covariances, and mixture coefficients can be computed by the NN. Likewise, additional HMM parameters, such as exponential parameters to model the duration of stay in any given state, can be calculated by a NN.
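The parameter-reduction point above can be made concrete with two small counting functions (an illustrative sketch; biases are ignored, as in the H(N + M) estimate of the text):

```python
def hmm_params(N, M):
    """Emission parameters of a simple HMM: one M-nomial distribution
    per main state."""
    return N * M

def hybrid_params(N, M, H):
    """Emission parameters of the HMM/NN of Figure 2b: N*H input-to-hidden
    weights plus H*M hidden-to-output weights, i.e., H(N + M)."""
    return H * (N + M)

# Immunoglobulin-sized example from the text: N = 117 main states, M = 20.
print(hmm_params(117, 20))        # 2340 for the simple HMM
print(hybrid_params(117, 20, 2))  # 274 for H = 2 hidden units
```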
With hybrid HMM/NN architectures, the M step of the EM algorithm cannot, in general, be carried out analytically. One can still use, however, some form of gradient descent via the chain rule, by computing the derivatives of the target likelihood functions 2.1 or 2.2 with respect to the HMM parameters, and then the derivatives of the HMM parameters
with respect to the NN parameters. For completeness, a derivation of the learning equations for the HMM/NN architecture described above is given in the Appendix. In the resulting learning equations (A.3 and A.7), the HMM dynamic programming and the NN backpropagation components are intimately fused. These algorithms can also be seen as GEM (generalized EM) algorithms (Dempster et al. 1977). They can easily be modified for MAP optimization with inclusion of priors.
3.2 Representation in Simple HMM/NN Architectures. Consider the particular HMM/NN described above, where a subset of the HMM states is fully connected to H hidden units, and the hidden units are fully connected to M softmax output units. The hidden unit bias is not really necessary, in the sense that for any HMM state i, any vector of biases $b_h$, and any vector of connections $\alpha_{ih}$, there exists a new vector of connections $\alpha'_{ih}$ that produces the same vector of hidden unit activations with 0 bias. This is not true in the general case, for instance as soon as there are multiple hidden layers, or if the input units are not fully connected to the hidden layer. We have left the biases in for the sake of generality, and also because even if the biases do not enlarge the space of possible representations, they may still facilitate the learning procedure. Similar remarks hold more generally for the transfer functions. With an input layer fully connected to a single hidden layer, the same hidden layer activation can be achieved with different activation functions, by modifying the weights. A natural question to ask is: what is the representation used in the hidden layer, and what is the space of emission distributions achievable in this fashion? Each HMM state in the network can be represented by a point in the $[-1,1]^H$ hypercube. The coordinates of a point are the activities of the H hidden units. By changing its connections to the H hidden units, an HMM state can occupy any position in the hypercube. So the space of achievable emission distributions is entirely determined by the connections from the hidden to the output layer. If these connections are held fixed, then each HMM state can select a corresponding optimal position in the hypercube, where its emission distribution, generated by the NN weights, is as close as possible to the truly optimal distribution, for instance in cross-entropy distance.
During on-line learning, all parameters are learned at the same time, so this may introduce additional effects. To further understand the space of achievable distributions, consider the transformation from hidden to output units. For notational convenience, we introduce one additional hidden unit, numbered 0 and always set to 1, to express the output biases in the form $b_X = \beta_{X0}$. If, in this extended hidden layer, we turn a single hidden unit to 1, one at a time, we obtain H + 1 different emission distributions in the output layer, $P^h = (p^h_X)$
$(0 \le h \le H)$, with

$$p^h_X = \frac{\exp(\beta_{Xh})}{\sum_Y \exp(\beta_{Yh})} \qquad (3.3)$$
Consider now a general pattern of activity in the hidden layer, of the form $(1, h_1, \ldots, h_H)$ (with $h_0 = 1$). Using 3.2 and 3.3, the emission distribution in the output layer is then

$$e_X = \frac{\exp(\sum_{h=0}^{H} \beta_{Xh} h_h)}{\sum_Y \exp(\sum_{h=0}^{H} \beta_{Yh} h_h)} \qquad (3.4)$$

After simplifications, this yields

$$e_X = \frac{\prod_{h=0}^{H} (p^h_X)^{h_h}}{\sum_Y \prod_{h=0}^{H} (p^h_Y)^{h_h}} \qquad (3.5)$$
Therefore, all the emission distributions achievable by the NN have the form of 3.5, and can be viewed as "combinations" of H + 1 fundamental distributions $P^h$ associated with each single hidden unit. In general, this combination is different from a convex linear combination of the $P^h$. It consists of three operations: (1) raising each component of $P^h$ to the power $h_h$, the activity of the hth hidden unit, (2) multiplying all the corresponding vectors componentwise, and (3) normalizing. In this form, the hybrid HMM/NN approach is different from a mixture of Dirichlet distributions approach.
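The equivalence between the softmax form 3.4 and the product form 3.5 is easy to verify numerically; the following sketch (with arbitrary random weights and activities) checks that the two expressions agree:

```python
import math
import random

random.seed(0)
H, M = 3, 5
# beta[X][h] for h = 0..H; column h = 0 plays the role of the output bias
beta = [[random.uniform(-1.0, 1.0) for _ in range(H + 1)] for _ in range(M)]
h_act = [1.0] + [random.uniform(-1.0, 1.0) for _ in range(H)]  # (1, h_1, ..., h_H)

def softmax(v):
    z = max(v)
    exps = [math.exp(x - z) for x in v]
    s = sum(exps)
    return [x / s for x in exps]

# 3.4: softmax of the weighted hidden activities
direct = softmax([sum(beta[X][h] * h_act[h] for h in range(H + 1))
                  for X in range(M)])

# 3.3: fundamental distribution P^h, obtained by turning on hidden unit h alone
P = [softmax([beta[X][h] for X in range(M)]) for h in range(H + 1)]

# 3.5: componentwise product of the P^h raised to the activities, normalized
prod = [math.prod(P[h][X] ** h_act[h] for h in range(H + 1)) for X in range(M)]
s = sum(prod)
combined = [p / s for p in prod]

assert all(abs(a - b) < 1e-9 for a, b in zip(direct, combined))
```

The normalization constants of the individual $P^h$ cancel in 3.5, which is why the two forms coincide exactly.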
4 Simulation Results
Here we demonstrate a simple application of the principles behind HMM/NN hybrid architectures on the immunoglobulin protein family. Immunoglobulins, or antibodies, are proteins produced by B cells that bind with specificity to foreign antigens in order to neutralize them, or target their destruction by other effector cells. The various classes of immunoglobulins are defined by pairs of light and heavy chains that are held together principally by disulfide bonds^7 (Fig. 3). Each light and heavy chain molecule contains one variable (V) region, and one (light) or several (heavy) constant (C) regions. The V regions differ among immunoglobulins, and provide the specificity of the antigen recognition. About one-third of the amino acids of the V regions form the hypervariable sites, responsible for the diversity of the vertebrate immune response. Our data base is the same as the one used in Baldi et al. (1994b), and consists of human and mouse heavy chain immunoglobulin V region sequences from the Protein Identification Resource (PIR) data base. It contains 224 sequences, with minimum length 90, average length N = 117, and maximum length 254.

^7 Disulfide bonds are covalent bonds between two sulfur atoms in different amino acids (typically cysteines) of a protein that are important in determining secondary and tertiary structure.
Figure 3: A model of the structure of a typical human antibody molecule, composed of two light and two heavy polypeptide chains. Interchain and intrachain disulfide bonds are indicated. Cysteine (C) residues are associated with the bonds. Two identical active sites for antigen binding, corresponding to the variable regions, are located in the arms of the molecule. (From Molecular Biology of the Gene, Vol. II, Fourth Edition, by Watson et al. Copyright © 1987 by James D. Watson. Published by The Benjamin/Cummings Publishing Company.)

For the immunoglobulin V regions, our results (Baldi et al. 1994b) were obtained by training a simple HMM, similar to the one in Figure 1, containing a total of 52N + 23 = 6107 adjustable parameters. Here we train a hybrid HMM/NN architecture with the following characteristics. The basic model is an HMM with the architecture of Figure 1. All the main state emissions are calculated by a common NN, with 2 hidden units. Likewise, all the insert state emissions are calculated by a common NN, with one hidden unit only. Each state transition distribution is calculated by a different softmax network, as in our previous work. With edge effects neglected, the total number of parameters of this HMM/NN architecture is 1507 (117 × 3 × 3 = 1053 for the transitions, 117 × 3 + 3 + 3 × 20 + 40 = 454 for the emissions, including biases). This architecture is not at all optimized: for instance, we suspect we could have significantly reduced the number of transition parameters. Our goal at this time is only to demonstrate the general HMM/NN principles, and to test the learning algorithm. The hybrid architecture is then trained on-line, using both gradient descent (A.3) and the Viterbi version (A.7). The training set consists of a random subset of 150 sequences, identical to the training set used previously. There, emission and transition parameters were initialized uniformly. Here, the input-to-hidden weights are initialized with independent gaussians, with mean 0 and standard deviation 1. The hidden-to-output weights are initialized to 1. This yields a uniform emission probability distribution on all the emitting states.^8 Notice also that if all the weights are initialized to 1, including those from input to hidden layer, then the hidden units cannot differentiate from each other. The transition probabilities out of insert or delete states are initialized uniformly to 1/3. We introduce, however, a small bias along the backbone that favors main-to-main transitions, in the form of a Dirichlet prior. This prior is equivalent to introducing a regularization term in the objective function, equal to the logarithm of the backbone transition path. The regularization constant is set to 0.01, and the learning rate to 0.1. Typically, 10 training cycles are more than sufficient to reach equilibrium. In Figure 4, we display the multiple alignment of 20 immunoglobulin sequences, selected randomly from both the training and validation sets. The validation set consists of the remaining 74 sequences.
This alignment is very stable between 5 and 10 epochs.^9 It corresponds to a model trained by A.7 for 10 epochs. While there is currently no universally accepted measure of the quality of an alignment, the present alignment is similar to the previous one, derived with a simple HMM with more than four times as many parameters. The algorithm has been able to detect most of the salient features of the family. Most importantly, the cysteine residues (C) toward the beginning and the end of the region (positions 24 and 100 in this alignment), which are responsible for the disulfide bonds that hold the chains, are perfectly aligned. The only exception is the last sequence (PH0097), which has a serine (S) residue in its terminal portion. This is a rare but recognized exception to the conservation of this position. Some of the sequences in the family came with a "header" (transport signal peptide). We did not remove the headers

^8 With Viterbi learning, this is probably better than a nonuniform initialization, such as the average composition. A nonuniform initialization may introduce distortions in the Viterbi paths.
^9 Differences with the alignment published in the ISMB95 Proceedings result from differences in regularization, and not in the number of training cycles.
prior to training. The model is capable of detecting and accommodating these headers, by treating them as initial repeated inserts, as can be seen from the alignment of three of the sequences (S09711, A36194, S11239). This multiple alignment also contains a few isolated problems, related in part to the overuse of gaps and insert states. Interestingly, this is most evident in the hypervariable regions, for instance at positions 30-35 and 50-55. These problems should be eliminated with a more careful selection of hybrid architecture and/or regularization. Alignments did not improve using A.3 and/or a larger number of hidden units, up to 4. In Figure 5, we display the activity of the two hidden units associated with each main state (see 3.2). For most states, at least one of the activities is saturated. The activities associated with the cysteine residues responsible for the disulfide bridges (main states 24 and 100) are all saturated, and in the same corner (-1, +1). Points close to the center (0, 0) correspond to emission distributions determined by the bias only. For the main states, the three emission distributions of equation 3.3, associated with the bias and the two hidden units, are given by
P^0 = (0.442, 0.000, 0.005, 0.000, 0.001, 0.000, 0.004, 0.002, 0.133, 0.000, 0.000, 0.000, 0.000, 0.113, 0.195, 0.000, 0.104, 0.001, 0.000, 0.000)

P^1 = (0.000, 0.000, 0.000, 0.036, 0.000, 0.900, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.037, 0.000, 0.000, 0.000, 0.000, 0.000, 0.027)

and

P^2 = (0.000, 0.040, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.942, 0.001, 0.000, 0.016, 0.000, 0.000, 0.000, 0.000, 0.001, 0.000, 0.000)
using alphabetical order on the single-letter amino acid symbols.

5 Discussion: The Case of Multiple Models
The hybrid HMM/NN architectures described so far address the first limitation of HMMs: the control of model structure and complexity. No matter how complex the NN component, however, the final model remains a single HMM. Therefore the second limitation of HMMs, long-range dependencies and underfitting, remains. This obstacle cannot be overcome by simply resorting to higher-order HMMs, which are most often computationally intractable. A possible approach is to try to introduce a new state for each relevant context. This requires a systematic method for determining relevant contexts of variable lengths directly from the data. Furthermore, one must
Figure 4: Multiple alignment of 20 immunoglobulin sequences, randomly extracted from the training and validation data sets. Validation sequences: F37262, G1HUDW, A36194, A31485, D33548, S11239, I27888, A33989, A30502. The alignment is obtained with a hybrid HMM/NN architecture trained for 10 cycles, with two hidden units for the main state emissions, and one hidden unit for the insert state emissions. Lower case letters correspond to emissions from insert states. Notice the initial header (transport signal peptide) on some of the sequences, captured as repeated transitions through the first insert state in the model. The cysteines (C), associated with the disulfide bridges, in columns 24 and 100, are perfectly aligned (PH0097 is a known biological exception).
Figure 5: Activity of the two hidden units associated with the emission of the main states. The two activities associated with the cysteines (C) are in the upper left corner, almost overlapping, with coordinates (-1, +1).
hope that the number of relevant contexts remains small. An interesting approach along these lines can be found in Ron et al. (1994), where English is modeled as a Markov process with variable memory length of up to 10 letters or so. To address the second limitation without resorting to a different model class, one must consider more general HMM/NN hybrid architectures, where the underlying statistical model is a set of HMMs. To see this, consider again the X-Y/X'-Y' problem. To capture such dependencies requires variable emission vectors at the corresponding locations, together with a linking mechanism. In this simple case, four different emission vectors are needed: $e_X$, $e_Y$, $e_{X'}$, and $e_{Y'}$. Each one of these vectors must assign a high probability to the letters X, Y, X', and Y', respectively. More importantly, there must be some kind of memory, so that $e_X$ and $e_Y$ are used for sequence O, and $e_{X'}$ and $e_{Y'}$ are used for sequence O'. The combination of $e_X$ and $e_{Y'}$ (or $e_{X'}$ and $e_Y$) should be rare or not allowed, unless required by the data. Thus $e_X$ and $e_Y$ must belong to a first HMM, and $e_{X'}$ and $e_{Y'}$ to a second HMM, with the possibility of switching from one HMM to the other, as a function of the input sequence. Alternatively,
there must be a single HMM, but with variable emission distributions, modulated again by some input. In both cases, then, we consider that the emission distribution of a given state depends not only on the state itself, but also on an additional stream of information I. That is, now H = f(i, I). In a multiple HMM/NN hybrid architecture, f can again be computed by a NN. Depending on the problem, the input I can assume different forms, and may be called "context" or "latent variable." When feasible, I may even be equal to the currently observed sequence O. Other inputs are, however, possible, over different alphabets. An obvious candidate in protein modeling tasks would be the secondary structure of the protein (alpha-helices, beta-sheets, and coils). In general, I could also be any other array of numbers representing latent variables for the HMM modulation (MacKay 1994). We shall now describe, without any simulations, two simple but somewhat canonical architectures of this sort. Learning is briefly discussed in the Appendix.

5.1 Example 1: Mixtures of HMM Experts. A first possible approach is to put an HMM mixture distribution on the sequences. With n HMMs $M_1, \ldots, M_n$,

$$P(O) = \sum_{k=1}^{n} \lambda_k P_{M_k}(O) \qquad (5.1)$$
where $\sum_k \lambda_k = 1$, and the $\lambda_k$ are the mixture coefficients. Similarly, the Viterbi likelihood is $\max_k \lambda_k P[\pi_{M_k}(O)]$. In generative mode, sequences are produced at random by each individual HMM, and $M_k$ is selected with probability $\lambda_k$. Such a system can be viewed as a larger single HMM, with a starting state connected to each one of the HMMs $M_k$, with transition probability $\lambda_k$ (Fig. 6). This type of model is used in Krogh et al. (1994a) for unsupervised classification of globin protein sequences. Notice that the parameters of each submodel can be computed by an NN to create an HMM/NN hybrid architecture. Since the HMM experts form a larger single HMM, the corresponding hybrid architecture is also identical to what we have seen in the section on single HMMs. The only peculiarity is that states have been replicated, or grouped, to form different submodels. One further step is to have variable mixture coefficients that depend on the input sequence, or some other relevant information. These mixture coefficients can be computed as softmax outputs of an NN, as in the mixture of experts architecture of Jacobs et al. (1991).

5.2 Example 2: Mixtures of Emission Experts. A different approach is to modulate a single HMM by considering that the emission parameters $e_{iX}$ should also be a function of the additional input I. So $e_{iX} = P(i, X, I)$.
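In code, the mixture likelihood of 5.1 and its Viterbi analogue reduce to a weighted sum and a weighted maximum; the sketch below (illustrative only) treats the submodel likelihoods $P_{M_k}(O)$ as precomputed numbers:

```python
def mixture_likelihood(lambdas, submodel_likelihoods):
    """Equation 5.1: P(O) = sum_k lambda_k P_{M_k}(O), with the lambda_k
    summing to one. `submodel_likelihoods` holds the values P_{M_k}(O)
    for a fixed observation sequence O."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(lam * p for lam, p in zip(lambdas, submodel_likelihoods))

def viterbi_mixture_likelihood(lambdas, submodel_viterbi):
    """Viterbi analogue: max_k lambda_k P[pi_{M_k}(O)]."""
    return max(lam * p for lam, p in zip(lambdas, submodel_viterbi))
```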
Figure 6: Schematic representation of the type of multiple HMM architecture used in Krogh et al. (1994a) for discovering subfamilies within a protein family. Each "box," between the start and end states, corresponds to an HMM with the architecture of Figure 1.

Without any loss of generality, we can assume that P is a mixture of emission experts $P_j$:
$$P(i, X, I) = \sum_{j=1}^{n} \lambda_j(i, X, I)\, P_j(i, X, I) \qquad (5.2)$$
In many interesting cases, $\lambda_j$ is independent of X, resulting in the probability vector equation, over the alphabet:

$$P(i, I) = \sum_{j=1}^{n} \lambda_j(i, I)\, P_j(i, I) \qquad (5.3)$$

If n = 1 and P(i, I) = P(i), we are back to a single HMM. An important special case is derived by further assuming that $\lambda_j$ does not depend on i, and $P_j(i, X, I)$ does not depend on I explicitly. Then

$$P(i, I) = \sum_{j=1}^{n} \lambda_j(I)\, P_j(i) \qquad (5.4)$$

This provides a principled way of designing the top layers of general hybrid HMM/NN architectures, such as the one depicted in Figure 7.
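The vector equation 5.3 can be sketched as follows, with the mixture coefficients $\lambda_j(i, I)$ produced by a hypothetical softmax gating network (here reduced to a vector of unnormalized gate scores):

```python
import math

def softmax(v):
    z = max(v)
    exps = [math.exp(x - z) for x in v]
    s = sum(exps)
    return [x / s for x in exps]

def modulated_emission(gate_scores, expert_dists):
    """Equation 5.3: P(i, I) = sum_j lambda_j(i, I) P_j(i, I), as a vector
    over the alphabet. `gate_scores` are the (hypothetical) unnormalized
    outputs of the gating network for state i and input I; `expert_dists`
    holds the expert emission vectors P_j, each summing to one."""
    lam = softmax(gate_scores)  # mixture coefficients: nonnegative, sum to 1
    M = len(expert_dists[0])
    return [sum(lam[j] * expert_dists[j][X] for j in range(len(lam)))
            for X in range(M)]
```

Since the gate is a softmax, the output is always a convex combination of the expert distributions, and hence itself a probability vector.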
Figure 7: Schematic representation of a general HMM/NN architecture, where the HMM parameters are computed by an NN of arbitrary complexity that operates on state information, but also on input or context. The input or context is used to modulate the HMM parameters, for instance by switching or mixing different parameter experts. For simplicity, only emission parameters are represented, with three emission experts and a single hidden layer. Connections from the HMM states to the control network, and from the input to the hidden layer, are also possible.

The components $P_j$ are computed by a NN, and the mixture coefficients by another gating NN. Naturally, many variations are possible and, in the most general case, the switching network can depend on the state i, and the distributions $P_j$ on the input I. In the case of protein modeling, for instance, if the switching depends on the position i, the emission experts could correspond to different types of regions, such as hydrophobic and hydrophilic, rather than to different subclasses within a protein family.
6 Conclusion
A large class of hybrid HMM/NN architectures has been described. These architectures improve on single HMMs in two complementary directions. First, the NN reparameterization provides a flexible tool for the
control of overfitting, the introduction of priors, and the construction of an input-dependent mechanism for the modulation of the final model. Second, modeling a data set with multiple HMMs allows for the coverage of a larger set of distributions, and the expression of nonstationarity and correlations inaccessible to single HMMs. We recently found out that related ideas have been proposed independently in Bengio and Frasconi (1995), but from a different viewpoint, in terms of input/output HMMs. Not surprisingly, these ideas are also related to data compression, information complexity, factorial codes, autoencoding, and generative models [see, for instance, Dayan et al. (1995), and references therein]. The concept of hybrid HMM/NN architecture has been demonstrated, in its simplest form, by providing a model of the immunoglobulin family. The HMM/NN approach is meant to complement rather than substitute for many of the already existing techniques for incorporating prior information in sequence models. Additional work is required to develop optimal architectures and learning algorithms, and to test them on more challenging protein families and other domains. Two important issues for the success of a hybrid HMM/NN architecture on a real problem are the design of the NN architecture and the selection of the external input or context. These issues are problem dependent and cannot be dealt with generally. We have described some examples of architectures using mixture ideas for the design of the NN component. Different input choices are possible, such as contextual information or latent variables, sequences over a different alphabet (for instance, strokes versus letters in handwriting recognition), or just real vectors, in the case of manifold parameterization (MacKay 1994). As pointed out in the introduction, the ideas presented here are not limited to HMMs, or to protein or DNA modeling.
They can be viewed in a more general framework, where a class of parameterized models is first constructed for the data, and then the parameters of the models are calculated, and possibly modulated, by one or several other NNs (or any other flexible reparameterization). In fact, several examples of simple hybrid architectures can be found scattered throughout the literature. A classical case consists of binomial (respectively multinomial) classification models, where membership probabilities are calculated by a NN with a sigmoidal (respectively normalized exponential) output (Rumelhart et al. 1995). Other examples are the master-slave approach of Lapedes and Farber (1986), and the sigmoidal belief networks of Neal (1992), where NNs are used to compute the weights of another NN, or the conditional distributions of a belief network. Although the principle of hybrid modeling is not new, by exploiting it systematically in the case of HMMs, we have generated new classes of models. There are other classes where the principle has not yet been applied systematically. As an example, it is well known that HMMs are equivalent to stochastic regular grammars. The next level in the Chomsky hierarchy is stochastic context-free grammars (SCFGs). One can consider hybrid SCFG/NN architectures, where a NN
is used to compute the parameters of a SCFG, and/or to modulate or mix different SCFGs. Such hybrid grammars might be useful, for instance, in extending the work of Sakakibara et al. (1994) on RNA modeling. Finding optimal architectures for molecular biology applications and other domains, and developing a better understanding of how probabilistic models should be modulated as a function of input or context, are some of the main challenges for hybrid approaches.
7 Appendix

7.1 Learning for Simple HMM/NN Architectures. Here we give on-line equations (batch equations can be derived similarly). For a sequence O, we need to compute the partial derivatives of ln P(O), or ln P[π(O)], with respect to the parameters α, β, and b of the network.
7.1.1 Gradient Learning on the Full Likelihood. Let Q(O) = ln P(O). If $m_{iX}(O)$ is the normalized count for the emission of X from i for O, derived using the forward-backward algorithm (Baldi and Chauvin 1994b), then

$$\frac{\partial Q(O)}{\partial e_{iX}} = \frac{m_{iX}(O)}{e_{iX}} \qquad (A.1)$$

so that, writing $m_i(O) = \sum_Y m_{iY}(O)$ and backpropagating through the softmax output units,

$$\frac{\partial Q(O)}{\partial \mathrm{net}_{iX}} = m_{iX}(O) - m_i(O)\, e_{iX} \qquad (A.2)$$

where $\mathrm{net}_{iX} = \sum_h \beta_{Xh} h_{ih} + b_X$ is the total input of output unit X. The partial derivatives with respect to the network parameters α, β, and b can be obtained by the chain rule, that is, by backpropagating through the network for each i. For each O and i, the resulting on-line learning equations are

$$\Delta\beta_{Xh} = \eta\,[m_{iX} - m_i e_{iX}]\, h_{ih}$$
$$\Delta b_X = \eta\,[m_{iX} - m_i e_{iX}]$$
$$\Delta\alpha_{ih} = \eta\, f'_h(\alpha_{ih} + b_h) \sum_Y \beta_{Yh}\,[m_{iY} - m_i e_{iY}]$$
$$\Delta b_h = \eta\, f'_h(\alpha_{ih} + b_h) \sum_Y \beta_{Yh}\,[m_{iY} - m_i e_{iY}] \qquad (A.3)$$

with $h_{ih} = f_h(\alpha_{ih} + b_h)$, and input unit j set to $\delta_{ij}$ ($\delta_{ii} = 1$, and $\delta_{ij} = 0$ for $j \ne i$). The full gradient results by summing over all sequences and all main states. For instance,

$$\frac{\partial Q}{\partial \beta_{Xh}} = \sum_O \sum_i [m_{iX}(O) - m_i(O)\, e_{iX}]\, h_{ih} \qquad (A.4)$$

and similarly for α and the biases. It is worth noticing that these equations are slightly different from those obtained by gradient descent on the local cross-entropy between the emission distribution $e_{iX}$ and the target distribution $m_{iX}/m_i$.
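The softmax derivative behind A.2 can be checked numerically: for e = softmax(net) and Q = Σ_X m_X ln e_X, the analytic gradient ∂Q/∂net_X = m_X − m e_X (with m = Σ_Y m_Y) should match a finite-difference estimate. A small self-contained check (the numeric values are arbitrary):

```python
import math

def softmax(v):
    z = max(v)
    exps = [math.exp(x - z) for x in v]
    s = sum(exps)
    return [x / s for x in exps]

def Q(net, m):
    """Q = sum_X m_X ln e_X for e = softmax(net)."""
    e = softmax(net)
    return sum(mX * math.log(eX) for mX, eX in zip(m, e))

net = [0.3, -1.2, 0.7]
m = [0.5, 0.2, 1.3]  # normalized counts m_iX (they need not sum to 1)
mtot = sum(m)        # m_i = sum_Y m_iY
e = softmax(net)
analytic = [mX - mtot * eX for mX, eX in zip(m, e)]

eps = 1e-6
for X in range(3):
    bumped = list(net)
    bumped[X] += eps
    numeric = (Q(bumped, m) - Q(net, m)) / eps
    assert abs(numeric - analytic[X]) < 1e-4
```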
7.1.2 Viterbi Learning. Here Q(O) = ln P[π(O)]. The component of this term that depends on emissions from main states, and thus on α, β, and b, along the Viterbi path π = π(O), is given by

$$\sum_{(i,X)\in\pi} \ln e_{iX} = \sum_{i\in\pi} \sum_Y T_{iY} \ln e_{iY} \qquad (A.5)$$

where $T_{iX}$ is the target: $T_{iX} = 1$ if X is emitted from main state i in π(O), and 0 otherwise. Thus computing the gradient of $-Q(O) = -\ln P[\pi(O)]$ with respect to α, β, and b is equivalent to computing the gradient of the local cross-entropy

$$H(T_i, e_i) = -\sum_Y T_{iY} \ln e_{iY} \qquad (A.6)$$
between the target output and the output of the network, over all i in π. This cross-entropy error function, combined with the softmax output unit, is the standard NN framework for multinomial classification (Rumelhart et al. 1995). In summary, the relevant derivatives can be calculated on-line both with respect to the sequences $O_1, \ldots, O_K$ and, for each sequence, with respect to the Viterbi path. For each sequence O, and for each main state i on the Viterbi path π = π(O), the corresponding contribution to the derivative can be obtained by standard backpropagation on $H(T_i, e_i)$. The Viterbi on-line learning equations, similar to A.3, are given by
$$\Delta\beta_{Xh} = \eta\,[T_{iX} - e_{iX}]\, h_{ih}$$
$$\Delta b_X = \eta\,[T_{iX} - e_{iX}]$$
$$\Delta\alpha_{ih} = \eta\, f'_h(\alpha_{ih} + b_h) \sum_Y \beta_{Yh}\,[T_{iY} - e_{iY}]$$
$$\Delta b_h = \eta\, f'_h(\alpha_{ih} + b_h) \sum_Y \beta_{Yh}\,[T_{iY} - e_{iY}] \qquad (A.7)$$

for $(i, X) \in \pi(O)$, with $T_{iX} = 1$, $T_{iY} = 0$ for $Y \ne X$, and $h_{ih} = f_h(\alpha_{ih} + b_h)$. The full gradient is obtained again by summing over all sequences and all main states present in the corresponding Viterbi paths. For instance,

$$\frac{\partial Q}{\partial \beta_{Xh}} = \sum_O \sum_{i \in \pi(O)} [T_{iX} - e_{iX}]\, h_{ih} \qquad (A.8)$$

and similarly for α and the biases.
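The Viterbi on-line equations A.7 amount to one step of backpropagation through the network of Figure 2b, for a single (state, symbol) pair on the Viterbi path. A sketch, assuming logistic hidden units and the learning rate 0.1 used in the simulations:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def viterbi_update(i, X_target, alpha, beta, b_h, b_out, eta=0.1):
    """One on-line step of the Viterbi learning equations (A.7) for a pair
    (state i, emitted symbol X_target) on the Viterbi path. Logistic hidden
    units, softmax outputs; parameters are updated in place."""
    H, M = len(b_h), len(b_out)
    # Forward pass (equations 3.1-3.2)
    hidden = [sigmoid(alpha[i][h] + b_h[h]) for h in range(H)]
    net = [sum(beta[X][h] * hidden[h] for h in range(H)) + b_out[X]
           for X in range(M)]
    z = max(net)
    exps = [math.exp(n - z) for n in net]
    s = sum(exps)
    e = [x / s for x in exps]  # emission distribution e_i
    # T_iX - e_iX, with T the one-hot Viterbi target
    delta = [(1.0 if X == X_target else 0.0) - e[X] for X in range(M)]
    # Backpropagated error for the hidden layer (uses the pre-update beta)
    back = [sum(beta[Y][h] * delta[Y] for Y in range(M)) for h in range(H)]
    for X in range(M):
        for h in range(H):
            beta[X][h] += eta * delta[X] * hidden[h]
        b_out[X] += eta * delta[X]
    for h in range(H):
        fprime = hidden[h] * (1.0 - hidden[h])  # derivative of the logistic
        alpha[i][h] += eta * fprime * back[h]
        b_h[h] += eta * fprime * back[h]
```

Repeated updates on the same (state, symbol) pair drive the emission probability of the target symbol upward, as expected for gradient ascent on ln e.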
7.2 Learning for General HMM/NN Architectures. For a given setting of all the parameters, for a given observation sequence, and for a given input vector I, the general HMM/NN hybrid architectures reduce to a single HMM. The likelihood of a sequence, or some other measure of its fitness with respect to such an HMM, can be computed by dynamic programming. As long as it is differentiable in the model parameters, we can then backpropagate the gradient through the NN, including through the portion of the network depending on I, such as the control network of Figure 7. With minor modifications, this leads to learning algorithms
similar to those described above. This form of learning encourages cooperation between the emission experts of Figure 7. As in the usual mixture of experts architecture of Jacobs ef al. (1991), it may be useful to introduce some degree of competition between the experts, so that each one of them specializes, for instance, on a different subclass of sequences. When the relevant input or hidden variable I is not known, it can be learned together with the model parameters using Bayesian inversion. Indeed, consider for instance the case where there is an input I associated with each observation sequence 0, and a hybrid model with parameters 10, so that we can compute P ( 0 I 1 . z ~ ) .Let P ( I ) and P(zu) denote our priors on 1 and U J . Then
P(I | O, w) = P(O | I, w) P(I) / P(O | w)    (A.9)

with

P(O | w) = ∫ P(O | I, w) P(I) dI    (A.10)
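The integral in A.10 is what a Monte Carlo step approximates: draw samples I_n from the prior P(I) and average P(O | I_n, w). The one-dimensional gaussian prior and toy gaussian likelihood below are illustrative assumptions, not the paper's model:

```python
import numpy as np

def mc_marginal_likelihood(loglik, n_samples=20000, rng=None):
    """Estimate P(O | w) = integral of P(O | I, w) P(I) dI by averaging
    P(O | I_n, w) over draws I_n from a standard gaussian prior P(I)."""
    rng = rng or np.random.default_rng(0)
    I = rng.standard_normal(n_samples)   # samples from the prior P(I)
    return np.exp(loglik(I)).mean()      # (1/N) sum_n P(O | I_n, w)

# Toy likelihood: a single observation O = 0.5 with unit gaussian noise
# centered on I; the exact marginal is then the N(0, 2) density at O.
O = 0.5
loglik = lambda I: -0.5 * (O - I) ** 2 - 0.5 * np.log(2 * np.pi)
est = mc_marginal_likelihood(loglik)
exact = np.exp(-O ** 2 / 4.0) / np.sqrt(4.0 * np.pi)
```

With a few tens of thousands of samples the estimate matches the closed-form marginal to a few decimal places; in the hybrid setting the same average, and its derivatives with respect to w, replace the intractable integral.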
The probability of the model parameters, given the data, can then be calculated using Bayes' theorem again:

P(w | D) = P(w) Π_O P(O | w) / P(D)    (A.11)

assuming the observation sequences O in the data D are independent. These parameters can be optimized by gradient descent on -log P(w | D). The main step is the evaluation of the likelihood P(O | w) (A.10), and its derivatives with respect to w, which can be done by Monte Carlo sampling. The distribution on the latent variables I is calculated by A.9. The work of MacKay (1994) is an example of such a learning approach. The density network used for protein modeling can be viewed essentially as a special case of HMM/NN hybrid architecture, where each emission vector acts as a softmax transformation on a low-dimensional real "hidden" input I, with independent gaussian priors on I and w. The input I modulates the emission vectors, and therefore the underlying HMM, as a function of the sequence.

7.3 Priors. There are many ways to introduce priors in HMMs. Additional work is required to compare them to the present methods. For instance, it is natural to use Dirichlet priors (Krogh et al. 1994a) on multinomial distributions, such as emission vectors over discrete alphabets. It is easy to check that if a multinomial distribution is calculated by a set of normalized exponential output units, a gaussian prior on the weights of these units is in general not equivalent to a Dirichlet prior on the outputs.
Acknowledgments

The work of P.B. is supported by a grant from the ONR. The work of Y.C. is supported in part by Grant R43 LM05780 from the National Library of Medicine. The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official views of the National Library of Medicine.

References

Altschul, S. F. 1991. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219, 1-11.
Baldi, P., and Chauvin, Y. 1994a. Hidden Markov models of the G-protein-coupled receptor family. J. Comp. Biol. 1(4), 311-335.
Baldi, P., and Chauvin, Y. 1994b. Smooth on-line learning algorithms for hidden Markov models. Neural Comp. 6(2), 305-316.
Baldi, P., Brunak, S., Chauvin, Y., and Engelbrecht, J. 1994a. Hidden Markov models of human genes. In Advances in Neural Information Processing Systems, J. D. Cowan, G. Tesauro, and J. Alspector, eds., Vol. 6, pp. 761-768. Morgan Kaufmann, San Mateo, CA.
Baldi, P., Chauvin, Y., Hunkapillar, T., and McClure, M. 1994b. Hidden Markov models of biological primary sequence information. Proc. Natl. Acad. Sci. U.S.A. 91(3), 1059-1063.
Bengio, Y., and Frasconi, P. 1995. An input-output HMM architecture. In Advances in Neural Information Processing Systems, J. D. Cowan, G. Tesauro, and J. Alspector, eds., Vol. 7. Morgan Kaufmann, San Mateo, CA.
Bengio, Y., Le Cun, Y., and Henderson, D. 1995. Globally trained handwritten word recognizer using spatial representation, convolutional neural networks and hidden Markov models. In Advances in Neural Information Processing Systems, J. D. Cowan, G. Tesauro, and J. Alspector, eds., Vol. 6. Morgan Kaufmann, San Mateo, CA.
Bourlard, H., and Morgan, N. 1994. Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic, Boston.
Cho, S. B., and Kim, J. H. 1995. An HMM/MLP architecture for sequence recognition. Neural Comp. 7, 358-369.
Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. 1995. The Helmholtz machine. Neural Comp. 7(5), 889-904.
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B39, 1-22.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991. Adaptive mixtures of local experts. Neural Comp. 3, 79-87.
Krogh, A., Brown, M., Mian, I. S., Sjolander, K., and Haussler, D. 1994a. Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol. 235, 1501-1531.
Krogh, A., Mian, I. S., and Haussler, D. 1994b. A hidden Markov model that finds genes in E. coli DNA. Nucl. Acids Res. 22, 4768-4778.
Lapedes, A., and Farber, R. 1986. A self-optimizing, nonsymmetrical neural net for content addressable memory and pattern recognition. Physica 22D, 247-259.
MacKay, D. J. C. 1994. Bayesian neural networks and density networks. Proceedings of the Workshop on Neutron Scattering Data Analysis and Proceedings of the 1994 MaxEnt Conference, Cambridge (UK).
Myers, E. W. 1994. An overview of sequence comparison algorithms in molecular biology. Protein Sci. 3(1), 139-146.
Neal, R. M. 1992. Connectionist learning of belief networks. Artificial Intelligence 56, 71-113.
Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257-286.
Ron, D., Singer, Y., and Tishby, N. 1994. The power of amnesia. In Advances in Neural Information Processing Systems, J. D. Cowan, G. Tesauro, and J. Alspector, eds., Vol. 6. Morgan Kaufmann, San Mateo, CA.
Rumelhart, D. E., Durbin, R., Golden, R., and Chauvin, Y. 1995. Backpropagation: The basic theory. In Backpropagation: Theory, Architectures and Applications, pp. 1-34. Lawrence Erlbaum, Hillsdale, NJ.
Sakakibara, Y., Brown, M., Hughey, R., Saira Mian, I., Sjolander, K., Underwood, R. C., and Haussler, D. 1994. The application of stochastic context-free grammars to folding, aligning and modeling homologous RNA sequences. UCSC Technical Report UCSC-CRL-94-14.
Searls, D. B. 1992. The linguistics of DNA. Am. Sci. 80, 579-591.
Received February 2, 1995, accepted February 8, 1996
ARTICLE
Communicated by Alain Destexhe
Synchronized Action of Synaptically Coupled Chaotic Model Neurons

Henry D. I. Abarbanel
Department of Physics and Marine Physical Laboratory, Scripps Institution of Oceanography, University of California-San Diego, La Jolla, CA 92093-0402 USA, and Institute for Nonlinear Science, University of California-San Diego, La Jolla, CA 92093-0402 USA

R. Huerta
Institute for Nonlinear Science, University of California-San Diego, La Jolla, CA 92093-0402 USA

M. I. Rabinovich
Institute for Nonlinear Science, University of California-San Diego, La Jolla, CA 92093-0402 USA

N. F. Rulkov
Institute for Nonlinear Science, University of California-San Diego, La Jolla, CA 92093-0402 USA

P. F. Rowat
Department of Biology, University of California-San Diego, La Jolla, CA 92093-0357 USA

A. I. Selverston
Department of Biology, University of California-San Diego, La Jolla, CA 92093-0357 USA

Experimental observations of the intracellular recorded electrical activity in individual neurons show that the temporal behavior is often chaotic. We discuss both our own observations on a cell from the stomatogastric central pattern generator of lobster and earlier observations in other cells. In this paper we work with models of chaotic neurons, building on models by Hindmarsh and Rose for bursting, spiking activity in neurons. The key feature of these simplified models of neurons is the presence of coupled slow and fast subsystems. We analyze the model neurons using the same tools employed in the analysis of our experimental data.

Neural Computation 8, 1567-1602 (1996) © 1996 Massachusetts Institute of Technology
We couple two model neurons both electrotonically and electrochemically in inhibitory and excitatory fashions. In each of these cases, we demonstrate that the model neurons can synchronize in phase and out of phase depending on the strength of the coupling. For normal synaptic coupling, we have a time delay between the action of one neuron and the response of the other. We also analyze how the synchronization depends on this delay. A rich spectrum of synchronized behaviors is possible for electrically coupled neurons and for inhibitory coupling between neurons. In synchronous neurons one typically sees chaotic motion of the coupled neurons. Excitatory coupling produces essentially periodic voltage trajectories, which are also synchronized. We display and discuss these synchronized behaviors using two "distance" measures of the synchronization.

1 Introduction
Individual neurons coupled synaptically to form small functional networks or central pattern generators (CPG) have cooperative properties related to the function they are called on to perform. The cooperative behavior of these coupled cells can be much richer and much more organized than the activity of the individual neurons forming the CPG. The isolated neural cells often exhibit chaotic motions as observed in the characteristics of intracellular voltage measurements (Hayashi and Ishizuka 1992; Aihara and Matsumoto 1986), while some coordination or synchronization of these chaotic activities must be arranged for the directed cooperative behavior of the CPGs to manifest itself. Our goal in this paper is to demonstrate, by examining the synaptic coupling of model neurons, the interesting broad range of cooperative behaviors that arise when chaotic neurons are connected. Further, we want to understand how it is possible that the potentially very complex behaviors that might transpire when chaotic neurons are coupled in fact reduce in a clean, dynamic way to rather simpler, often well-organized motion, even for nonsymmetrical coupling and nonidentical neurons.

Starting from the classical Hodgkin-Huxley (Hodgkin and Huxley 1952) formulation of the dynamics of ionic channels, numerous simpler "reduced" models have been derived and extensively discussed (Rinzel and Ermentrout 1989). To capture both the bursting behavior and the spiking behavior observed in intracellular voltage measurements, one requires a combination of slow and fast subsystems acting in coordination. The fast subsystem is associated with the membrane voltage and rapid conduction channels, typically those due to Na+ and K+. The slow subsystem is critical for bursting behavior on top of which the spikes occur. The differences among the many models with fast and slow subsystems are characterized by the details of how individual neurons, abstracted to such simplified descriptions, produce spikes and bursts, but each contains the main qualitative behaviors seen in the laboratory.

Here we are primarily concerned with the manner in which these neurons act in a cooperative fashion as a result of their synaptic coupling to each other. In this work we have focused on the rich variety of features that can arise through this coupling. An important goal of our research effort is to determine the dynamic variables of such models that are predictive and have sound physiological grounding. To this end we begin with the analysis of data from an isolated neuron from the lobster stomatogastric CPG, then turn to the analysis of model neurons, and from our analyses subsequently turn to the study of further laboratory experiments. For this we have adopted as our description for individual neurons the rather simple three-degree-of-freedom model discussed by Rose and Hindmarsh (RH) (Hindmarsh and Rose 1984; Rose and Hindmarsh 1985). This model has as dynamical variables the membrane potential, an auxiliary variable corresponding to the various ionic channels, and a slow variable associated with other ion dynamics. The model shows bursting, spike phenomena, and chaotic bursting and spiking. All of these neural actions are seen in measurements on neurons in the laboratory (Hayashi and Ishizuka 1992; Selverston 1996). In this paper we concentrate on the synchronized behavior of chaotic RH neurons when they are coupled by the various types of chemical and electrical synaptic coupling observed in nature. Both excitatory and inhibitory forms of chemical synapses will be considered. There are two main issues in beginning this analysis: (1) the chaotic property of the individual neuron dynamics and (2) the interpretation of synchronized behavior when two such chaotic neurons are coupled.
2 Individual Neurons
Rose and Hindmarsh (Hindmarsh and Rose 1984; Rose and Hindmarsh 1985) study a variety of model neurons that are cast into a few ordinary differential equations describing the membrane potential and various aspects of ion conduction important for various functions. Some of their models are two-dimensional; however, such models can never exhibit the chaotic bursting and spiking seen in real neurons. In this regard, they may be seen as oversimplifications of actual neurons that contain numerous types and numbers of channels and may have tens of dynamic degrees of freedom. RH also study three-degree-of-freedom models that, happily, exhibit the broader range of features, including chaotic bursting, seen in laboratory experiments on isolated neurons (Hayashi and Ishizuka 1992). Even these three-dimensional models, however, ignore many ion channels that would, perhaps, be required to extract detailed behavior of any real neuron.
In preparation for adopting one of the three-dimensional models of RH for our study of coupling between chaotic neurons, we begin with an analysis of some experimental observations on real neurons.

2.1 Observed Low-Dimensional Behavior of Neurons. We will be emphasizing the synchronization among model neurons that individually exhibit chaotic behavior. To give a rationale for this emphasis, we first discuss the observed complex dynamical behavior of an isolated cell from the lobster stomatogastric (STG) CPG as observed in our laboratories, that is, an LP neuron from a pyloric CPG. Neurons were isolated from their physiological connections with other neurons in the STG systems either by placing dissociated neurons in culture or by pharmacological blockage of synaptic connections. Establishing chaotic behavior in real time series, in particular neural ones, faces the unavoidable difficulty of noise interference. Although we cannot mathematically rigorously prove that the LP neuron possesses chaotic behavior, we intend to provide indications that support that the behavior of the LP neuron is related to complex or even chaotic dynamics. In this section we will be able to show that external or internal sources of noise cannot suppress traces of a complex low-dimensional system. To help us in this purpose, we study the dynamics of the LP neuron for different values of a control parameter (the injected current). The data consist of intracellular voltage taken at a data rate of 2 kHz. The sampling time is T_s = 0.5 ms. Time series for different values of the current applied to the LP neuron are shown in Figure 1. For each time series we have a total of 10^5 data points available for analysis, around 1 minute of real time data.

2.1.1 Attractors and Noise. The time series depicted in Figure 1a, 1b, and 1c appear periodic with noise that slightly changes the phase of the spikes.
It is reasonable to assume that the noise has about the same level for all studied values of applied current, because the conditions of the experiments were the same for these values of the current. That means that the Fourier power spectrum can give some important information about changing qualitative properties of the time series for different currents and about the onset of chaos. In Figure 2a we display the Fourier power spectrum of the quasi-regular spiking behavior (see Fig. 1a) and the spectrum of the spiking, bursting behavior (see Fig. 1d). The level of the power of irregular pulsation in the second case is 10 times larger than the power level of the noise in the quasi-regular time series with current I = 2 nA (see also Fig. 3a). This means that the second case is either chaotic dynamics or nonlinear amplification of noise. The next step is the discussion of the differences between these two phenomena.

The main feature of dynamical chaos is the existence of instability of the motion. In the system phase space, that means that the trajectories forming a chaotic attractor are unstable.

Figure 1: Membrane potentials of the LP neuron for several values of the applied current. (a) I = 2 nA. (b) I = 1 nA. (c) I = 0 nA. (d) I = -1 nA. (e) I = -2 nA.

Except for some particular cases, the action of small amounts of noise does not alter the properties of dynamical chaos in any significant manner. Noise transfers the motion of the system from one trajectory to another, but these trajectories correspond to the same chaotic attractor, which is only slightly changed by noise. Small noise can qualitatively change the properties of the orbits if the system generating the time series is not structurally stable. To test these possibilities, we reconstruct the phase portraits of the observed motion in some embedding space for different values of a control parameter (applied current).

Figure 2: (a) Fourier power spectra for I = 2 nA (dotted line) and I = 0 nA (solid line). (b) Power spectra for I = 0 nA. One may observe the sharp peak corresponding to the bursting oscillations and the broader spectrum for spiking oscillations.

These results are depicted in Figure 3. Figures 3a and 3b depict limit cycles with some level of noise. Two important messages are extracted from these figures: first, the system is structurally stable because both limit cycles are topologically equivalent for different values of the current; second, the thickness of the attractors, which corresponds to the level of noise, is not changing. Figures 3c and 3d persuade us that we have dynamical chaos here because, despite the presence of noise, the attractors have a very clear robust substructure, which indicates the exponential instability of the trajectories. The spreading of the fast motions on the attractor appears due to the onset of chaos, not due to noise. Finally, we observe that in this case chaos is connected with the fast motion, that is, spiking behavior. The Fourier power spectrum for applied current I = 0 nA (see Fig. 2b) shows a strong peak, which corresponds to the periodicity of slow motion. This observation is important when we consider the number of points in the time series required for estimation of embedding dimensions and Lyapunov exponents.
Figure 3: State-space reconstructions from the isolated LP neuron for different values of the current. They have been obtained by applying singular value decomposition to a time-delay state-space reconstruction using a time delay of 5 msec, which is 10 times the sampling time. This rotates and scales the original coordinates so that the fast spiking motion takes place in the x-y plane and the slow, bursting motion moves along the z-axis. (a) I = 2 nA. (b) I = 1 nA. (c) I = 0 nA. (d) I = -1 nA. (e) I = -2 nA.

2.1.2 Analysis of Observed Data. To establish which of these alternatives to choose, we must create from the observed voltage measurements v(n) = v(t_0 + n T_s) a coordinate system in which the degrees of freedom
are unfolded from their projection on the observation axis of intracellular voltage. If the system is low-dimensional, a few independent coordinates made from v(n) and its time lags v(n + kT) = v(t_0 + (n + kT) T_s) will suffice to unfold the geometric structure typical of a nonlinear dynamical system (Abarbanel et al. 1993; Abarbanel 1995). If the observed time sequence is chaotic, this geometric structure, the attractor of the neuronal system, must be embedded in a coordinate system with dimension three or greater. This means we need to construct vectors in some d-dimensional space within which the neuron moves in time. The structure is parametrically labeled by time, which is discrete because of the realities of the measurement process. The vectors
y(n) = [v(n), v(n + T), ..., v(n + (d - 1)T)]    (2.1)
provide a workable coordinate system in which to unfold the projection of the multidimensional dynamics onto the voltage axis. An initial question that must be answered has to do with the independence of the components of the d-dimensional data vectors y(n); a nonlinear answer
to this question is contained in the average mutual information (Abarbanel 1995; Fraser and Swinney 1986) between components of the vector. Considering the measurements v(n) and v(n + T) as two sets of observations A = {a_i} = {v(n)} and B = {b_j} = {v(n + T)}, the average mutual information I(T), in bits, between these two sets as a function of the time lag T is defined by Fraser and Swinney (1986) and Gallager (1968) as

I(T) = Σ_{a_i, b_j} P_AB(a_i, b_j) log_2 [P_AB(a_i, b_j) / (P_A(a_i) P_B(b_j))]    (2.2)
where P_A(a_i) is the normalized distribution of measurements from set A or the v(n), P_B(b_j) is the normalized distribution of measurements from set B or the v(n + T), and P_AB(a_i, b_j) is the normalized joint distribution of the two sets of measurements. For the data from the isolated neuron taken from the STG CPG, whose time traces were seen in Figure 1c, we have in Figure 4a the average mutual information I(T). It has become a useful and workable prescription to select the time lag corresponding to the first minimum in I(T) as the lag for the construction of the data vectors y(n). From the figure we see that the first minimum is at T = 9 ms.

To determine the number of coordinates required to unfold the geometry of the neuronal attractor, we need to ask in what dimension d_E used in the data vectors y(n) we no longer have overlaps of strands of the orbits because of projection from a higher dimension. In other words, we wish to know when a data point y(n) has neighbors associated with dynamics rather than with an incorrect and too small choice of dimension in which to view the data. The method of false nearest neighbors (Kennel et al. 1992; Abarbanel et al. 1993; Abarbanel 1995) directly answers this question by inquiring whether the nearest neighbor of each data point y(n) in the full data set remains a nearby point when the dimension of the data vector is increased. When this number of true neighbors is 100 percent, we have unfolded the geometry of the dynamics. Additional dimensions for the data space are not required. This criterion is implemented by asking about the percentage of false nearest neighbors; in Figure 4b we see this statistic as evaluated for the time series observed for the intracellular voltage in the isolated STG neuron. We see clearly that at dimension four or five, the number of false neighbors is nearly zero, and remains nearly zero as dimension is increased.
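The histogram estimate of I(T) in equation 2.2 is straightforward to sketch; the test signal, bin count, and lag grid below are illustrative choices, not the experimental protocol:

```python
import numpy as np

def average_mutual_information(v, T, bins=32):
    """Histogram estimate of I(T) (equation 2.2, in bits) between the
    measurements v(n) and v(n + T)."""
    a, b = v[:-T], v[T:]
    p_ab, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = p_ab / p_ab.sum()          # joint P_AB(a_i, b_j)
    p_a = p_ab.sum(axis=1)            # marginal P_A(a_i)
    p_b = p_ab.sum(axis=0)            # marginal P_B(b_j)
    nz = p_ab > 0
    prod = np.outer(p_a, p_b)
    return float((p_ab[nz] * np.log2(p_ab[nz] / prod[nz])).sum())

# A slightly noisy sine sampled at 0.5 ms: I(T) is largest at small lags,
# where v(n + T) is still nearly determined by v(n), and falls off as the
# two measurements decorrelate; the first minimum suggests the lag to use.
rng = np.random.default_rng(2)
t = np.arange(20000) * 0.0005
v = np.sin(2 * np.pi * 1.5 * t) + 0.05 * rng.standard_normal(t.size)
ami = [average_mutual_information(v, T) for T in range(1, 400, 10)]
```

For experimental data one would scan T over the range of interest and read off the first minimum of the resulting curve, as done in Figure 4a.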
A close examination of the residual percentage of false nearest neighbors shows it levels off near 0.75%, characteristic of slightly noisy measurements. This residual noise level is consistent with the environmental status of the experiment. The analysis has thus established that the dynamics of the isolated neuron can be captured in four dimensions. This is a global statement. It may be that the dynamics actually lies in a lower dimensional space, but because of the twisting and folding of the trajectory due to the nonlinear processes, the global space required to undo this twisting and folding
Figure 4: (a) The average mutual information as defined by equation 2.2 for the observations of intracellular voltage in a single neuron from the lobster STG CPG. The first minimum of this is shallow and is located at T = 9 ms. (b) Global false nearest neighbors for the intracellular voltage observations on a single neuron from the lobster STG CPG. The percentage of false nearest neighbors essentially reaches zero at global embedding dimension d_E = 4. There is some residual in this statistic for dimensions greater than this, reflecting high-dimensional contamination (i.e., noise). (c) Local false nearest neighbors for the intracellular voltage observations on a single neuron from the lobster STG CPG. The percentage of bad predictions P_K becomes independent of local dimension and of the number of neighbors used to make local predictions at d_L = 3. (d) Lyapunov exponents for the single LP neuron from the lobster STG CPG. We worked in d_E = d_L = 3 and used the time lag T = 9 suggested by average mutual information. The system exhibits one positive Lyapunov exponent, one zero exponent indicating that differential equations govern the data, and one negative exponent.
is larger than the local dynamics in this space. To establish if this is the case, we utilize the test of local false nearest neighbors (Abarbanel and Kennel 1993; Abarbanel 1995), which asks what is happening locally on the attractor and does so in a space larger than or equal to the dimension established in the global false nearest neighbors test. Here this means
looking in global dimension five or greater to see what local dimension of the dynamics captures the variation of the data. We have examined the STG isolated neuron data in global dimension d_E = 12 to give us a long region of dimensions in which the local false neighbors can show independence of dimension and of number of neighbors. We know that for any global dimension greater than five or six, the number of global false neighbors is zero, so any working dimension for local false neighbors greater than this will do. In practice, it is useful to choose the working dimension somewhat larger than the known unfolding dimension. Then we ask, as a function of local dimension d_L and number of neighbors N_B, how well a local polynomial predictor from the neighborhood around data point y(i) produces the observed points in the neighborhood around y(i + 1). When the number of false predictions P_K as shown in Figure 4c becomes independent of d_L and N_B, we have determined a local dimension for the dynamics. In Figure 4c we see that d_L = 3 is selected by the data. This implies that we can make models of the behavior of this isolated neuron with three degrees of freedom. (More details on local dynamical dimension can be found in Abarbanel 1995 and Abarbanel et al. 1993.)

A final question we will address concerning the possibility of chaos in the observed behavior of the isolated STG neuron regards the global Lyapunov exponents (Abarbanel et al. 1993; Abarbanel 1995; Oseledec 1968), which characterize the stability of orbits when slight perturbations are made to them. Chaotic behavior is established by the presence of one or more positive global Lyapunov exponents. In Figure 4d we show the Lyapunov exponents for the isolated STG neuron as a function of the number of time steps. From this we can see that there is one positive global Lyapunov exponent for this neuron, one zero exponent that tells us that ordinary differential equations govern the behavior of this system, and one negative exponent. A fractal dimension (Abarbanel et al. 1993; Abarbanel 1995) can be established from these global exponents; it is approximately 2.75. With these analyses in place, we have understood that the dynamics of this neuron are chaotic. The key element in this was our evaluation of the global Lyapunov exponents of the dynamics represented by the measured intracellular voltage trace in the reconstructed phase space using time delays of the measurements. A positive global Lyapunov exponent is the hallmark of chaotic behavior in a dynamical system (Abarbanel 1995).
2.1.3 Chaotic Neurons. The evidence we have presented for chaotic behavior in observed neuron activity is another example of low-dimensional chaos in neurons, which goes along with many other confirmations of this phenomenon. Hayashi and Ishizuka (1992) describe in detail a series of experiments on a molluscan pacemaker neuron, which show that when certain levels of current are injected into the neuron, chaotic-appearing behavior is seen. The analysis of that experiment used tools such as Poincaré sections and phase-space portraits, which are convincing but not quantitative. The observations of Hayashi and Ishizuka were on isolated neurons as in the observations we report here. Mpitsos et al. (1988) also present evidence for chaotic activity in Pleurobranchaea californica using phase-space portraits and a computation of the correlation dimension from their observations. The neurons in the study of Mpitsos et al. were not isolated but were part of intact circuits. We agree with Hayashi and Ishizuka (1992) in their conclusion that chaotic oscillations in neurons are expected to be a normal state of neural activity. Indeed, when one considers the genericity of chaos (Guckenheimer and Holmes 1983) in dynamical systems described by three or more ordinary differential equations, coupled with the numerous ion channels operating in neural behavior, one should anticipate chaotic oscillations as the normal state of activity.

2.2 Individual Rose and Hindmarsh Model Neurons. The model neuron we adopt here is familiar and has been explored extensively in the papers of Rose and Hindmarsh (Hindmarsh and Rose 1984; Rose and Hindmarsh 1985) and many subsequent analyses of their models. We will use a three-variable model that describes in dimensionless units the membrane potential x(t), an auxiliary variable y(t) representing a set of fast ion channels connected with aspects of potassium and sodium transport, and finally a "slow" variable z(t), which captures the slower dynamics of yet other ion channels. It is not our intention to repeat in any detail either the RH analysis or that of others in establishing the basis of this model, nor to repeat the analyses of others on the features of this model as a description of neuron activity. This model generates time series that look like the time series of an isolated LP neuron (Fig. 1c), and its strange attractor has the same topology as the one depicted in Figure 3c.
Instead, our starting point is that this model contains the appropriate mix of slow and fast dynamics to describe adequately the bursting and spiking behavior of observations in neural systems, and we shall discuss at some length in subsequent sections the interesting and fascinating phenomena that transpire when these model neurons are coupled together. First we establish some general aspects of the RH model, whose variables satisfy the differential equations
dx/dt = y + φ(x) - z + I
dy/dt = ψ(x) - y                              (2.3)
dz/dt = r[S(x - c_x) - z]

where the two functions φ(x) and ψ(x) are determined from empirical observations on voltage-current relations. The choice suggested by Rose
and Hindmarsh varies with the system they are describing in detail, but the following are conventional:

φ(x) = ax^2 - x^3,    ψ(x) = 1 - bx^2,    (2.4)

where a = 3 and b = 5. Similarly, the linear appearance of the membrane voltage in the governing equation for the slow variable z(t) is discussed at length by Hindmarsh and Rose. For some situations they choose an exponential dependence on membrane voltage for this term, but we have settled on the simpler linear description. It is our experience with this system that although details do depend on the functional choices, the basic features operating in the communication between and synchronization of neurons are not dependent on much beyond the appearance of slow and fast subsystems as featured in the model we adopt. The other parameters in the equations are chosen in our calculations as the injected current I = 3.281, the voltage c_x = -1.6, the scale of the influence of the membrane voltage on the slow dynamics S = 4.0, and the time scale for the slow adaptation current r = 0.0021. Little, if anything, in our analysis of synchronized behavior of these model neurons depends on the specific values of these parameters. This choice does result in chaotic behavior. To see this we display in Figure 5 two samples of the same time series from the solution of these equations. The equations were solved with a very fine time step using a fourth-order Runge-Kutta scheme. This solution oversampled the waveforms but ensured that all interesting variation in the dynamics would be represented. The output from the solution was desampled to a dimensionless time step Δt = 1.0 to produce the displays in Figure 5. Many key features of this model are captured in the simple properties of the vector field (the right-hand side) of equation 2.3 in the case when z(t) is ignored. We then have three equilibrium points and a stable limit cycle on the (x(t), y(t)) phase plane. These are shown in Figure 6.
The stable separatrix of the equilibrium saddle point, labeled RS for resting state in Figure 6, is the basin boundary for the limit cycle. When one changes the initial condition for this reduced model, or equivalently adds a short impulse through the synaptic current I, it is possible to bump the system behavior from one attractor, namely, the stable equilibrium point, to the other attractor, namely, the stable limit cycle. If we hold the slow variable z at zero, then when the model neuron is properly triggered, it can enter an indefinitely long period of repetitive firing; this is just the limit cycle behavior. When the slow dynamics is turned on, however, the repetitive firing can cease, and the neuron will return to its resting state. This resting state corresponds to the saddle point RS. After a relaxation time of order 1/r, the system may move back to z ≈ 0, and the repetitive firing will resume. As it resumes, depending on the system parameters, one will see regular firing, namely, periodic motions, or chaotic time traces.
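As a concrete illustration, the model and integration scheme described above can be sketched in a few lines. The linear z equation below, dz/dt = r[S(x - c_z) - z], is our reading of the "simpler linear description" mentioned in the text; the original expression is not reproduced in this copy, so treat its exact form as an assumption.

```python
import numpy as np

# Rose-Hindmarsh-type model, equations 2.3/2.4:
#   dx/dt = y + phi(x) - z + I,  phi(x) = a*x^2 - x^3
#   dy/dt = psi(x) - y,          psi(x) = 1 - b*x^2
#   dz/dt = r*(S*(x - c_z) - z)  <- assumed linear form for the slow variable
a, b = 3.0, 5.0
I, S, c_z, r = 3.281, 4.0, -1.6, 0.0021

def hr_rhs(state):
    x, y, z = state
    return np.array([
        y + a * x**2 - x**3 - z + I,
        1.0 - b * x**2 - y,
        r * (S * (x - c_z) - z),
    ])

def rk4_step(state, dt):
    # Fourth-order Runge-Kutta, as used in the text.
    k1 = hr_rhs(state)
    k2 = hr_rhs(state + 0.5 * dt * k1)
    k3 = hr_rhs(state + 0.5 * dt * k2)
    k4 = hr_rhs(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def simulate(state0, n_steps, dt=0.05):
    traj = np.empty((n_steps, 3))
    state = np.asarray(state0, dtype=float)
    for i in range(n_steps):
        state = rk4_step(state, dt)
        traj[i] = state
    return traj

traj = simulate([-1.0, 0.0, 0.0], 40000)
```

The initial condition and step size here are our own choices; the text only says a "very fine time step" was used and the output later downsampled.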
Henry D. I. Abarbanel et al.
3 Synaptic Coupling of Two RH Neurons

3.1 Electrical Coupling. We now turn our attention to the behavior of two of our model neurons when they are coupled. Three kinds of coupling are of interest here. The first is a simple electrical coupling that treats the channel between the neurons as a wire with no structural
Figure 5: (a) Time series from our model neuron, equation 2.3, with parameters I = 3.281, S = 4.0, c_z = -1.6, and r = 0.0021. (b) Blowup of part of the previous time series from the RH neuron.
Figure 6: The nullclines of the RH model.
properties. We model this electrotonic coupling as

dx_1/dt = y_1 + a x_1^2 - x_1^3 - z_1 + I + ε [x_2(t) - x_1(t)] + η(t),
dy_1/dt = 1 - b x_1^2 - y_1,
dz_1/dt = r [S (x_1 - c_z) - z_1],   (3.1)

with the corresponding equations for neuron 2 obtained by exchanging the indices 1 and 2. These are two identical model neurons of the form discussed above, coupled with a parameter ε, which is a conductance for the "wire" connecting them. The quantity η(t) is gaussian white noise, which has zero
mean and RMS amplitude σ. We add this noise term, which we restrict to a very small amplitude, for two reasons:

1. All laboratory measurements are noisy. To provide a simple source for this continual perturbation of our coupled systems, we have placed these nondeterministic or high-dimensional fluctuations in the coupling mechanism among the chaotic neurons.

2. When one couples together model neurons with three degrees of freedom each, the total possible phase space can become both quite large and rich in fine structure and details of basin boundaries, which, since all measurements are really noisy, we have no chance of observing in the laboratory. To smooth out these complex details, we attribute some noise to the coupling among neurons. We could have attributed the noise to the individual neurons themselves and accomplished essentially the same goal.
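A minimal numerical sketch of the coupled pair, under one reading of the electrotonic coupling of equation 3.1 (a diffusive term ε(x_j - x_i) with the small gaussian noise placed in the coupling, as described above). The Euler integrator, initial conditions, and the antisymmetric way the noise enters the two neurons are our own choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 3.0, 5.0
I, S, c_z, r = 3.281, 4.0, -1.6, 0.0021

def coupled_rhs(s, eps, noise):
    # s = [x1, y1, z1, x2, y2, z2]; diffusive ("electrotonic") coupling
    # eps*(x_j - x_i), with the small white noise carried in the coupling.
    x1, y1, z1, x2, y2, z2 = s
    return np.array([
        y1 + a*x1**2 - x1**3 - z1 + I + eps*(x2 - x1) + noise,
        1.0 - b*x1**2 - y1,
        r*(S*(x1 - c_z) - z1),
        y2 + a*x2**2 - x2**3 - z2 + I + eps*(x1 - x2) - noise,
        1.0 - b*x2**2 - y2,
        r*(S*(x2 - c_z) - z2),
    ])

def run(eps, n=60000, dt=0.02, sigma=0.001):
    s = np.array([-1.0, 0.0, 0.0, 0.5, 0.0, 0.0])
    diffs = np.empty(n)
    for i in range(n):
        noise = sigma * rng.standard_normal()
        s = s + dt * coupled_rhs(s, eps, noise)  # Euler step for brevity;
        diffs[i] = abs(s[0] - s[3])              # noise handling is schematic
    return diffs

# For strong coupling the membrane-potential difference collapses to the
# noise floor, consistent with the synchronization claim of this section.
diffs = run(eps=1.0)
```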
This electrical coupling has been well studied in the physical sciences literature (Fujisaka and Yamada 1983; Afraimovich et al. 1986; Rulkov et al. 1992), and it is known that it often possesses a submanifold of the full six-dimensional phase space where

x_1(t) = x_2(t),   y_1(t) = y_2(t),   and   z_1(t) = z_2(t).   (3.2)
On this submanifold we clearly have identical chaotic oscillations of the two neuron oscillators, for on this manifold the coupling term is precisely zero and the individual dynamics is just the same as if the neurons were not coupled at all. When the submanifold is stable, that is, when it is an attractor, the neurons are synchronized yet chaotic. We can show that the synchronization of chaotic motions will occur with certainty if the coupling ε is strong enough. To see that, introduce the difference variables for the coupled system

u(t) = x_1(t) - x_2(t),   v(t) = y_1(t) - y_2(t),   w(t) = z_1(t) - z_2(t),   (3.3)

representing the distance from the submanifold. When η(t) = 0, the equations governing u(t), v(t), and w(t) are

du/dt = v + [a (x_1 + x_2) - (x_1^2 + x_1 x_2 + x_2^2)] u - w - 2εu,
dv/dt = -b (x_1 + x_2) u - v,
dw/dt = r [S u - w],   (3.4)

with a = 3.0 and b = 5.0.
If we consider the function

L(t) = v(t)^2 / 2 + 2a u(t)^2 / b^2 + w(t)^2 / (2rS),   (3.5)

then from the equations of motion we find

dL(t)/dt = -ρ(ε, t) u(t)^2 - [8a u(t) - b^2 v(t) + 4b u(t) (x_1(t) + x_2(t))]^2 / (16a b^2),   (3.6)

with ρ(ε, t) defined in equation 3.7. When ρ(ε, t) > 0, L(t) will be a Lyapunov function for the coupled system of neurons. In this case both the fixed point (u(t), v(t), w(t)) = (0, 0, 0) and the submanifold above will be stable for any initial condition. This translates to the statement that synchronized motions are ensured when the coupling ε is large enough that the condition of equation 3.8 holds.
In the Appendix we show that all trajectories in the full six-dimensional space of the coupled neurons move in a bounded domain of the full space and cannot depart from it. This means that for large enough ε, the synchronization condition is certainly satisfied. Now we look at some numerical evidence for synchronized behavior of these model neurons with electrotonic coupling. We set the root mean square (RMS) level of the white noise at σ = 0.005. To exhibit the synchronization, we introduce two useful measures of the "distance" between the neuron activities. In these measures of distance, we concentrate on the connection between the membrane potentials x_1(t) and x_2(t) in the neurons. Since the neurons are identical, it is sufficient to examine this connection.

3.1.1 "Distances" Between the Coupled Neurons. The first distance measure is essentially the RMS difference between the x_1(t) motion and the x_2(t) motion, with the difference that we recognize from the outset that a shift in the timing of the chaotic motions of the two neurons may occur due to the coupling. To allow for this, we define the average distance D(τ, ε) between the two neural behaviors as
o’(~, E) =
1 Ns
-
N-
C (xl(k) k=,
-
(3.9)
where τ is the possible time shift needed to "align" the two neurons and N_s is the total number of samples. To distinguish between the bursting and spiking motions and examine the synchronization in the bursting alone, we delimit the values of x_i(t) by replacing any values greater than -1 with -1. This eliminates the spikes and retains the bursts in the membrane potential. A low-pass filtering of the data would achieve the same end. When this delimiting is employed, we call the average distance the bursting average distance D_B(τ, ε). In characterizing the synchronization, we have examined D(τ, ε) and D_B(τ, ε) as a function of the coupling ε at that value of τ for which each distance measure is a minimum. This is nothing but tracking along the valley of minimum values of D(τ, ε) or D_B(τ, ε) and labeling the location along that valley by ε. In Figure 7a we show D(τ_min, ε), which is the RMS distance as defined above for the time shift τ_min, which at fixed coupling ε yields the minimum value of D(τ, ε). It is clear that for ε just a bit bigger than 0.5 and in a region about ε ≈ 0.2, the neurons appear synchronized. In Figure 7b we have D_B(τ_min, ε), which exhibits more or less the same characteristics as D(τ_min, ε) but also suggests that some synchronization among the slow or bursting motions of the neurons may be appearing near ε ≈ 0.04. A third characteristic is obtained by looking at the time shifts τ_min that minimize D_B(τ, ε) at each ε. This function τ_min(ε) is shown in Figure 7c. It reveals an interesting new possibility for the synchronized neurons. For couplings in the neighborhood of ε ≈ 0.2, we see that synchronization of the bursts is reached with a nonzero τ_min, which is nearly the same for a range of ε. The synchronization for other values of ε is reached with τ_min ≈ 0.
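The distance statistics D(τ, ε) and D_B(τ, ε) described above can be computed as follows. The clipping level of -1 is as in the text; everything else (array layout, the restriction to τ ≥ 0) is an implementation choice:

```python
import numpy as np

def distance(x1, x2, tau):
    """RMS distance between x1(k) and x2(k + tau) (equation 3.9); tau >= 0."""
    n = len(x1) - tau
    return float(np.sqrt(np.mean((x1[:n] - x2[tau:tau + n]) ** 2)))

def bursting_distance(x1, x2, tau, clip=-1.0):
    """The same statistic after delimiting: values above -1 are replaced
    by -1, which strips the spikes and keeps the bursts."""
    return distance(np.minimum(x1, clip), np.minimum(x2, clip), tau)

def best_shift(x1, x2, taus):
    """tau_min: the shift at which D(tau, eps) is minimal."""
    return min(taus, key=lambda t: distance(x1, x2, t))
```

With x2 a time-shifted copy of x1, best_shift recovers the shift and the distance there vanishes, which is exactly the "valley tracking" used in Figure 7.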
For ε ≈ 0.2 this suggests that the electrically coupled neurons are synchronized but out of phase with each other. This is quite different from the synchronization on the submanifold discussed above, which we know is reached for large ε. The same phenomenon for "limit-cycle" neuron models has been discussed over the past few years (Sherman and Rinzel 1992; Han et al. 1995). Here we have a more general case: regularization of chaos and antiphase locking for electrically coupled chaotic oscillators. Looking now at the time traces of x_1(t) and x_2(t) for ε ≈ 0.2, we can see in Figure 8 visual evidence of the various kinds of phase synchronization. Figure 8a shows the out-of-phase synchronization that occurs when ε ≈ 0.2. In Figure 8b we have ε = 0.4 and see evidence of a partial synchronization of the two coupled HR neurons. When we raise ε to 0.8, as in Figure 8c, we see full synchronization of the two model neurons.

3.1.2 Mutual Information Between Two Coupled Neurons. The key question we are addressing in this paper is the level of synchronization between two chaotic neurons needed to organize them for directed collective behavior. To examine synchronization from this point of view, we
Figure 7: (a) The distance statistic, equation 3.9, for electrically coupled RH neurons. The RMS noise level introduced into the coupling is σ = 0.005. Near ε ≈ 0.2 and for ε ≥ 0.5 the model neurons are synchronized. (b) The bursting distance statistic for electrically coupled RH neurons. The RMS noise level introduced into the coupling is σ = 0.005. Near ε ≈ 0.2 and for ε ≥ 0.5 the model neurons are synchronized. (c) The time at which the distance statistic, equation 3.9, is a minimum as a function of ε. The synchronization near ε ≈ 0.2 apparent in (a) requires a large time delay and is antiphase synchronization. The synchronization for ε ≥ 0.5 is in phase.
now evaluate the average mutual information between measurements of the membrane potential in the first neuron, x_1(t), and the membrane potential in the second neuron, x_2(t + T), at some time lag T between the measurements. To evaluate the average mutual information from the measurements x_1(t) and x_2(t), we bin the data into M bins by defining discrete variables s_1(n) and s_2(n) through

x_i(n) < -2  ⇒  s_i(n) = h_1,
-2 + 4(k - 1)/(M - 2) ≤ x_i(n) < -2 + 4k/(M - 2)  ⇒  s_i(n) = h_{k+1},   k = 1, ..., M - 2,
x_i(n) ≥ 2  ⇒  s_i(n) = h_M,   (3.10)

where h_1, h_2, ..., h_M are any set of M "letters" designating bins for the values of the x_i(n). Using this binning of the data for membrane potential

Figure 8: (a) Membrane potentials x_1(t) and x_2(t) from electrically coupled RH neurons for ε = 0.2. The synchronization is antiphase. (b) Membrane potentials x_1(t) and x_2(t) from electrically coupled RH neurons for ε = 0.4. The synchronization is not complete but nearly in phase. (c) Membrane potentials x_1(t) and x_2(t) from electrically coupled RH neurons for ε = 0.8. The synchronization is complete and in phase.
Figure 9: (a) The average mutual information at the time delay where it is a maximum, as a function of the coupling ε, for electrically coupled RH neurons. The average mutual information is normalized by the entropy of the individual neurons. (b) The time τ_max(ε) at which the average mutual information between electrically coupled RH neurons is a maximum. The antiphase synchronization for ε = 0.2 and the in-phase synchronization for ε ≥ 0.5 are revealed here. (c) The four largest Lyapunov exponents for the electrically coupled RH neurons as a function of the coupling strength ε. These are evaluated directly from the equations of motion. When we have antiphase synchronization near ε ≈ 0.2, the system is not chaotic, since all λ_i ≤ 0. For ε ≥ 0.5, the coupled system is chaotic, as one of the exponents is positive.
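The binning of equation 3.10 and the resulting average mutual information estimate can be sketched as below. The bin edges (one letter below -2, one above 2, M - 2 uniform bins between) follow our reading of the garbled equation, and the plain histogram estimator is a standard choice rather than the paper's stated method:

```python
import numpy as np

def bin_series(x, M=16):
    """Equation 3.10 (our reading): one letter for x < -2, one for x >= 2,
    and M - 2 uniform bins on [-2, 2); returns integer labels 0 .. M-1."""
    edges = np.linspace(-2.0, 2.0, M - 1)
    return np.digitize(x, edges)

def avg_mutual_information(x1, x2, lag, M=16):
    """I(T) between s1(n) = bin(x1(n)) and s2(n) = bin(x2(n + lag)),
    in bits, from the joint histogram of the two symbol streams."""
    n = len(x1) - lag
    s1 = bin_series(x1[:n], M)
    s2 = bin_series(x2[lag:lag + n], M)
    joint = np.zeros((M, M))
    np.add.at(joint, (s1, s2), 1.0)       # joint counts of letter pairs
    joint /= joint.sum()
    p1 = joint.sum(axis=1)                 # marginal of s1
    p2 = joint.sum(axis=0)                 # marginal of s2
    nz = joint > 0.0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / np.outer(p1, p2)[nz])))
```

For identical inputs the statistic equals the entropy of the symbol stream (the normalization used in Figure 9a); for unrelated inputs it is near zero up to finite-sample bias.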
Figure 11: (a) The distance statistic for inhibitory coupling between RH neurons. The RMS noise level introduced into the coupling is σ = 0.005. Near ε ≈ 0.17 and for ε ≥ 0.8 the model neurons are synchronized. (b) The bursting distance statistic for inhibitory coupling between RH neurons. Near ε ≈ 0.2 and for ε ≥ 0.8 the model neurons are synchronized. (c) The time at which the distance statistic is a minimum as a function of ε. The model RH neurons are coupled by inhibitory synaptic coupling. The synchronization near ε ≈ 0.2 is in phase, and for ε ≥ 0.8 it is antiphase. (d) The average mutual information at the time delay where it is a maximum, as a function of the coupling ε, for inhibitory coupling between RH neurons. The average mutual information is maximal for the in-phase synchronization, which occurs for ε > 0.8 and for ε ≈ 0.2. (e) The time τ_max(ε) at which the average mutual information between RH neurons coupled with inhibitory synaptic coupling is a maximum. The antiphase synchronization for ε ≥ 0.8 and the in-phase synchronization for ε ≈ 0.2 are seen here.
Figure 10: (a) The "phase" diagram for inhibitory coupling between the coupled RH neurons as a function of the coupling strength ε and the time delay τ_c in the synaptic coupling. The four labeled regions correspond to different types of synchronization, illustrated by the time traces in the next figures.

3.2 Inhibitory Coupling. Synaptic coupling gives rise to a time delay in the action of one neuron on another. We choose to represent this as a response in the membrane potential of one neuron that is delayed and subject to a threshold over which the potential must rise along with that delay. In the equation for the membrane potential of neuron 1, x_1(t), for example, we will add a response associated with the behavior of neuron 2's potential x_2(t) of the form
ε Θ(x_2(t - τ_c) - X) [V_c - x_1(t)] + η(t),

where, as above, ε is the strength of the coupling and η(t) is a very small zero-mean noise term with RMS magnitude σ. The new ingredient in this coupling term is the thresholding associated with the Heaviside function Θ(w), which is unity for w > 0 and vanishes for w < 0. In addition, we have a reverse potential V_c, which tells us the magnitude of the response to the threshold, and we have a threshold X over which the other neuron's membrane potential must have risen at a time delayed
by τ_c. The response to potential x_1(t) on the part of x_2(t) will simply have 1 → 2 in the coupling term. Inhibitory coupling of the neurons is associated with a reverse potential V_c > 0, while we achieve excitatory coupling by choosing V_c = 0.0. The threshold potential we choose as X = 0.85 in each case. This section considers the result of inhibitory coupling.

Figure 10: Continued. (b) Membrane potentials x_1(t) and x_2(t) from RH neurons with inhibitory coupling in Region I, τ_c = 4.0. The synchronization is nearly complete and in phase. (c) Membrane potentials x_1(t) and x_2(t) from RH neurons with inhibitory coupling in Region I', τ_c = 4.0. The synchronization is complete and in phase. (d) Membrane potentials x_1(t) and x_2(t) from RH neurons with inhibitory coupling in Region II, τ_c = 4.0. The synchronization is complete and antiphase. (e) Membrane potentials x_1(t) and x_2(t) from RH neurons with inhibitory coupling in Region II', τ_c = 1.0. The synchronization is complete and antiphase.

We have the six differential equations among the membrane potentials and the fast and slow auxiliary variables:
dy_1(t)/dt = ψ(x_1(t)) - y_1(t),
dy_2(t)/dt = ψ(x_2(t)) - y_2(t),   (3.15)
and we use the parameters σ = 0.005, X = 0.85, and c_z = -1.6 in our work here; V_c = 1.4 for the inhibitory coupling. The HR neurons we have coupled in this time-delayed fashion are identical, as embodied in our totally symmetric couplings. We shall not consider asymmetric couplings or the coupling of differing neurons in this paper, but we plan to return to these cases. Although there are six ordinary differential equations representing the coupled behavior of two HR neurons, because of the time delay involved in the equations there are now, using the usual description of degrees of freedom in differential equations, an infinite number of degrees of freedom. The time delay leads, via Taylor series of expressions such as x_1(t - τ_c), to an infinite number of derivatives of x_1(t) appearing in the equations. This means that the phase space or state space of these coupled systems could be very large indeed, but we demonstrate below that the phase space occupied by the solutions to these coupled neurons is in fact quite small. We have examined the solutions to these coupled equations using the parameters just mentioned, as well as the parameters used above in our discussion of electrical coupling. We are unable to give the same argument via a Lyapunov function that for large enough coupling ε we will have synchronization between our identical HR neurons, but synchronization does occur, and we have uncovered it by numerical work. The qualitative aspects of this are displayed in Figure 10a, which is a kind of phase diagram for this system. We indicate the different behaviors seen as we vary both the coupling strength ε and the time delay τ_c. To illustrate further the typical behaviors found in the various regions of (τ_c, ε) space, we show in Figure 10b time traces from x_1(t) and x_2(t). The nearly synchronized and nearly in-phase behavior of the two neurons varies little with ε within the region denoted I in Figure 10a. The behavior in Region I' is shown in Figure 10c.
Here the two neurons are completely synchronized and fully in phase. Region II is typified by the time traces in Figure 10d. In this region the neurons are completely out of phase but clearly synchronized. Finally, Region II', as seen in Figure 10e,
shows a slight variation on Region II behavior, with the first spike of neuron 1 being the last synchronized connection with the spiking of neuron 2, which ends at that point. The distinction between these regions is qualitative only, but these example time evolutions serve to demonstrate the wide range of synchronization phenomena one encounters and which we suggest one may look for in the laboratory. Next we examine data from the model neurons with inhibitory coupling at a fixed time delay, τ_c = 4, but as a function of coupling strength ε. Figure 11a is the distance measure we introduced above, and we can see synchronization based on this statistic in the neighborhood of ε ≈ 0.2 and for ε ≥ 0.8. For other values of ε there seems to be little synchronization. The bursting distance is displayed in Figure 11b and in an approximate sense is telling us rather similar information to the regular distance D(τ_min, ε). Figure 11c has the values of τ_min(ε) for this inhibitory data, and we can see that in-phase synchronization can be associated with ε ≈ 0.2, while for ε ≥ 0.8 we have out-of-phase synchronization. This is essentially what the time traces shown above are telling us. The information measure of synchronization, displayed in Figure 11d for the inhibitory coupling case, tells us that for both ε ≈ 0.2 and ε ≥ 0.8 we have high average mutual information between the neurons. Figure 11e, which has the times of maximum mutual information as a function of ε, verifies the earlier comments about in-phase and out-of-phase synchronization. Each of the cases studied across the (τ_c, ε) plane has its own intrinsic interest. Our emphasis here is on the qualitative behaviors seen as we visit regions of this parameter plane. In particular, it is important to note that synchronization both in phase and out of phase occurs across large regions of this parameter space.

3.3 Excitatory Coupling.
Excitatory coupling of the two neurons is described by the same differential equations as above, but now V_c = 0.0. In surveying the (τ_c, ε) plane we found three typical kinds of synchronized behavior. Each is nearly fully synchronized motion of both neurons, but the regions differ by the nature of the actual bursting and spiking characteristics. In Figure 12a we see the first of these, which is a rather familiar kind of neuronal activity. This kind of synchrony occurs over the range of coupling 0.05 ≤ ε ≤ 0.1. Figure 12b shows another kind of synchronous behavior, which appears in the range 0.1 ≤ ε ≤ 0.15. Finally, Figure 12c displays the synchronized behavior that results for 0.15 ≤ ε ≤ 1.0. In each of these examples the synchrony is in phase, essentially complete, and results in periodic motion of the coupled neurons. The distance and information measures for synchronization bear out these qualitative observations. In Figure 13a we have D(τ_min, ε) for excitatory coupling at τ_c = 4. Clearly for ε > 0.2 the synchronization is complete and in phase. D_B(τ_min, ε), shown in Figure 13b, for the same
Figure 12: (a) Membrane potentials x_1(t) and x_2(t) from RH neurons with excitatory coupling, τ_c = 4.0. The synchronization is nearly complete and in phase. For this type of synchronization we have 0.05 ≤ ε ≤ 0.1. The synchronization results in periodic behavior. (b) Membrane potentials x_1(t) and x_2(t) from RH neurons with excitatory coupling, τ_c = 4.0. The synchronization is nearly complete and in phase. For this type of synchronization we have 0.1 ≤ ε ≤ 0.15. The synchronization results in periodic behavior. (c) Membrane potentials x_1(t) and x_2(t) from RH neurons with excitatory coupling, τ_c = 4.0. The synchronization is nearly complete and in phase. For this type of synchronization we have 0.15 ≤ ε ≤ 1.0. The synchronization results in periodic behavior.
coupling shows much the same as D(τ_min, ε). τ_min(ε) confirms the in-phase nature of the synchronization for ε ≥ 0.2; this is seen in Figure 13c. The information measure of synchronization, I(τ_max, ε)/S, is seen in Figure 13d.
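The delayed, thresholded synaptic coupling used in these sections can be sketched with a circular history buffer standing in for x_j(t - τ_c). The sign convention ε Θ(x_j(t - τ_c) - X)(V_c - x_i) is our reading of the garbled coupling equation, and the Euler integrator and resting-history initialization are our own choices:

```python
import numpy as np

a, b = 3.0, 5.0
I, S, c_z, r = 3.281, 4.0, -1.6, 0.0021
X_thr, V_c, tau_c = 0.85, 1.4, 4.0   # threshold, reverse potential, delay

def heaviside(w):
    return 1.0 if w > 0.0 else 0.0

def run(eps, n=30000, dt=0.02):
    delay = int(round(tau_c / dt))
    hist1 = np.full(delay, -1.6)     # assumed resting history before t = 0
    hist2 = np.full(delay, -1.6)
    s = np.array([-1.0, 0.0, 0.0, -1.2, 0.1, 0.0])
    xs = np.empty((n, 2))
    for i in range(n):
        x1, y1, z1, x2, y2, z2 = s
        x1d, x2d = hist1[i % delay], hist2[i % delay]   # x_j(t - tau_c)
        syn1 = eps * heaviside(x2d - X_thr) * (V_c - x1)
        syn2 = eps * heaviside(x1d - X_thr) * (V_c - x2)
        ds = np.array([
            y1 + a*x1**2 - x1**3 - z1 + I + syn1,
            1.0 - b*x1**2 - y1,
            r*(S*(x1 - c_z) - z1),
            y2 + a*x2**2 - x2**3 - z2 + I + syn2,
            1.0 - b*x2**2 - y2,
            r*(S*(x2 - c_z) - z2),
        ])
        s = s + dt * ds
        hist1[i % delay], hist2[i % delay] = s[0], s[3]  # overwrite oldest slot
        xs[i] = s[0], s[3]
    return xs

xs = run(eps=0.2)
```

The buffer slot read at step i was written delay steps earlier, which is what converts the six ordinary differential equations into the delay system discussed in the text.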
4 Discussion and Conclusions

The description of the behavior of small neural networks like CPGs requires at least an understanding of the cooperative dynamics of the minimal neural system: two synaptically coupled neurons. There are many papers focused on this problem (e.g., Sherman 1994; Sherman and Rinzel 1992; Vreeswijk et al. 1994; Skinner et al. 1994; Han et al. 1995). Nevertheless, all available results related to the cooperative behavior of model neurons deal with regular individual dynamics. On the other hand, it is known, and we confirm in our experiments, that the individual dynamics of CPG neurons generates nonregular sequences of spikes on bursts (see Fig. 1). Several questions are raised by this consideration; among them, the main one is to find what mechanisms are responsible for the regular rhythm that achieves muscle control with chaotic neurons in a CPG. The traditional view of electrically coupled chaotic generators is the following: in general, coupled chaotic oscillators generate "hyperchaos," that is, chaos with higher dimension and more complex behavior than for an individual neuron. Only for strong enough coupling can two coupled chaotic generators demonstrate the same chaotic behavior as one. This is chaotic synchronization (Afraimovich et al. 1986; Heagy et al. 1994). What happens when two chaotic neurons are coupled by synaptic coupling? We addressed this question in this paper and found that two synaptically coupled chaotic neurons order each other and demonstrate regular cooperative dynamics in a broad part of parameter space. Before we discuss the origin of this phenomenon, we have to say a few words about chaotic neural modeling. A neuron is a nonlinear, nonequilibrium system with several feedbacks. Due to these feedbacks, which open and close ionic channels in the membrane at the various phases of electrical activity, the state of the neuron corresponding to the resting potential may become unstable, and the neuron oscillates.
Such an oscillator, acting on the time scale of the characteristic period of electrical activity (about 10 Hz), may be regarded as a dynamical system for which microscopic kinetics acts only as noise. Because the number of different ionic channels is large, it is not really a surprise that real neurons are chaotic oscillators. But the real question is this: Why is it a low-dimensional oscillator? From the mathematical point of view, any generator with noise has an infinite dimension, and the estimation of the dimension depends on the level of description. Nevertheless, when the noise is small, we may distinguish chaotic behavior from noise alone and estimate the number of important degrees of freedom (the number of active
Figure 13: (a) The distance statistic for excitatory coupling between RH neurons. The RMS noise level introduced into the coupling is σ = 0.005. For ε ≥ 0.15 the model neurons are synchronized. The synchronized motion is periodic for excitatory coupling. (b) The bursting distance statistic for excitatory coupling between RH neurons. For ε ≥ 0.15 the model neurons are synchronized. The synchronized motion is periodic for excitatory coupling. (c) The time at which the distance statistic is a minimum as a function of ε. The model RH neurons are coupled by excitatory synaptic coupling. The synchronization near ε ≈ 0.15 is in phase. The synchronized motion is periodic for excitatory coupling. (d) The average mutual information at the time delay where it is a maximum, as a function of the coupling ε, for excitatory coupling between RH neurons. The phase synchronization occurs for ε ≥ 0.15. The synchronized motion is periodic for excitatory coupling. (e) The time τ_max(ε) at which the average mutual information between RH neurons coupled with excitatory synaptic coupling is a maximum. The synchronization for ε ≥ 0.15 is in phase. The synchronized motion is periodic for excitatory coupling.
Figure 14: (a) Global false nearest neighbors for the voltage time series from the Rose-Hindmarsh model with 1% noise. The percentage of false nearest neighbors reaches zero at a global embedding dimension dE = 4. (b) Local false nearest neighbors for the voltage time series from the Rose-Hindmarsh model with 1% noise. The percentage of bad predictions PK becomes independent of the local dimension and of the number of neighbors used to make local predictions at dL = 3. (c) Lyapunov exponents for the same model time series for dL = 3.
independent variables). To do this we used three different approaches: (1) using the observed data, we reconstructed the attractors and examined their variation with changes of the applied external current; (2) using global and local false nearest neighbors, we calculated the embedding dimension and the local dynamical dimension; and (3) we evaluated the Lyapunov exponents. All these results clearly tell us that we may model our CPG neuron by ordinary differential equations with three or four degrees of freedom and small noise.
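The global false-nearest-neighbors test used in approach (2) can be sketched as follows. The unit delay, the threshold ratio R_tol = 15, and the test signal are conventional choices for this algorithm, not values taken from the paper:

```python
import numpy as np

def embed(x, d, tau=1):
    """Delay vectors y(n) = [x(n), x(n+tau), ..., x(n+(d-1)tau)]."""
    n = len(x) - (d - 1) * tau
    return np.column_stack([x[i * tau:i * tau + n] for i in range(d)])

def fnn_fraction(x, d, tau=1, rtol=15.0):
    """Fraction of nearest neighbors in dimension d that fly apart when
    the (d+1)-th delay coordinate is added: 'false' neighbors."""
    n = len(x) - d * tau            # points for which x(n + d*tau) exists
    y = embed(x, d, tau)[:n]
    false = total = 0
    for i in range(n):
        dist2 = np.sum((y - y[i]) ** 2, axis=1)
        dist2[i] = np.inf           # exclude the point itself
        j = int(np.argmin(dist2))
        rd = np.sqrt(dist2[j])
        if rd < 1e-12:
            continue                # duplicate points carry no information
        extra = abs(x[i + d * tau] - x[j + d * tau])
        total += 1
        if extra / rd > rtol:
            false += 1
    return false / max(total, 1)
```

For a sine wave, a one-dimensional embedding folds the two branches of the oscillation on top of each other and produces many false neighbors, while a two-dimensional embedding unfolds them and the fraction drops to nearly zero; a bursting-spiking series is treated the same way, with the fraction reaching zero at the embedding dimension reported in Figure 14.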
Figure 15: (a, b) Time series of two different RH models. The time scale for the neuron in (a) is half that for the neuron in (b). In (a) r = 0.0021 and in (b) r = 0.0025. (c) The time course of each of the neurons when they are coupled in a nonsymmetric fashion with reciprocal inhibitory coupling. They are now synchronized and out of phase.
In order to confirm this kind of result on bursting, spiking systems, we performed the same computations on output from our model RH neurons. We introduced small gaussian white noise, with an RMS level of 1% of the model output, into the right-hand side of the RH equations. Then we analyzed these data using precisely the same algorithms utilized for the observed data. The results are displayed in Figure 14. We emphasize that the dimension of a strange attractor and the values of the Lyapunov exponents depend strongly on the presence of noise. For example, the positive Lyapunov exponent calculated for the RH model without noise is smaller than the Lyapunov exponent calculated from the voltage time series of the RH model with 1% noise. The origin of this phenomenon is the following: in the phase space of the RH model, there are strongly unstable periodic orbits in the proximity of the attractor, as is clear from the transient time series. When noise is introduced into the system, these unstable orbits become part of the attractor, which leads to an increase of the average Lyapunov exponent. The same occurs in the embedding dimension calculation.
Of course, the RH model is not complete enough for a correct description of every property of spiking-bursting neurons. For example, as formulated, it does not explain the dependence of the bursting frequency and the amplitude of the rebound potential on current that we can observe in real data (see Fig. 2): (1) the spiking frequency at the beginning of the bursts increases as the applied current decreases; (2) the amplitude of the rebound potential decreases with increasing current; and (3) the interburst frequency increases with current. These properties are important because they determine the cooperative behavior of synaptically coupled bursting neurons. It is possible to generalize the RH model (see Rabinovich et al. 1996) to model these features. However, the differences between our new model and the original RH model are not essential when we investigate the cooperative behavior of two identical chaotic neuron models. These simulations do not provide us with quantitative statements but focus on qualitative behavior as the strength of the coupling is modified. It is not pharmacologically easy to adjust the strength of the coupling gradually. Good experimental support of this mathematical approach can be found by matching the qualitative behavior (in phase, antiphase, and so on) and the characterization of chaotic motion seen in a few experimental points against the numerical calculations. In our opinion, being able to make qualitative predictions justifies this work. A natural objection to these predictions is that the two model neurons are identical, as is the reciprocal coupling. This raises the question: Is the predicted qualitative behavior maintained when this symmetry is broken? To answer this question requires an increase in the number of parameters to be explored.
We have not made a systematic survey of the parameters as part of the present work, but we have explored an example of two RH model neurons with different internal parameters that are coupled reciprocally by inhibitory connections of different strengths. The parameters of the RH neurons are set so that their time scales differ by about a factor of two, and the coupling strengths were ε1 = 0.6 and ε2 = 1.4 for the two synapses. In Figure 15a and 15b we show the time courses of the individual model neurons, and it is clear that they act rather differently. When they are coupled as indicated, they synchronize, out of phase as it happens (see Fig. 15c). This provides clear evidence that nonidentical neurons with nonsymmetric coupling can synchronize. The synchronization explored at some depth in this paper for similar neurons symmetrically coupled can now be seen as one case of a general phenomenon. The robustness of the phenomenon is very important if we are to regard it as the mechanism for coherent cooperative behavior in networks of neurons that may well differ individually.
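A numerical setup of this kind can be sketched as follows. Only the coupling strengths ε1 = 0.6 and ε2 = 1.4 come from the text; the graded sigmoidal inhibitory synapse, the HR parameters, and the factor-of-two difference in the slow time scale (r vs. 2r) are assumptions made for illustration:

```python
import numpy as np

def coupled_hr(eps1=0.6, eps2=1.4, T=1000.0, dt=0.005, I=3.0):
    """Two Hindmarsh-Rose (RH) neurons coupled reciprocally by inhibitory
    synapses of unequal strength.  The synapse model below is a generic
    textbook form, not necessarily the one used in the paper."""
    s, x0 = 4.0, -1.6                     # assumed HR parameters
    E_syn, theta, k = -2.0, -0.25, 10.0   # assumed reversal, threshold, gain

    def i_syn(x_pre, x_post, eps):
        # graded inhibitory current gated by the presynaptic membrane variable
        return -eps * (x_post - E_syn) / (1.0 + np.exp(-k * (x_pre - theta)))

    def f(st, i_in, r):
        x, y, z = st
        return np.array([y + 3*x**2 - x**3 - z + I + i_in,
                         1 - 5*x**2 - y,
                         r * (s * (x - x0) - z)])

    n = int(T / dt)
    s1 = np.array([-1.0, 0.0, 2.0])
    s2 = np.array([-0.5, 0.2, 2.0])
    v1, v2 = np.empty(n), np.empty(n)
    for i in range(n):
        c1 = i_syn(s2[0], s1[0], eps1)    # inhibition from neuron 2 onto 1
        c2 = i_syn(s1[0], s2[0], eps2)    # inhibition from neuron 1 onto 2
        s1 = s1 + f(s1, c1, 0.006) * dt   # neuron 1: slow time scale r
        s2 = s2 + f(s2, c2, 0.012) * dt   # neuron 2: twice as fast (assumed)
        v1[i], v2[i] = s1[0], s2[0]
    return v1, v2

v1, v2 = coupled_hr()
```

The two voltage traces can then be compared, as in Figure 15, to see whether out-of-phase synchronization survives the broken symmetry.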
1600
Henry D. I. Abarbanel et al.
In this Appendix we show that the stability condition (equation 3.8) is satisfied when the coupling is strong enough. By doing this we prove that all attractors in the six-dimensional phase space of the dynamical system (equation 3.1) are located within some limited domain centered around the origin of the phase space, and therefore the right-hand side of this condition (equation 3.8) for every attractor of the system is less than some finite value. We consider a positive function that has one minimum at the origin of phase space
The time derivative of this for the system (equation 3.1) is
where
One can see from equation A.3 that F(x) is positive when |x| < ε and negative for |x| ≥ ε. It follows from equation A.2 that in regions far enough from the origin the time derivative of this function is negative, and therefore all trajectories beginning in such regions are attracted to a domain centered at the origin of phase space.

Acknowledgments

The work of H. D. I. Abarbanel, N. F. Rulkov, and M. I. Rabinovich was supported in part by the U.S. Department of Energy, Office of Basic Energy Sciences, Division of Engineering and Geosciences, under contracts DE-FG03-90ER14138 and DE-FG03-96ER14592, and in part by the Office of Naval Research (contract N00014-91-C-0125). R. Huerta is supported by a Spanish Government (M.E.C.) Fellowship. A. I. Selverston and P. F. Rowat are supported by National Science Foundation Grant IBN9122712.
References

Abarbanel, H. D. I. 1995. Analysis of Observed Chaotic Data. Springer-Verlag, New York.
Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., and Tsimring, L. 1993. The analysis of observed chaotic data in physical systems. Reviews of Modern Physics 65, 1331-1392.
Abarbanel, H. D. I., and Kennel, M. B. 1993. Local false nearest neighbors and dynamical dimensions from observed chaotic data. Phys. Rev. E 47, 3057-3068.
Adams, W. B., and Benson, J. 1985. The generation and modulation of endogenous rhythmicity in the Aplysia bursting pacemaker neuron R15. Prog. Biophys. Mol. Biol. 46, 1-49.
Afraimovich, V. S., Verichev, N. N., and Rabinovich, M. I. 1986. General synchronization. Izv. VUZ. Radiofiz. 29, 795-803.
Aihara, K., and Matsumoto, G. 1986. Chaotic oscillations and bifurcations in squid giant axons. In Chaos, pp. 257-269. Manchester University Press, Manchester.
Fraser, A. M., and Swinney, H. L. 1986. Independent coordinates for strange attractors. Phys. Rev. A 33, 1134.
Fujisaka, H., and Yamada, T. 1983. Stability theory of synchronized motion in coupled-oscillator systems. Prog. Theor. Phys. 69, 32-47.
Gallager, R. G. 1968. Information Theory and Reliable Communication. John Wiley, New York.
Guckenheimer, J., and Holmes, P. 1983. Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields. Springer-Verlag, New York.
Han, S. K., Kurrer, C., and Kuramoto, Y. 1995. Dephasing and bursting in coupled neural oscillators. Phys. Rev. Lett. 75, 3190.
Hayashi, H., and Ishizuka, S. 1992. Chaotic nature of bursting discharges in the Onchidium pacemaker neuron. J. Theor. Biol. 156, 269-291.
Hindmarsh, J. L., and Rose, R. M. 1984. A model of neuronal bursting using three coupled first order differential equations. Proc. R. Soc. Lond. B 221, 87-102.
Heagy, J. F., Carroll, T. L., and Pecora, L. M. 1994. Experimental and numerical evidence for riddled basins in coupled chaotic systems. Phys. Rev. Lett. 73, 3528-3531.
Hodgkin, A. L., and Huxley, A. F. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London) 117, 500-544.
Kennel, M. B., Brown, R., and Abarbanel, H. D. I. 1992. Determining minimum embedding dimension using a geometrical construction. Phys. Rev. A 45, 3403-3411.
Mpitsos, G. J., Burton, R. M., Creech, H. C., and Soinla, S. S. 1988. Evidence for chaos in spike trains of neurons that generate rhythmic motor patterns. Brain Res. Bull. 21, 529-538.
Oseledec, V. I. 1968. A multiplicative ergodic theorem. Lyapunov characteristic numbers for dynamical systems. Trudy Mosk. Mat. Obsc. 19, 197.
Rabinovich, M. I., Huerta, R., Abarbanel, H. D. I., and Selverston, A. I. 1996. A minimal model for chaotic bursting of the LP neuron in lobster. Submitted to Proc. Natl. Acad. Sci.
Rinzel, J., and Ermentrout, G. B. 1989. Analysis of neural excitability and oscillations. In Methods in Neuronal Modeling, C. Koch and I. Segev, eds., pp. 135-160. MIT Press, Cambridge, MA.
Rose, R. M., and Hindmarsh, J. L. 1985. A model of a thalamic neuron. Proc. R. Soc. Lond. B 225, 161-193.
Rulkov, N. F., Volkovskii, A. R., Rodriguez-Lozano, A., Del Rio, E., and Velarde, M. G. 1992. Mutual synchronization of chaotic self-oscillators with dissipative coupling. Int. J. Bif. Chaos 2, 669-676.
Selverston, A. I. 1996. Experiments on neurons within the STG central pattern generator. Unpublished.
Sherman, A. 1994. Anti-phase, asymmetric and aperiodic oscillations in excitable cells-I. Coupled bursters. Bull. Math. Biol. 56, 811-833.
Sherman, A., and Rinzel, J. 1992. Rhythmogenic effects of weak electrotonic coupling in neuronal models. Proc. Natl. Acad. Sci. U.S.A. 89, 2471.
Skinner, F. K., Kopell, N., and Marder, E. 1994. Mechanisms for oscillation and frequency control in reciprocally inhibitory model neural networks. J. Comput. Neurosci. 1, 69.
Takens, F. 1981. In Dynamical Systems and Turbulence, Warwick 1980, Lecture Notes in Mathematics 898, D. Rand and L. S. Young, eds., p. 366. Springer, Berlin.
Vreeswijk, C., Abbott, L. F., and Ermentrout, G. B. 1994. When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci. 1, 313-321.
Received December 6, 1995; accepted March 26, 1996.
NOTE
Communicated by Günther Palm
On the Capacity of Threshold Adalines with Limited-Precision Weights Maryhelen Stevenson Shaheedul Huq Department of Electrical Engineering, University of New Brunswick, Fredericton, New Brunswick, Canada E3B 5A3
The effect of limited-precision weights on the capacity of the threshold Adaline is examined. In particular, an experimental technique is used to determine the capacity of the threshold Adaline for several different levels of weight precision. The results provide some insight into the manner in which the capacity of the threshold Adaline increases as the number of bits per weight increases. As might be expected, the growth in capacity due to an additional bit per weight is greatest when the original number of bits per weight is small, and decreases as the original number of bits per weight increases.

1 Introduction
The threshold Adaline consists of an adaptive linear combiner cascaded with a hard-limiting quantizer yielding a ±1 binary-valued output. An input pattern, X ∈ R^n, is mapped to the +1 category if the dot product of X with the Adaline's weight vector, W ∈ R^n, is greater than 0 and is mapped to the -1 category if the same dot product is less than 0. Capacity is a measure of functionality and the ability to store information. The capacity of an Adaline refers to the expected maximum number of patterns with random class assignment¹ such that the threshold Adaline will be able to dichotomize the patterns as dictated by their class assignments. It is well known that the capacity of a threshold Adaline with real-valued weights is twice the number of weights (Cover 1965); its normalized capacity, that is, its capacity normalized by the number of weights, is two. As the number of bits used to store each of the Adaline weights decreases, so does the capacity of the Adaline. In this paper, we investigate the effects of limited-precision weights on the capacity of the threshold Adaline in the limit as the number of inputs (and hence weights) grows large. For convenience, we will use the phrase "m-ary Adaline"

¹Each pattern is assigned to one of two classes, with probability of assignment to either class being 1/2.
Neural Computation 8, 1603-1610 (1996) © 1996 Massachusetts Institute of Technology
Maryhelen Stevenson and Shaheedul Huq
1604
to denote a threshold Adaline whose weights are restricted to the m values in the set W_m, where W_m = {0, ±1, ..., ±(m − 1)/2} for odd m and W_m = {±1, ..., ±m/2} for even m. Using an experimental procedure first proposed by Krauth and Opper (1989), we determine the capacity of the 3-ary, 5-ary, and 7-ary Adalines. These results, together with the experimental² result of Krauth and Opper (1989) for the case of the 2-ary Adaline and the well-known result for the case of real-valued weights (Cover 1965), provide insight into the nature of the relationship between the normalized capacity of the Adaline and the number of bits used to represent each weight. The paper is organized as follows. The theoretical framework and notation are discussed in Section 2. The experimental procedure is discussed in Section 3. The results are presented in Section 4. Section 5 concludes the paper.

2 Theoretical Framework and Notation
Let T(P, n) denote a training set that consists of P n-component patterns {X^(1), ..., X^(P)} together with their corresponding ±1 binary-valued desired responses, {d^(1), ..., d^(P)}:

T(P, n) = {(X^(p), d^(p))}, p = 1, ..., P.

A threshold Adaline with weights restricted to values in the set W can classify all patterns in the training set as desired, provided there exists a weight vector W ∈ W^n such that U^(p) · W > 0 for all 1 ≤ p ≤ P, where U^(p) = d^(p) X^(p). The margin size, κ, associated with an attempt to classify all patterns in the training set T(P, n) as desired, using a threshold Adaline with weight vector W, is defined as:

κ(T(P, n), W) = min_{1 ≤ p ≤ P} (U^(p) · W) / ||W||.

Note that the quantity U^(p) · W / ||W|| is the "distance" of the pth pattern from the separating surface³ associated with W, where a negative "distance" implies that the pattern is classified incorrectly (i.e., the pattern lies on the wrong side of the separating surface). The maximum attainable margin size, κ_max, measures the ease with which a dichotomy, T(P, n), can be implemented (using a threshold Adaline with weights restricted to values in W); it is defined as follows:

κ_max(T(P, n)) = max_{W ∈ W^n} κ(T(P, n), W).

²Krauth and Mezard (1989) were able to obtain the same result using a theoretical approach involving methods of statistical mechanics.
³The separating surface associated with W is the hyperplane orthogonal to W, which passes through the origin in R^n.
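The margin-size definitions just given translate directly into code. A minimal sketch follows; the brute-force search over all weight vectors mirrors the definition of the maximum attainable margin and is feasible only for tiny n, since the number of candidate weight vectors grows exponentially in n:

```python
import numpy as np
from itertools import product

def margin(X, d, W):
    """kappa(T(P,n), W): minimum signed distance of the patterns from the
    separating hyperplane (through the origin) defined by W."""
    U = d[:, None] * X                  # U^(p) = d^(p) X^(p)
    return np.min(U @ W) / np.linalg.norm(W)

def max_margin(X, d, weight_values, n):
    """Brute-force kappa_max over all weight vectors in W^n."""
    best = -np.inf
    for W in product(weight_values, repeat=n):
        W = np.array(W, dtype=float)
        if np.any(W):                   # skip the all-zero weight vector
            best = max(best, margin(X, d, W))
    return best

# toy example: 4 random patterns, n = 3 inputs, 3-ary weights {-1, 0, +1}
rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))
d = rng.choice([-1.0, 1.0], size=4)
k = max_margin(X, d, [-1, 0, 1], 3)    # negative if dichotomy unrealizable
```

A negative value of `k` corresponds to a dichotomy that no 3-ary Adaline of this size can implement.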
On the Capacity of Threshold Adalines with Limited-Precision Weights
1605
A large positive value of κ_max implies that the dichotomy is easily realized with a threshold Adaline and suggests that there are many choices of W ∈ W^n that dichotomize the training set patterns as desired. Smaller positive values of κ_max suggest a smaller number of weight vectors that can implement the desired dichotomy; negative values indicate that the desired dichotomy is not realizable by a threshold Adaline with weights restricted to values in W. In a sense, κ_max is a measure of the Adaline's unused storage capacity. The bigger κ_max is, the more likely it is that the threshold Adaline will be able to store additional information without disturbing the information, T(P, n), already stored. As additional information is stored, the maximum attainable margin size decreases.

The above discussion focuses on measuring κ_max for a particular training set. It is somewhat more interesting to evaluate the expected maximum attainable margin size with respect to an ensemble of training sets. To do this, let the random variable K_max(W, α, n) denote the maximum margin size that can be attained when using a threshold Adaline, with weights restricted to values in W, to store the information T(P, n), where T(P, n) denotes a randomly selected training set that consists of P = αn n-component patterns together with their class labels. (Note that α = P/n is the normalized training set size.) We denote by E[K_max(W, α, n)] the expected value of K_max with respect to the probability distribution governing the selection of training sets and use κ_max(W, α) to denote the limiting value of E[K_max(W, α, n)] as n grows large:

κ_max(W, α) = lim_{n → ∞} E[K_max(W, α, n)].

The value of α for which κ_max(W, α) = 0 can be interpreted as the normalized capacity (in the limit as the number of inputs grows large) of a threshold Adaline with weights restricted to values in W. This interpretation forms the basis of the experimental procedure described in the next section.

3 Experimental Procedure
The experimental procedure for determining the capacity of the m-ary Adaline consists of two parts. Part I of the procedure results in an approximation for the limit (as n → ∞) of the maximum attainable margin size, κ_max(W_m, α), for several different values of α; Part II of the procedure results in an approximation of the normalized training set size, α, for which κ_max(W_m, α) = 0, thus yielding an approximation for the normalized capacity of the m-ary threshold Adaline. The details of the procedure are discussed in the following subsections.

3.1 Details of Part I. The goal of the first part of the experiment is to approximate κ_max(W_m, α) for several values of α. For a given α,
3.2 Details of Part II. For each set of allowable weight values, W_m, the approximations of κ_max(W_m, α) found in Part I were plotted versus α, and a second-order polynomial was fit to the plotted points. The normalized capacity of the m-ary threshold Adaline was then found as the root of the "fitted" polynomial in the neighborhood of the plotted points.

⁴For each α, several values of n are chosen such that P = αn is an integer.
⁵The n components of each of the P input patterns were independently and identically distributed according to a zero-mean unit-variance gaussian distribution.
⁶The desired responses are randomly assigned and are equally likely to have a value of ±1.
⁷The number of n-component m-ary weight vectors, m^n, grows exponentially in n.
⁸The variance of the estimate of E[K_max(W_m, α, n)] was calculated as the variance of K_max(W_m, α, n) divided by the number of training sets, N, used in estimating E[K_max(W_m, α, n)]. The variance of K_max(W_m, α, n) was approximated as the average squared difference between the maximum attainable margin size and the average maximum attainable margin size for the N sets.
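The fit-and-root step of Part II can be sketched as follows, using hypothetical (α, κ_max) points with the general shape seen in Figure 2; the numbers below are illustrative, not the paper's measurements:

```python
import numpy as np

# hypothetical margin estimates: kappa_max decreases through zero as the
# normalized training set size alpha grows (illustrative values only)
alpha = np.array([0.5, 0.75, 1.0, 1.25, 1.5])
kappa = np.array([0.30, 0.18, 0.07, -0.04, -0.14])

# fit a second-order polynomial and take its root near the plotted points
coeffs = np.polyfit(alpha, kappa, deg=2)
roots = np.roots(coeffs)
real_roots = np.real(roots[np.isreal(roots)])
in_range = real_roots[(real_roots > alpha.min()) & (real_roots < alpha.max())]
capacity = in_range[0]   # estimated normalized capacity
```

The root lying inside the range of the plotted points is taken as the capacity estimate; the quadratic's second root falls far outside the data and is discarded.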
4 Results
The results of Part I are shown in Figure 1 for the 3-ary, 5-ary, and 7-ary Adalines. The error bars above and below each estimate indicate ±3 standard deviations of the estimate (the procedure used to determine the variance of each estimate is discussed in footnote 8). Figure 2 illustrates the second-order polynomials found for the three cases when the weights of the threshold Adaline were restricted to values in the sets W3, W5, and W7. As can be seen from the figure, the normalized capacities of the 3-ary, 5-ary, and 7-ary Adalines were found to be 1.16, 1.50, and 1.65, respectively. The normalized capacities found in the previous section, together with the normalized capacity of the 2-ary Adaline (found by Krauth and Opper 1989 to be 0.82), are plotted versus the number of bits per weight in Figure 3. Note that for the m-ary Adaline, the number of bits per weight, b, is given by b = log2 m. The function 2 tanh(0.42b) has been superimposed on this figure to illustrate that this function provides a fairly good envelope for the data points. It also reflects the fact that the normalized capacity increases to two as the number of bits per weight becomes arbitrarily large. It is, however, not meant to suggest that the capacity should be thought of as a continuous function of b; indeed, it is unclear what, if any, meaning to associate with those values of b that are not equal to log2 m for some integer m. As suggested by a reviewer, there may be other functions, defined on the integers, that fit the data points equally well.

5 Discussion and Concluding Remarks

In this paper, we have experimentally determined the normalized capacity of a threshold Adaline for various levels of weight precision. The work was motivated by a desire to understand better a possible tradeoff between small networks with high-precision weights and bigger networks with low-precision weights.
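The quality of the 2 tanh(0.42b) envelope can be checked directly against the four reported capacities:

```python
import math

# reported normalized capacities: m = 2 from Krauth and Opper (1989),
# m = 3, 5, 7 from this paper; bits per weight b = log2(m)
measured = {2: 0.82, 3: 1.16, 5: 1.50, 7: 1.65}

envelope = {m: 2 * math.tanh(0.42 * math.log2(m)) for m in measured}
# the envelope tends to 2, the real-valued-weight capacity, as b grows
limit = 2 * math.tanh(0.42 * 20)
```

Each envelope value lands within a few hundredths of the corresponding measured capacity.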
The results of this work provide some insight regarding the nature of the relationship between the normalized capacity of the Adaline and the number of bits used to represent each weight. In particular, the results clearly indicate that the growth in normalized capacity due to an additional bit per weight is greatest when the original number of bits per weight is small; in other words, the normalized capacity per bit is greatest for a small number of bits per weight. This suggests that a network with many low-precision weights may, in
[Figure 1 appears here: panels (a) 3-ary Adaline and (b) 5-ary Adaline, plotting margin size against the reciprocal of the number of weights for normalized training set sizes including α = 2/3, 4/3, 3/2, 5/3, and 2.]

Figure 1: The average maximum margin size attainable by a limited-precision Adaline versus the reciprocal of the number of weights, 1/n, for several values of normalized training set size, α. In all cases, the error bars denote ±3 standard deviations.
[Figure 2 appears here.]

Figure 2: The expected maximum attainable margin size for large n, versus the normalized training set size, α = P/n, for the 3-ary, 5-ary, and 7-ary Adalines.
fact, be able to store more information than a network with relatively fewer high-precision weights (where the total number of bits required to store all the weights of the network is assumed to be the same for both networks; that is, the product of the number of weights with the number of bits per weight is a constant for both the networks). This possibility will be explored in future work.
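The tradeoff suggested here can be illustrated with a quick calculation under a hypothetical fixed bit budget, using the measured normalized capacities; the budget B = 96 bits is an arbitrary example:

```python
import math

# measured normalized capacities c(b) at b = log2(m) bits per weight
c = {1.0: 0.82, math.log2(3): 1.16, math.log2(5): 1.50, math.log2(7): 1.65}

B = 96.0   # hypothetical total number of weight bits
# n = B/b weights at b bits each store about (B/b) * c(b) patterns
total = {b: (B / b) * cap for b, cap in c.items()}
```

With these measured capacities, total storage decreases monotonically as bits per weight increase, consistent with the suggestion that many low-precision weights can store more than fewer high-precision ones.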
Acknowledgments The authors would like to thank the reviewer for helpful suggestions. This work was supported by the Natural Sciences and Engineering Research Council of Canada.
1612
Yu-Dong Zhu and Ning Qian
for the observed disparity tuning of these cells remains controversial. Early physiological studies suggested that disparity tuning was created by a retinal positional shift between the left and right RFs of a binocular cell (Bishop et al. 1971; Maske et al. 1984). The shapes of the two RF profiles of a given cell were usually assumed to be identical. Although such a position-shift-based RF description has an intuitive appeal, the main limitation of these early studies is that cells' RFs were usually mapped manually, and the results were therefore only qualitative. Quantitative mapping of binocular RFs was performed relatively recently by Freeman and collaborators (Ohzawa et al. 1990; DeAngelis et al. 1991) using the automated reverse correlation technique (Jones and Palmer 1987). These studies indicate that binocular RFs of a simple cell in the cat primary visual cortex can be well described by two Gabor functions, one for each eye. (A Gabor function is simply a product of a gaussian envelope and a sinusoid.) It was found that the left and right RFs of a cell often have somewhat different shapes and that this shape difference can easily be accounted for by letting the two Gabor functions have the same gaussian envelopes (on the corresponding left and right retinal locations) but different phase parameters in the sinusoids. The phase parameter difference creates a shift between the two sinusoids within their registered gaussian windows, and this shift generates disparity sensitivity for the cell. The phase-parameter-based RF description of Freeman et al. has recently been questioned by Wagner and Frost (1993) based on their discovery of a so-called characteristic disparity (CD) in some cells recorded from the visual Wulst of the barn owl (see also Pettigrew 1993). For a given cell, Wagner and Frost first obtained its disparity tuning curve using spatial noise stimuli. There is usually a main peak in the tuning curve flanked by smaller side peaks.
They then recorded from the same cell with sinusoidal gratings of various spatial frequencies and obtained a family of disparity tuning curves, one for each grating frequency. Each of these grating tuning curves is periodic, with a period equal to that of the stimuli. The interesting finding is that for some cells, one set of peaks of the grating tuning curves and the main peak of the noise tuning curve tend to align approximately at a certain disparity. They called this disparity the characteristic disparity of the cell (cf. Fig. 8). They concluded that their data are consistent with the traditional position-shift type of RF organization (termed the CD model in their paper) but not with the phase-parameter type of RF model proposed by Freeman and coworkers. To resolve this controversy, it is important to note that one cannot predict a cell's disparity tuning curves to a given set of stimuli with only the knowledge of the cell's RF profiles. The other crucial piece of information is a procedure that determines a cell's response as a function of its RF profiles and the visual pattern falling on the RFs. We will call this procedure the response model of a cell. Obviously a given RF model can be combined with different response models to produce different
Binocular Receptive Field Models and Disparity Tuning
1613
disparity tuning curves. (For example, one response model might add the contributions from the two RFs of a binocular cell, while another might multiply the two RF contributions.) Unfortunately, Wagner and Frost did not specify a response model when stating their conclusion. In addition, they did not model the shapes of the disparity tuning curves they recorded. We therefore decided to investigate how a cell's disparity tuning to a stimulus depends on its underlying RF model and response model, and to reexamine the implications of Wagner and Frost's CD data. We will show that with a physiologically determined response model for simple cells, neither the position-shift-based RF description nor the phase-parameter-based RF description has a CD. CDs can be defined only at the level of complex cells. We will suggest methods for correctly distinguishing the two types of RF models. We will also consider the possibility of a hybrid model and demonstrate how to determine the relative contributions of position shifts and phase parameters to the disparity tuning of real cells. Some preliminary results have been reported previously in abstract form (Qian 1994b).

2 Analyses and Simulations
Since we are concerned only with horizontal disparity, we will use one-dimensional RF profiles in our analyses and simulations. Freeman and coworkers found that binocular spatial RFs of a typical simple cell can be described by two Gabor functions with the same gaussian envelopes but different sinusoidal modulations. Mathematically, the left and right RFs of a simple cell centered at x = 0 are given by the following equations:

f_l(x) = exp(−x²/2σ²) cos(ω₀x + φ_l)   (2.1)
f_r(x) = exp(−x²/2σ²) cos(ω₀x + φ_r)   (2.2)

where σ and ω₀ are the gaussian width and the preferred (angular) spatial frequency of the RFs, and φ_l and φ_r are the left and right phase parameters. Intuitively, the gaussian terms in the Gabor functions determine the overall sizes and locations of the RFs, and the sinusoidal terms determine the excitatory and inhibitory subregions within the RFs. The difference between the two phase parameters generates a relative displacement between the sinusoidal modulations as well as a shape difference between the two RF profiles (see Fig. 1). We will show that this displacement is related to the preferred disparity at the level of complex cells. In contrast, the traditional position-shift type of spatial RF model assumes an overall displacement (of both the envelopes and the modulations) between the left and right RF profiles. Under this model, the two
where o and wo are the gaussian width and the preferred (angular) spatial frequency of the RFs and $1 and are the left and right phase parameters. Intuitively, the gaussian terms in the Gabor functions determine the overall sizes and locations of the RFs, and the sinusoidal terms determine the excitatory and inhibitory subregions within the RFs. The difference between the two phase parameters generates a relative displacement between the sinusoidal modulations as well as a shape difference between the two RF profiles (see Fig. 1).We will show that this displacement will be related to the preferred disparity at the level of complex cells. In contrast, the traditional position-shift type of spatial RF model assumes an overall displacement (for both the envelopes and the modulations) between the left and right RF profiles. Under this model, the two
(a) Phase-parameter based RF model Left RF
Right RF
(b) Position-shift based RF model Left RF
Right RF
Figure 1: Profiles of the two types of binocular RF models considered in this paper. Only the horizontal dimension is considered. The dot over the vertical axis marks the peak position of the gaussian envelopes. (a) The phase-parameter-based RF model proposed by Freeman et al. (Freeman and Ohzawa 1990; Ohzawa et al. 1990; DeAngelis et al. 1991). The left and right RF profiles (solid lines) of a binocular simple cell are assumed to be described by two Gabor functions, one for each eye, with the same gaussian envelopes (dotted lines) but two different phase parameters in their sinusoidal modulations. The mathematical descriptions of these RFs are given by equations 2.1 and 2.2. The difference between the left and right phase parameters (Δφ = φ_l − φ_r) generates a relative shift between the left and right sinusoidal modulations within their registered gaussian envelopes. It also creates a shape difference between the two RF profiles. (b) The traditional position-shift-based RF model favored by Wagner and Frost (1993). The left and right RFs have identical shapes but an overall horizontal shift d between them (i.e., the same amount of shift applies to both the gaussian envelope and the sinusoidal modulation). The mathematical descriptions of these RFs are given by equations 2.3 and 2.4. Simple and complex cells can be built with either type of RF model (see Section 2).
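The two RF parameterizations of Figure 1 can be generated in a few lines of code; the values of σ, ω₀, the phase difference, and the shift d below are illustrative, not fits to data:

```python
import numpy as np

def gabor(x, sigma, omega0, phi, center=0.0):
    """1-D Gabor: gaussian envelope times a sinusoid (cf. equations 2.1-2.4)."""
    xc = x - center
    return np.exp(-xc**2 / (2 * sigma**2)) * np.cos(omega0 * xc + phi)

x = np.linspace(-2.0, 2.0, 401)           # 0.01 sample spacing
sigma, omega0 = 0.5, 2 * np.pi * 1.5      # illustrative width and frequency

# (a) phase-parameter model: shared envelope, different phases phi_l, phi_r
left_phase  = gabor(x, sigma, omega0, phi=0.0)
right_phase = gabor(x, sigma, omega0, phi=np.pi / 4)

# (b) position-shift model: identical shape, overall horizontal shift d
d = 0.1
left_pos  = gabor(x, sigma, omega0, phi=0.0)
right_pos = gabor(x, sigma, omega0, phi=0.0, center=d)
```

In case (b) the right profile is an exact translate of the left one; in case (a) the two profiles have visibly different shapes under the same envelope.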
Binocular Receptive Field Models and Disparity Tuning
RF profiles of a binocular simple cell can be written as:

f_l(x) = exp(−x²/2σ²) cos(ω₀x + φ)   (2.3)

f_r(x) = exp(−(x − d)²/2σ²) cos(ω₀(x − d) + φ)   (2.4)
where σ and ω₀ are again the gaussian width and the preferred spatial frequency of the RFs. φ is a common phase parameter included for generality; it is always the same for the left and right RFs and therefore not related to disparity sensitivity. The two RF profiles have identical shapes but are shifted relative to each other by the distance d. These two different types of RF models are depicted in Figure 1. In addition to a spatial RF model, we also need a response model in order to calculate a cell's disparity tuning to a stimulus. The most quantitative response models to date for binocular cells also come from the physiological studies by Freeman et al. (Freeman and Ohzawa 1990; Ohzawa et al. 1990; DeAngelis et al. 1991; see also Ferster 1981 for an earlier study with quantitative modeling). They showed that to a good approximation, a binocular simple cell's response can be determined by first computing the correlation between the spatial RF profile and the visual pattern falling on it for each eye, and then adding the two correlations from the two eyes. They further showed that a binocular complex cell's response can be modeled by summing the squared responses of a quadrature pair of binocular simple cells. We will use these response models in our calculations.

2.1 Disparity Tuning of Simple Cells. We first consider simple cell disparity tuning curves. According to the physiological studies by Freeman et al., the response of a simple cell is given by:
r_s = ∫ f_l(x) I_l(x) dx + ∫ f_r(x) I_r(x) dx   (2.5)

where f_l(x) and f_r(x) are the left and right RF profiles, and I_l(x) and I_r(x) are the left and right retinal images of the stimulus. (For a layer of simple cells with identical properties but different RF locations, equation 2.5 should be written as a convolution.) Note that such a linear response model does not take into account the effect of contrast saturation. However, the model is good enough for our purpose since we are mainly interested in the peak locations of disparity tuning curves (see Section 3). Also note that the temporal dimension of the RFs and stimulus is not included because we have shown elsewhere that it does not affect the disparity tuning of a cell unless an interocular time delay is introduced (Qian and Andersen 1996).
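The two RF descriptions and the linear binocular response model can be sketched in a few lines of code. This is a minimal illustration, not the paper's simulation code; all numerical values and function names are illustrative.

```python
import numpy as np

def gabor(x, sigma, w0, phase, shift=0.0):
    # Gaussian envelope times sinusoidal modulation; `shift` displaces
    # both envelope and modulation (the position-shift description).
    xs = x - shift
    return np.exp(-xs**2 / (2.0 * sigma**2)) * np.cos(w0 * xs + phase)

def simple_response(fl, fr, Il, Ir, dx):
    # Discrete form of equation 2.5: correlate each eye's RF profile
    # with its retinal image, then add the two correlations.
    return np.sum(fl * Il) * dx + np.sum(fr * Ir) * dx

x = np.linspace(-8.0, 8.0, 257)
dx = x[1] - x[0]
sigma, w0 = 2.0, 2.0 * np.pi * 0.25   # sigma = 2 deg, w0/2pi = 0.25 c/deg

# Phase-parameter description (eqs. 2.1-2.2): shared envelope, phases differ.
fl_p = gabor(x, sigma, w0, 0.0)
fr_p = gabor(x, sigma, w0, -np.pi / 2.0)

# Position-shift description (eqs. 2.3-2.4): identical shapes, shifted by d.
d = 1.0
fl_s = gabor(x, sigma, w0, 0.0)
fr_s = gabor(x, sigma, w0, 0.0, shift=d)

stimulus = np.cos(w0 * x)             # an arbitrary monocular pattern
r = simple_response(fl_s, fr_s, stimulus, stimulus, dx)
```

Note how the two models differ only in where the displacement enters: the phase pair shares one envelope, while the shifted pair is a pure translation.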
With equation 2.5, the responses of simple cells can be calculated for any given RF profiles and stimuli, either numerically or analytically. The details of our mathematical analysis are given in the Appendix. The main conclusion is that simple cells with either the phase-parameter-based RF description or the position-shift-based RF description cannot have a CD, because their disparity tuning to any stimuli strongly depends on the Fourier phases (i.e., the phases of the Fourier transforms) of the stimuli. Two independently generated spatial noise patterns have different Fourier phases even when they contain the same disparity and have identical overall textural appearance. Consequently, the disparity tuning curves obtained with two sets of independently generated spatial noise patterns will have different peak locations. This result was confirmed previously through computer simulations for simple cells with the phase-parameter-based RF model (Qian 1994a). Similar simulation results for a simple cell with the position-shift-based RF model are shown in Figure 2. Here the disparity tuning curves of a simple cell to two sets of independently generated random dot patterns are plotted. It is clear from the figure that the peak locations of the two tuning curves from the same simple cell are very different. Similarly, the disparity tuning curves of a simple cell to two sets of sinusoidal gratings of the same frequency but positioned differently with respect to the cell's RFs will also have different peak locations (results not shown). Simple cells therefore do not have well-defined disparity tuning curves (Ohzawa et al. 1990; Qian 1994a). Since the CD is defined according to the peak locations of the noise and grating disparity tuning curves, we conclude that simple cells do not have a CD. The dependence of simple cell responses on stimulus Fourier phases can be understood intuitively by considering the disparity tuning of a simple cell to a vertical line.
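This Fourier-phase dependence can be demonstrated with a toy 1D simulation of equation 2.5. All pixel-scale parameters below are illustrative, not the paper's simulation settings.

```python
import numpy as np

def simple_tuning(fl, fr, pattern, disparities):
    # Disparity tuning of a simple cell (discrete eq. 2.5): the right
    # image is the left image circularly shifted by D pixels.
    return np.array([np.sum(fl * pattern) + np.sum(fr * np.roll(pattern, D))
                     for D in disparities])

n = 256
x = np.arange(n) - n // 2
sigma, w0, d = 8.0, 2.0 * np.pi / 16.0, 4        # illustrative pixel units
fl = np.exp(-x**2 / (2.0 * sigma**2)) * np.cos(w0 * x)
fr = np.exp(-(x - d)**2 / (2.0 * sigma**2)) * np.cos(w0 * (x - d))

rng = np.random.default_rng(7)
disparities = np.arange(-12, 13)
# Two independently generated 50%-density dot patterns: same statistics,
# same disparity content, but different Fourier phases.
curve1 = simple_tuning(fl, fr, rng.choice([0.0, 1.0], n), disparities)
curve2 = simple_tuning(fl, fr, rng.choice([0.0, 1.0], n), disparities)
# The two tuning curves (and generally their peak locations) differ.
```

Running this with any seed gives two visibly different tuning curves from the same cell, the signature of Fourier-phase dependence described in the text.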
The Fourier phase of the line is simply proportional to the position of the line. For a given disparity of the line, the response of the simple cell is not fixed; it also depends on the position (or, equivalently, the Fourier phase) of the line in the RFs, since the cell has separate excitatory and inhibitory subregions within its RFs. A given disparity of the line may evoke a strong response at one line position because it happens to fall on the excitatory subregions of both RFs, but may evoke a much weaker response at a different line position because it may now happen to stimulate some inhibitory part(s) of the RFs. Therefore, disparity tuning curves of simple cells to line stimuli are Fourier-phase dependent. Similar arguments can be made for the cases with spatial noise patterns and sinusoidal gratings.

2.2 Disparity Tuning of Complex Cells. We now turn to complex cell responses. Freeman and coworkers also proposed a response model for complex cells based on their quantitative physiological experiments. They found that the response of a binocular complex cell can be simulated by summing up the squared responses of a quadrature pair of simple
[Figure 2 plot: simple-cell response vs. disparity (degrees)]
Figure 2: Normalized disparity tuning curves of a simple cell with the position-shift-based RF model to two sets of independently generated random dot patterns. The peak locations of the two tuning curves from the same simple cell are very different. The main peaks of both curves do not correspond to the cell's shift parameter d (marked by the vertical line). Similar results were obtained from simple cells with the phase-parameter-based RF model (Qian 1994a). Therefore, simple cells with either type of RF model do not have well-defined disparity tuning curves and cannot have a CD. The parameters used in the simulations are σ = 2 degrees, ω₀/2π = 0.25 cycles per degree, and the relative shift between the left and the right RFs d = 1 degree. One degree was represented by 4 pixels in the simulations. Both sets of random dot patterns had a dot density of 50% and a dot size of 1 pixel. Each set was created by horizontally shifting two identical patterns with respect to each other by different distances.
cells (Freeman and Ohzawa 1990; Ohzawa et al. 1990). This quadrature pair method is a binocular generalization of that used previously in motion energy models (Adelson and Bergen 1985; Watson and Ahumada 1985) and has also been derived based on theoretical considerations by Qian (1994a). Specifically, two binocular simple cells are said to form a quadrature pair if the sinusoidal modulations of their left and right RFs both have a 90-degree phase difference while all the other parameters of the two cells are identical. Therefore, for the phase-parameter-based RF description, two cells form a quadrature pair if their left and right phase parameters in equations 2.1 and 2.2 are related by:

φ_{l,2} = φ_{l,1} + π/2   (2.6)

φ_{r,2} = φ_{r,1} + π/2   (2.7)
where the subscripts 1 and 2 label the parameters of the two cells in the pair. For the position-shift type of RF description described by equations 2.3 and 2.4, there is a common phase parameter φ for both the left and right RFs, and this parameter is related by:

φ₂ = φ₁ + π/2   (2.8)
for a quadrature pair of simple cells. The response of a complex cell constructed from a single quadrature pair is then calculated according to:

r_q = (r_{s,1})² + (r_{s,2})²   (2.9)
where r_{s,1} and r_{s,2} are the responses of the two simple cells in the pair. Note that one can also replace the plus signs in equations 2.6 to 2.8 by minus signs without changing the response of a quadrature pair, since such a transformation merely reverses the sign of the simple cell responses (see equations 2.1 to 2.5). Based on both physiological and computational grounds, we add one final step to the above response model for complex cells: instead of using a single quadrature pair of simple cells to compute the response of a complex cell, we perform a weighted average of several quadrature pairs with nearby and overlapping RFs. All other parameters are identical among these quadrature pairs. Mathematically, the complex cell response is given by:
r_c = r_q * w   (2.10)
where r_q is the response of a single quadrature pair given by equation 2.9, w is a spatial weighting function, and * denotes the convolution operation. This procedure can be viewed as an implementation of the physiological fact that complex cells have, on average, somewhat larger RFs than those of simple cells (Hubel and Wiesel 1962; Schiller et al. 1976). (Without this pooling step, a complex cell constructed from a single quadrature pair would have the same RF size as that of the constituent simple cells.) Computationally, this averaging step makes the disparity tuning of the resulting complex cells much more reliable (see the Appendix; Qian and Zhu 1995). In our simulations, the weighting function w was chosen to be a symmetric two-dimensional (2D) gaussian. We found that the disparity tuning curves of the complex cells are not very sensitive to the width σ_w of the gaussian so long as it is larger than 1 pixel. (The sampling artifacts associated with a very narrow gaussian are not important here because any reasonable weighting function can be used for the pooling step.) Equations 2.6 to 2.10 specify a complete response model for complex cells. We can now calculate the disparity tuning curves of complex cells constructed from simple cells with either the phase-parameter-based or the position-shift-based RF description. As we show in the Appendix, some general analyses can be made without explicit knowledge of the
[Figure 3 plot: complex-cell response vs. disparity (degrees)]
Figure 3: Normalized disparity tuning curves of a complex cell with the position-shift-based RF model to two sets of independently generated random dot patterns. Unlike the simple cell shown in Figure 2, the two tuning curves of the complex cell have very similar shapes and nearly identical peak locations. Similar results (not shown) were obtained with the phase-parameter-based RF models. Therefore, complex cells with either type of RF model have well-defined disparity tuning curves. The complex cell was constructed from simple cells with the same parameters as the simple cell in Figure 2. The spatial weighting function w (see equation 2.10) was chosen to be a 2D gaussian with σ_w equal to 2 pixels. The random dot patterns were generated in the same way as those used in Figure 2.
input stimuli. The analytical results indicate that, unlike simple cells, the responses of complex cells depend only on the Fourier amplitudes of the input stimuli, not on their Fourier phases. This is true for both the phase-parameter- and the position-shift-based RF descriptions. Therefore, complex cells with either type of RF description do not suffer from the Fourier phase problem, and they have well-defined disparity tuning curves. We have performed computer simulations to confirm this conclusion. An example for a complex cell with the position-shift-based RF model is shown in Figure 3. The figure shows the disparity tuning curves of the complex cell to two sets of independently generated spatial noise patterns. Unlike the simple cell shown in Figure 2, the two tuning curves of the complex cell have very similar shapes and nearly identical peak locations. Similar results (not shown) have been obtained for complex cells with the phase-parameter-based RF model.
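The complex-cell model of equations 2.6 to 2.10 can be sketched as follows. The π/2 offsets, squaring, and gaussian pooling follow the text; the 1D setting and all numerical values are illustrative simplifications.

```python
import numpy as np

def simple_r(fl, fr, Il, Ir):
    return np.sum(fl * Il) + np.sum(fr * Ir)

def pair_energy(env, w0, x, phi_l, phi_r, Il, Ir, offset=np.pi / 2.0):
    # Equation 2.9: sum of squared responses of a quadrature pair whose
    # left and right phases both differ by `offset` (90 degrees).
    r1 = simple_r(env * np.cos(w0 * x + phi_l),
                  env * np.cos(w0 * x + phi_r), Il, Ir)
    r2 = simple_r(env * np.cos(w0 * x + phi_l + offset),
                  env * np.cos(w0 * x + phi_r + offset), Il, Ir)
    return r1**2 + r2**2

def pool(pair_energies, sigma_w):
    # Equation 2.10 (1D sketch): convolve the energies of nearby
    # quadrature pairs with a normalized gaussian weighting function w.
    h = int(3 * sigma_w) + 1
    t = np.arange(-h, h + 1)
    w = np.exp(-t**2 / (2.0 * sigma_w**2))
    return np.convolve(pair_energies, w / w.sum(), mode="same")

x = np.linspace(-8.0, 8.0, 257)
env = np.exp(-x**2 / 8.0)
w0 = 2.0 * np.pi * 0.25
I = np.cos(0.7 * x + 1.3)                      # arbitrary stimulus, both eyes
rq = pair_energy(env, w0, x, 0.0, -np.pi / 2.0, I, I)
```

The sign-flip invariance noted in the text (replacing +π/2 by −π/2) is easy to check here: flipping `offset` merely negates the second simple-cell response, which the squaring removes.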
The existence of reliable disparity tuning curves is a necessary but not sufficient condition for the existence of a CD. We next examine whether model complex cells can have CDs similar to those found in real cells recorded by Wagner and Frost (1993), and how the results depend on the type of RF model.

2.3 Complex Cells with Position-Shift-Based RF Model. We first consider complex cells constructed from simple cells with the position-shift-based RF model. Since CDs are defined according to the peak locations of the noise and grating disparity tuning curves, we calculate complex cell responses to these stimuli. It can be shown that the response of a complex cell to a spatial noise pattern with disparity D is given by (see the Appendix):
r_c(D) ∝ ρ² exp(−(D − d)²/4σ²) [1 + cos(ω₀(D − d))]   (2.11)

where ρ denotes the Fourier amplitude of the stimulus (a frequency-independent constant for noise patterns). σ, ω₀, and d are the intrinsic parameters of the simple cells used to construct the complex cell. They are, respectively, the gaussian width, the preferred spatial frequency, and the shift between the left and right RFs of the simple cells. The disparity tuning curve of the complex cell to spatial noise patterns can be obtained by plotting equation 2.11 as a function of stimulus disparity D while keeping all other parameters constant. One such plot is shown in Figure 4a. According to equation 2.11, a complex cell with the position-shift-based RF profiles responds optimally when the disparity D of the spatial noise stimulus is equal to the relative displacement d between the two RFs. Therefore, the disparity tuning curve has a main peak at D = d. It also has side peaks at D = d ± 2πn/ω₀, where n = 1, 2, .... The distance between any two adjacent peaks in the tuning curve is equal to the preferred spatial period of the cell (2π/ω₀). The side peaks decay away with increasing difference between the disparity D and the cell's shift parameter d according to the gaussian term in equation 2.11. The response of the complex cell to a sine wave grating with spatial frequency Ω is given by (see the Appendix):

r_c(D) = c exp(−σ²(Ω − ω₀)²) [1 + cos(Ω(D − d))]   (2.12)

where c is a constant independent of disparity D. A plot of equation 2.12 is shown in Figure 4b. The gaussian term in equation 2.12 determines the spatial frequency tuning of the cell; only those gratings with frequencies (Ω) near the cell's preferred frequency (ω₀) can elicit good responses from the cell. Note that unlike equation 2.11, the gaussian term in equation 2.12 is not a function of stimulus disparity D. It therefore contributes only a global scaling factor to the disparity tuning curve. The shape of
[Figure 4 plots: analytical tuning curves (position-shift model); (a) noise, (b) sinusoidal gratings; response vs. disparity (degrees)]
Figure 4: Normalized disparity tuning curves of a complex cell with the position-shift-based RF model plotted according to the analytical results in equations 2.11 and 2.12. (a) Tuning curve to spatial noise patterns. (b) Tuning curves to sinusoidal gratings with spatial frequencies (Ω/2π) equal to 0.154 (solid line), 0.25 (dotted line), and 0.4 (dashed line) cycles per degree, respectively. The cell parameters are σ = 2 degrees, ω₀/2π = 0.25 cycles per degree, and the relative shift between the left and the right receptive fields d = 1 degree. These curves show a CD (marked by the vertical line) at D = 1 degree.

the disparity tuning curve is determined by the periodic cosine term in equation 2.12. (The periodicity of the grating tuning curve is expected because as the disparity between the left and right gratings reaches one full cycle, the two gratings become identical, and the disparity falls back to zero.) For a given frequency Ω of the grating, the tuning curve has
many evenly spaced peaks of the same height, with the distance between any two adjacent peaks equal to the spatial period of the grating (2π/Ω). Tuning curves obtained under different grating frequencies have different spacings between their peaks and, in general, different peak locations. However, according to equation 2.12 there is always a response peak at D = d for all grating frequencies (Ω). This is also the location of the main peak in the noise tuning curve of the same cell (see equation 2.11). We therefore conclude that a complex cell with the position-shift-based RF description has a CD equal to the shift parameter d. To see the above conclusion graphically, we have plotted the normalized disparity tuning curves to spatial noise patterns and to sinusoidal gratings of three different frequencies using the analytical expressions in equations 2.11 and 2.12. The results are shown in Figure 4. The gaussian width σ and the preferred spatial frequency (ω₀/2π) of the constituent simple cells' binocular RFs are 2 degrees and 0.25 cycles per degree, respectively, and the shift parameter d is 1 degree. As expected, the cell has a CD of 1 degree, which is marked by the vertical line in the figure. The above set of parameters was chosen for illustrative purposes. Other sets of parameters work equally well. For example, to model a parafoveal cell with small RFs, we could scale down the above set of parameters by a factor of, say, 10. The resulting complex cell would have RFs with a σ equal to 0.2 degree and a CD equal to 0.1 degree. This comment applies to all the analytical and simulated results throughout the paper. Reasonable approximations were used in deriving the analytical results above (see the Appendix). To check the accuracy of our analyses, we have also performed numerical simulations. An example is shown in Figure 5, where normalized noise and grating disparity tuning curves of a complex cell with the position-shift-based RFs are plotted.
For the purpose of comparison, we have chosen the parameters in the simulations to be identical to those for plotting the analytical results in Figure 4. For cells with RF gaussian width (σ) equal to 2 degrees, 4 pixels were used in our simulations to represent 1 degree of visual angle, and 65 pixels were used to describe each RF profile. Input stimuli with different disparities were generated by shifting a pair of fixed patterns relative to each other by different horizontal distances. The complex cell responses were computed by averaging over adjacent quadrature pairs through a 2D gaussian weighting function with σ_w equal to 2 pixels. This means that the RF dimension of the complex cell is 25% larger than that of the constituent simple cells. The simulation results shown in Figure 5 are in good agreement with our analytical derivations plotted in Figure 4; both indicate that the cell has a CD at 1 degree. The calculated tuning curves in Figure 5 are very similar to those of the example cell shown in Figure 2 of Wagner and Frost (1993). A difference, however, is that in Figure 5, one set of the peaks in the grating tuning curves coincides exactly at the CD location, while for the real cell
[Figure 5 plots: simulated tuning curves (position-shift model); (a) noise, (b) sinusoidal gratings; response vs. disparity (degrees)]
Figure 5: Normalized disparity tuning curves of a complex cell with the position-shift-based RF model obtained through numerical simulations. (a) Tuning curve to spatial noise patterns. The patterns were random dot patterns with a dot density of 50% and a dot size of 1 pixel. Different image disparities were generated by shifting two identical patterns with respect to each other by different distances. (b) Tuning curves to sinusoidal gratings with spatial frequencies (Ω/2π) equal to 0.154 (solid line), 0.25 (dotted line), and 0.4 (dashed line) cycles per degree, respectively. The cell parameters are identical to those used in Figure 4. The simulated results are in good agreement with the analytical results in Figure 4; both indicate a CD of 1 degree.
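The qualitative structure of the position-shift analytical results (for noise, a gaussian centered at D = d multiplying a cosine of period 2π/ω₀, equation 2.11; for gratings, a disparity-independent frequency gaussian multiplying a cosine, equation 2.12) can be checked numerically. The functional forms below are a sketch consistent with that description, with all prefactors dropped; they are not claimed to be the paper's exact expressions.

```python
import numpy as np

def noise_tuning(D, sigma, w0, d):
    # Sketch of eq. 2.11 (constants dropped): gaussian centered at D = d
    # times a cosine whose peaks repeat every 2*pi/w0.
    return np.exp(-(D - d)**2 / (4.0 * sigma**2)) * (1.0 + np.cos(w0 * (D - d)))

def grating_tuning(D, sigma, w0, d, Omega):
    # Sketch of eq. 2.12: the gaussian depends on Omega, not on D, so it
    # only scales the cosine; a peak sits at D = d for every Omega.
    return np.exp(-sigma**2 * (Omega - w0)**2) * (1.0 + np.cos(Omega * (D - d)))

sigma, w0, d = 2.0, 2.0 * np.pi * 0.25, 1.0    # parameters of Figures 4-5
D = np.linspace(-6.0, 6.0, 4801)
noise = noise_tuning(D, sigma, w0, d)
peak = D[np.argmax(noise)]                      # main peak falls at D = d
```

With these parameters the noise-curve peak lands at D = 1 degree, and every grating frequency produces a response peak at the same location, which is exactly the CD behavior described in the text.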
(Figure 2 in Wagner and Frost 1993), the peak of the curve with the lowest of the three spatial frequencies is significantly shifted rightward. We will return to this point.

2.4 Complex Cells with Phase-Parameter-Based RF Model. Similarly, we can calculate the disparity tuning curves for complex cells with the phase-parameter-based RFs described by equations 2.1 and 2.2. The details of our mathematical analyses are presented in the Appendix. The response of a complex cell to a spatial noise pattern with disparity D is found to be:

r_c(D) ∝ ρ² exp(−D²/4σ²) [1 + cos(ω₀D − Δφ)]   (2.13)

where

Δφ = φ_l − φ_r   (2.14)

Here ρ is again the constant Fourier amplitude of the noise stimulus. σ, ω₀, and Δφ are the intrinsic parameters of the simple cells used to construct the complex cell. They are, respectively, the gaussian width, the preferred spatial frequency, and the left-right phase parameter difference of the simple cells. A plot of this equation is shown in Figure 6a. The cosine term in equation 2.13 is similar to that of equation 2.11; it has peaks located periodically at D = Δφ/ω₀ ± 2πn/ω₀, where n = 0, 1, 2, .... The ratio Δφ/ω₀ here is equivalent to the shift d of the position-shift model. Unlike equation 2.11, however, the gaussian term in equation 2.13 is always centered at D = 0. Consequently, the main peak of equation 2.13 is the peak of the cosine term that is closest to D = 0. Since the cosine term has peaks occurring periodically with a period equal to the preferred spatial period of the cell (2π/ω₀), the main peak has to fall in the range [−π/ω₀, π/ω₀]. We conclude that for a complex cell with the phase-parameter-based RF model, the main peak of its noise disparity tuning curve is always larger than the negative half preferred spatial period and smaller than the positive half preferred spatial period of the same cell. This relation is shown schematically in Figure 7. Such a constraint does not exist for complex cells with the position-shift-based RF model. Note that a constraint similar to that shown in Figure 7 has been proposed previously by Marr and Poggio (1979). However, we derived the constraint by analyzing a physiologically determined complex cell model, while they reached the conclusion through the nonphysiological procedure of explicitly matching the zero crossings in the left and right images (see Qian 1994a). Since equation 2.13 is invariant when Δφ is replaced by Δφ ± 2π, without loss of generality we can restrict Δφ to be within the range [−π, π]. Under this convention, the main peak of equation 2.13 will always be at D = Δφ/ω₀, and the side peaks at D = Δφ/ω₀ ± 2πn/ω₀, where n = 1, 2, ....
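The half-period constraint on the main peak can be verified numerically under the same sketched form (gaussian now centered at D = 0 times a cosine). As before, this functional form is a sketch consistent with the text's description, with constants dropped.

```python
import numpy as np

def noise_tuning_phase(D, sigma, w0, dphi):
    # Sketch of eq. 2.13 (constants dropped): gaussian centered at D = 0
    # times a cosine with peaks at D = dphi/w0 + 2*pi*n/w0.
    return np.exp(-D**2 / (4.0 * sigma**2)) * (1.0 + np.cos(w0 * D - dphi))

sigma, w0 = 2.0, 2.0 * np.pi * 0.25
D = np.linspace(-6.0, 6.0, 4801)
peaks = []
for dphi in (-0.9 * np.pi, -1.0, 0.5, 0.9 * np.pi):   # dphi within [-pi, pi]
    peaks.append(D[np.argmax(noise_tuning_phase(D, sigma, w0, dphi))])
# Every main peak falls in [-pi/w0, pi/w0], i.e., within half a preferred
# spatial period (2 degrees here) of zero disparity.
```

Because the gaussian is centered at zero, the main peak is always pulled slightly toward D = 0 relative to the nearest cosine peak, but it can never escape the half-period range; no such bound applies to the position-shift form.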
[Figure 6 plots: analytical tuning curves (phase-parameter model); (a) noise, (b) sinusoidal gratings; response vs. disparity (degrees)]
Figure 6: Normalized disparity tuning curves of a complex cell with the phase-parameter-based RF model plotted according to the analytical results in equations 2.13 and 2.15. (a) Tuning curve to spatial noise patterns. (b) Tuning curves to sinusoidal gratings with spatial frequencies (Ω/2π) equal to 0.154 (solid line), 0.25 (dotted line), and 0.4 (dashed line) cycles per degree, respectively. The set of cell parameters was chosen to match closely those used in Figures 4 and 5 for the position-shift case, with σ = 2 degrees, ω₀/2π = 0.25 cycles per degree, and Δφ = π/2. These curves show an approximate CD (marked by the vertical line) at D = 1 degree. Note that the peak locations of the grating tuning curves show a systematic deviation around the CD similar to the real cell in Figure 2 of Wagner and Frost (1993).
[Figure 7 plot: CD vs. preferred spatial period; hypothetical data points lie between the two dashed constraint lines of slope 1/2 and −1/2]
Figure 7: Constraint on the main peak locations (CDs) of complex cells' noise disparity tuning curves under the phase-parameter-based RF model. According to equation 2.13, the main peak of the noise disparity tuning curve of a cell should be larger than the negative half of its preferred spatial period and smaller than the positive half of its preferred spatial period. This constraint is represented by the two dashed lines in the figure. Each filled dot in the figure represents a hypothetical data point from a complex cell that satisfies this constraint. This constraint does not apply to complex cells with the position-shift-based RF model.

We will adopt this convention for the rest of the paper. The side peaks decay away with increasing disparity D according to the gaussian term in equation 2.13. The response of the phase-parameter-based complex cell to a sinusoidal grating with spatial frequency Ω is given by (see the Appendix):
r_c(D) = c exp(−σ²(Ω − ω₀)²) [1 + cos(ΩD − Δφ)]   (2.15)
A plot of equation 2.15 is shown in Figure 6b. According to this equation, for gratings with frequency Ω, one peak of the disparity tuning curve occurs at D = Δφ/Ω. Unlike complex cells with the position-shift-based RFs (see equation 2.12), this peak location is not completely determined by the intrinsic parameters of the cell but depends on the grating frequency Ω. Consequently, the peaks of tuning curves from gratings of different frequencies will not coincide. Therefore, strictly speaking, one cannot define a CD for complex cells with the phase-parameter-based RF descriptions.
However, equation 2.15 has a gaussian term that determines the spatial frequency tuning of the cell; only those gratings with frequencies (Ω) around the preferred frequency (ω₀) of the cell can elicit good responses from the cell. If one probes the cell only with Ω's around ω₀ in order to get good responses, then one set of peaks of the grating tuning curves will be distributed closely around the disparity Δφ/ω₀. Since this is also the location of the main peak of the disparity tuning curve to noise patterns, for practical purposes we can define an approximate CD equal to Δφ/ω₀ for complex cells with the phase-parameter-based RFs. Note that the above argument relies on the fact that real V1 cells are usually very well tuned to their preferred spatial frequencies. For complex cells with broader frequency tuning, the CD will become less well defined under the phase-parameter model. To see the above argument more clearly, we have plotted the normalized disparity tuning curves to spatial noise patterns and to sinusoidal gratings of three different frequencies using the analytical expressions in equations 2.13 and 2.15 (see Figure 6). The set of cell parameters was chosen to match closely those used in Figures 4 and 5 for the position-shift case. Specifically, the gaussian width (σ) and the preferred spatial frequency (ω₀/2π) are the same as those used in Figures 4 and 5. The left-right phase parameter difference Δφ is π/2, so that the expected CD (Δφ/ω₀) of the complex cell is 1 degree, the same as the CD value in Figures 4 and 5. As can be seen from Figure 6, the cell indeed has an approximate CD of 1 degree. The tuning curves in Figure 6 capture the main features of the example cell in Wagner and Frost (1993). We therefore conclude that the mere existence of an approximate CD in real cells should not be taken as evidence against the phase-parameter-based RF description.
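The relationship between the approximate CD (Δφ/ω₀) and the individual grating peaks (Δφ/Ω) is simple arithmetic. A hypothetical worked example with the Figure 6 parameters:

```python
import numpy as np

w0 = 2.0 * np.pi * 0.25        # preferred frequency: 0.25 cycles/degree
dphi = np.pi / 2.0             # left-right phase difference
cd = dphi / w0                 # approximate CD: 1.0 degree

# Peaks of the grating tuning curves nearest the CD occur at dphi/Omega,
# so they cluster around the CD only for Omega near w0.
grating_peaks = {cpd: dphi / (2.0 * np.pi * cpd) for cpd in (0.154, 0.25, 0.4)}
# 0.154 c/deg -> peak right of the CD; 0.25 -> at the CD; 0.4 -> left of it.
```

This tiny calculation already exhibits the systematic deviation pattern: peaks from frequencies below ω₀ land beyond the CD, and peaks from frequencies above ω₀ land short of it.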
Note that there is a systematic deviation of the peak locations around the CD for the grating disparity tuning curves: the peak location shifts rightward with decreasing spatial frequency of the grating. (For negative CDs, the peak locations will shift leftward with decreasing spatial frequency.) A similar deviation is also present in the example cell reported by Wagner and Frost (1993). This systematic deviation is not predicted by complex cells with the position-shift-based RF description. In the expanded version of their paper, Wagner and Frost (1994) showed in their Figure 9b that the peaks of the sinusoidal tuning curves shift with the grating frequency for the majority of the cells. The peak deviations of the grating disparity tuning curves can be easily understood based on the above discussion of equation 2.15. Note that the CD is defined at the peak location of the noise disparity tuning curve, Δφ/ω₀, while the peaks of the grating disparity tuning curves actually occur at Δφ/Ω. Therefore, for a grating tuning curve with a spatial frequency Ω smaller (larger) than the preferred frequency ω₀ of the cell, its peak location around the CD will be further away from (nearer to) D = 0 than the CD. For the grating tuning curve with a frequency Ω
equal to the preferred frequency ω₀ of the cell, it has a peak precisely at the CD. Again, we performed numerical simulations in order to check the accuracy of our analyses. The simulation results (not shown) are in good agreement with our theoretical analyses.

2.5 How to Distinguish the Two Types of RF Models. We concluded above that the existence of an approximate CD in real cells should not be taken as evidence for rejecting the phase-parameter-based RF description. Our results also suggest methods for correctly distinguishing the two RF models. One method is to examine whether the peaks of grating tuning curves align precisely at the CD, as shown in Figures 4 and 5, or whether the alignment is only approximate, with a systematic deviation, as shown in Figure 6. If the systematic deviation exists in a real complex cell, this is clear evidence that the cell cannot be described by a purely position-shift-based RF model, since such a model always predicts a precise alignment. A potential problem with this method is that errors in the experimental measurements may render such a comparison impossible. This problem can be alleviated by recording from cells with high firing rates and by using high-contrast gratings with frequencies as different from the preferred frequency of the cell as possible. The second method is to examine the relation between the CD and the preferred spatial period (2π/ω₀) of the same complex cell. As we discussed in relation to equation 2.13, with the phase-parameter-based RF description, the main peak (and therefore the CD) of a complex cell's noise tuning curve is always in the range [−π/ω₀, π/ω₀] (see Fig. 7). Such a constraint between the CD and the preferred spatial period does not exist for the position-shift-based RF description. If a cell's CD and preferred spatial frequency violate this constraint, this is a clear indication that the cell's RF cannot be described by a purely phase-parameter type of RF model.
On the other hand, if the constraint is obeyed by the real cell, the situation is less conclusive; one can always argue that although the position-shift type of RF does not impose such a relationship, it could happen by chance, or it could be due to some other reason. However, if the CDs of a large population of real cells all satisfy the constraint, this would be strong evidence for the phase-parameter-based RF model. The third method for distinguishing the two RF models is to compare the heights of the two side peaks surrounding the main peak in the noise tuning curve. The position-shift model predicts an equal decay of amplitude on either side of the main peak (see equation 2.11 and Figure 4), while the phase-parameter model predicts that the side peak closer to zero disparity should be higher than the one further away, because the gaussian decay term is centered at zero disparity (see equation 2.13 and Fig. 6). Furthermore, the phase-parameter model predicts that the height difference between the two side peaks should increase with the value of the CD, decrease with the receptive field size, and decrease with the preferred spatial frequency. Although the stochastic nature of the spatial noise pattern
may by itself introduce some small variations in the heights of the side peaks (see the simulated noise tuning curve in Fig. 5), it should still be possible to apply the test to a large number of real cells and to examine whether there is a significant trend over the population. In this connection, it is interesting to note that when this test is applied to the three reported noise tuning curves with clear side peaks (Fig. 2 of Wagner and Frost 1993 and Figs. 7 and 8a of Wagner and Frost 1994), in all three cases the side peak closer to zero disparity is higher than the one further away, suggesting that the phase-parameter model is favored. The results from the above tests could be contradictory, with some test(s) favoring one RF model and the remaining test(s) favoring the other. If this happens, a hybrid RF model should be considered.

2.6 A Hybrid RF Model with Position Shift and Phase Parameters. The experiments by Freeman et al. were performed on anesthetized cats. Consequently, the absolute spatial correspondence between the left and right RF profiles of a cell cannot be accurately determined, although the shape of each RF profile can be measured with high precision. It is only an assumption that the two gaussian envelopes are aligned exactly at the corresponding retinal positions. It is therefore possible that real complex cells may use a combination of the phase-parameter- and position-shift-based binocular RFs to encode disparity. It can be shown that for a complex cell constructed from such a hybrid RF model, the disparity tuning functions to noise patterns and sine wave gratings can be obtained by replacing D in equations 2.13 and 2.15 by (D − d):

r_c(D) ∝ ρ² exp(−(D − d)²/4σ²) [1 + cos(ω₀(D − d) − Δφ)]   (2.16)

r_c(D) = c exp(−σ²(Ω − ω₀)²) [1 + cos(Ω(D − d) − Δφ)]   (2.17)
Thus, if one probes the cell with grating frequencies (Ω) around the cell's preferred frequency (ω₀), the cell will appear to have an approximate CD equal to the sum of the contributions from the positional shift and the phase parameters: CD ≈ d + Δφ/ω₀. There will still be a systematic deviation of the peak locations around the CD for the grating disparity tuning curves, but the relative deviation with respect to the magnitude of CD will be smaller than that in a purely phase-parameter-based approach because now the phase parameters contribute only part of the total CD. For a fixed d, the value of the CD will now fall in the range of [d − π/ω₀, d + π/ω₀]; the constraint shown in Figure 7 should be displaced along the vertical axis by d. For real complex cells, it is also easy to estimate the relative contributions of the position shifts and phase parameters to their disparity tuning. Assume a cell's disparity tuning is generated by a position shift
Yu-Dong Zhu and Ning Qian
d and a phase parameter difference Δφ between its left and right RFs. By measuring the peak location (D₁) of its disparity tuning curve to spatial noise patterns, we have the relation

$$D_1 = d + \frac{\Delta\phi}{\omega_0} \tag{2.18}$$
according to equation 2.16. Next, we measure the peak location (D₂) of one grating tuning curve (with grating frequency Ω) near the CD and have another equation:

$$D_2 = d + \frac{\Delta\phi}{\Omega} \tag{2.19}$$

according to equation 2.17. The preferred spatial frequency ω₀ of the cell can be measured separately from the cell's spatial frequency tuning curve, or it can be estimated from the spacing between the peaks in the noise disparity tuning curve (which is equal to 2π/ω₀ according to equation 2.16). We can therefore solve for d and Δφ from equations 2.18 and 2.19. It is interesting to note that the example cell in Wagner and Frost (1993) can be best modeled by a mixed RF description. That cell showed a deviation in the peak locations around the CD in its grating tuning curves. It therefore cannot be explained by a pure position-shift-based RF model. In addition, its CD was larger than half of its preferred spatial period (the preferred spatial period of the cell was not stated by the authors, but it should be approximately equal to the period of its noise tuning curve according to equation 2.16) and therefore cannot be explained by a purely phase-parameter-based RF model. Only a hybrid model can account for both aspects. We have performed computer simulations to model the tuning curves of this cell with the mixed RF descriptions. The left-right phase difference Δφ is chosen to be π/2, and the position-shift parameter d is 1.5 degrees. These parameters were determined according to the method described in the previous paragraph (D₂ was measured from the grating tuning curve with the lowest spatial frequency). The gaussian width σ is set to be 1 degree, and the preferred spatial frequency (ω₀/2π) is 0.5 cycle per degree. The results are shown in Figure 8. The cell has an approximate CD of 2 degrees, as expected based on our analyses. The tuning curves compare well with those of the real cell in Figure 2 of Wagner and Frost (1993).
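Equations 2.18 and 2.19 form a two-by-two linear system in d and Δφ. A minimal sketch of the solve (the function name and the numerical values are ours, chosen to match the simulated cell described in the text; frequencies are in radians per degree):

```python
import numpy as np

def split_cd(D1, D2, w0, Omega):
    """Solve D1 = d + dphi/w0 and D2 = d + dphi/Omega for (d, dphi)."""
    dphi = (D1 - D2) / (1.0 / w0 - 1.0 / Omega)
    d = D1 - dphi / w0
    return d, dphi

# Simulated cell from the text: dphi = pi/2, d = 1.5 deg, preferred
# frequency 0.5 cycle/deg (w0 = pi rad/deg), grating at 0.25 cycle/deg.
w0, Omega = np.pi, 0.5 * np.pi
D1 = 1.5 + (np.pi / 2) / w0      # noise-tuning peak, per equation 2.18
D2 = 1.5 + (np.pi / 2) / Omega   # grating-tuning peak, per equation 2.19
d, dphi = split_cd(D1, D2, w0, Omega)
```

Recovering d = 1.5 degrees and Δφ = π/2 from the two measured peak locations mirrors the estimation procedure described above.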
Although this demonstrates the requirement of a hybrid model for describing the RF profiles of this particular cell, a general conclusion can be drawn only after examining a large number of real cells. In the expanded version of their paper, Wagner and Frost (1994) reported recordings from a few more cells. Unfortunately, none of the recordings contained a complete set of tuning curves to allow a similar analysis as we have done above. For example, their Figure 6 does not contain the cell's noise tuning curve, which is needed in order to determine the CD location and the preferred spatial frequency of the cell.
[Figure 8: Simulated Tuning Curves (Mixed Model). (a) Noise. (b) Sinusoidal Gratings.]
Figure 8: Simulation of the disparity tuning curves of the real cell in Figure 2 of Wagner and Frost (1993). A complex cell with a mixed RF model was used, and the computed tuning curves were presented in a format similar to that for the real cell. (a) Normalized tuning curve to spatial noise patterns. The patterns were generated in the same way as those in Figure 5. (b) Normalized disparity tuning curves to sinusoidal gratings with spatial frequencies equal to 0.25 (solid line), 0.4 (dotted line), and 0.667 (dashed line) cycles per degree, respectively. These frequencies correspond to the effective grating periods (4, 2.5, and 1.5 degrees) used for the real cell. The tuning curves were deliberately truncated for better comparison with the real data. The peak locations of these curves agree well with those of the real cell.
Their Figure 10 showed a cell's noise tuning curve but only one grating tuning curve. If the spatial frequency of the grating was the cell's preferred frequency, then the phase-parameter model would also predict that the main peak of the noise tuning curve should line up with one of the peaks in the grating tuning curve.

2.7 An Examination of Wagner and Frost's Data Analysis. In addition to our suggestions of possible experiments for distinguishing the two RF models, our theoretical results can also be used to examine the data analysis method presented in Figure 3 of Wagner and Frost (1993). The authors first fitted each experimentally measured grating tuning curve with a cosine function of the form
$$\cos(\Omega_i D + \Phi_i) \tag{2.20}$$
where Ω_i and Φ_i are the frequency and phase of the ith grating tuning curve (Ω_i should be equal to the spatial frequency of the gratings used to obtain the ith grating tuning curve). They then calculated a mean disparity value (called MD in their paper) by using the main peak location of the noise tuning curve and the peaks of the grating tuning curves near the main peak. After that, they estimated the phase (called MP in their paper) predicted by the phase-parameter and the position-shift models according to

$$\mathrm{MP}_{phase} = \omega_0\,\mathrm{MD} \tag{2.21}$$

$$\mathrm{MP}_{shift} = \Omega_i\,\mathrm{MD} \tag{2.22}$$

where ω₀ is the preferred spatial frequency of the cell. Finally, for the cells they recorded, they calculated the squared deviations between the measured phases Φ_i and the predicted phases (MP) for each RF model. Since the deviation for the phase-parameter model is larger than that for the position-shift model, they concluded that the position-shift model is preferable (see Fig. 3b of Wagner and Frost 1993). We now examine their analysis in the light of our theoretical results. First, under the assumption of the position-shift RF model, the mean disparity MD should be simply the relative shift d between the left and right RFs. The predicted phase MP_shift is therefore equal to Ω_i d. This is the correct phase of the grating tuning curve according to our equation 2.12. On the other hand, if we assume the phase-parameter model is correct, MD can be expressed as (see equations 2.13 and 2.15)
$$\mathrm{MD} = \frac{\Delta\phi}{N+1}\left(\frac{1}{\omega_0} + \sum_{i=1}^{N}\frac{1}{\Omega_i}\right) \tag{2.23}$$
where N is the number of grating tuning curves used in the calculation. Therefore, MP_phase does not yield the correct predicted phase of the grating
tuning curve, which should be Δφ according to our equation 2.15, unless

$$\frac{\omega_0}{N+1}\left(\frac{1}{\omega_0} + \sum_{i=1}^{N}\frac{1}{\Omega_i}\right) = 1. \tag{2.24}$$

Unfortunately, this relation is not generally satisfied. Assuming that in the actual experiments the Ω_i's were chosen symmetrically around the preferred frequency ω₀, one can then show that MP_phase gives an overestimation of the phase in the grating tuning curves because

$$\mathrm{MP}_{phase} = \omega_0\,\mathrm{MD} = \frac{\omega_0\,\Delta\phi}{N+1}\left(\frac{1}{\omega_0} + \sum_{i=1}^{N}\frac{1}{\Omega_i}\right) > \Delta\phi. \tag{2.25}$$

(The inequality can be proved for any positive ω₀ and Ω_i that are not all identical to each other.) Since the squared deviation (Wagner and Frost 1993) of the phase-parameter model is already quite small (although larger than that of the position-shift model), even a small bias in the estimation of the model prediction could have significant consequences. Obviously, the correct calculation of the predicted phase under the phase-parameter model should use only the main peak location of the noise tuning curve. There is a potentially more fundamental problem with the data analysis in Figure 3 of Wagner and Frost (1993): The authors did not first classify cells into simple and complex and then exclude simple cells from their CD analysis. As we have shown, although one can measure disparity tuning curves from simple cells, their CDs are undefined no matter which RF model one chooses. In addition, in the expanded version of Wagner and Frost (1993) the authors mentioned that during their experiments, "single units were difficult to isolate" and that the majority of the recordings were multiunit (Wagner and Frost 1994). Consequently, many of their measured tuning curves were the average from several different cells. It is inappropriate to apply CD analysis to multiunit recordings unless the cells in a given recording were all complex and all had identical disparity tuning. We conclude that the existing physiological data by Wagner and Frost do not allow a clear distinction of the two RF models.

3 Discussion
In this paper, we have thoroughly analyzed the disparity tuning behavior of binocular simple and complex cells with both the position-shift-based and the phase-parameter-based RF descriptions. Besides the general interest of relating disparity tuning behavior of a cell to its RF structures, our work also addresses the specific question of which cell type and RF structure are most consistent with the CD data by Wagner and Frost (1993). We have derived analytical expressions for the disparity tuning curves for both simple and complex cells with either type of RF model. We have also confirmed our analyses through computer simulations. Our
results indicate that simple cells with either type of RF model cannot have a CD because these cells do not even have well-defined disparity tuning curves due to their dependence on stimulus Fourier phases. Model complex cells, on the other hand, do not suffer from this phase problem and have reliable disparity tuning curves. Furthermore, model complex cells with the position-shift-based RF description have a precise CD, and those with the phase-parameter-based RF description have an approximate CD. A testable prediction is that real cells found to have CDs should all be complex cells. We concluded based on these results that the mere existence of (approximate) CDs in real cells cannot be used to distinguish the phase-parameter-based RF description from the traditional position-shift-based RF description. It should be clarified that when we say that simple cells do not have well-defined disparity tuning curves, we do not mean that they do not have measurable disparity tuning; real simple cells do have disparity tuning (Bishop et al. 1971; Poggio and Fischer 1977). Rather, we mean that their disparity tuning curves change dramatically when the same type of stimuli with different Fourier phases are used in the measurements (see Fig. 2). There is experimental evidence suggesting that this is indeed the case. For example, Ohzawa et al. (1990) showed that disparity tuning curves of a simple cell measured with bright bars and dark bars are different (see also the Discussion in Qian 1994a). The simple cell model used in our analyses and simulations is identical to those proposed by Freeman et al. (Freeman and Ohzawa 1990; Ohzawa et al. 1990). The complex cell model we used, on the other hand, differs slightly from theirs.
One difference is only superficial: They separated the positive and negative responses of simple cells and therefore had four simple-type subunits in a quadrature pair, while we did not do the separation explicitly and had two simple cells in a quadrature pair. Mathematically, the two approaches are exactly equivalent. The real (and only) difference between our complex cell model and theirs is that we added a final spatial pooling step (see equation 2.10). The response of our model complex cell is therefore a weighted average of several quadrature pairs with nearby and overlapping RFs. Our analytical and simulation results (not shown) indicate that for the disparity tuning curves of bar stimuli measured and modeled by Freeman et al., adding or not adding the spatial pooling step does not make any difference (Qian and Zhu 1995). For spatial noise patterns, however, the disparity tuning curves computed with the pooling step added are much more reliable and independent of stimulus Fourier phases than without the pooling step (see the Appendix). Therefore, by experimentally testing the reliability of disparity tuning to noise patterns, one could potentially determine whether the spatial pooling operation is indeed employed by real complex cells. Also note that spatial pooling is just one of the pooling methods widely used in the computer vision literature. One could also pool responses across different spatial frequency scales (Marr and Poggio 1979; Grzywacz and
Yuille 1990; Fleet et al. 1995). However, we think spatial pooling is a natural choice for modeling complex cells because it accounts for the larger RF sizes of real complex cells and at the same time preserves complex cells' frequency tuning properties. In contrast, pooling across different spatial frequency scales would render complex cells much less sensitive (i.e., more broadly tuned) to spatial frequency than simple cells, contradictory to experimental data (Shapley and Lennie 1985). Frequency pooling is therefore most likely to occur at a stage beyond complex cells, perhaps at the level of the middle temporal area (Grzywacz and Yuille 1990). We have previously developed a physiologically realistic algorithm for disparity computation using the phase-parameter-based RF description (Qian 1994a). The algorithm relies on the same simple and complex cell response models as described in this paper and uses a population of complex cells to encode stimulus disparity. In fact, equation 2.13 in this paper is a more accurate derivation of the complex cell response than equation 2.8 in Qian (1994a). The only difference between the complex cell model presented here and the one used in Qian (1994a) is that we have added a spatial pooling step in this paper (see equation 2.10). The quality of the computed disparity maps from random dot stereograms with the pooling step added is significantly better than those without the pooling step, especially at disparity boundaries (Qian and Zhu 1995). Our algorithm for disparity computation also works with the position-shift-based RF description, since equation 2.11 indicates that a population of complex cells with the position-shift-based RF models can also form a distributed representation of stimulus disparity. We have performed computer simulations using the algorithm with both types of RF models.
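The effect of the spatial pooling step on noise-pattern tuning can be illustrated with a small 1D simulation. This is our own sketch, not the authors' code; the Gabor RF parameters (σ = 1 degree, ω₀ = π rad/deg, Δφ = π/2), the ±2 degree pooling window, and the trial count are illustrative assumptions:

```python
import numpy as np

# Phase-parameter-based Gabor RFs (illustrative parameters).
dx = 0.05                                   # grid spacing (degrees)
sigma, w0, dphi = 1.0, np.pi, np.pi / 2     # RF width, pref. freq., phase diff.
gabor = lambda x, ph: np.exp(-x**2 / (2 * sigma**2)) * np.cos(w0 * x + ph)
disp = np.arange(-3.0, 3.25, 0.25)          # tested disparities (degrees)

# (1) Ensemble-averaged tuning to unit-variance white noise, computed exactly:
# the expected energy is the squared norm of the combined left+right kernel.
xg = np.arange(-8.0, 8.0 + dx, dx)
def mean_tuning(D):
    e1 = gabor(xg, 0.0) + gabor(xg - D, dphi)                     # pair member 1
    e2 = gabor(xg, -np.pi / 2) + gabor(xg - D, dphi - np.pi / 2)  # quadrature
    return np.sum(e1**2) + np.sum(e2**2)
tuning = np.array([mean_tuning(D) for D in disp])

# (2) Trial-by-trial reliability, with and without the spatial pooling step.
rng = np.random.default_rng(0)
xf = np.arange(-5.0, 5.0 + dx, dx)
fL1, fL2 = gabor(xf, 0.0), gabor(xf, -np.pi / 2)
fR1, fR2 = gabor(xf, dphi), gabor(xf, dphi - np.pi / 2)
N, c0 = 1024, 512                           # periodic 1D "image", center index
pool = slice(c0 - 40, c0 + 41)              # pooling window: +/- 2 degrees
unpooled, pooled = [], []
for _ in range(20):                         # independent noise patterns
    I = rng.standard_normal(N)
    un, po = [], []
    for D in disp:
        Ir = np.roll(I, -int(round(D / dx)))            # right image I(x + D)
        r1 = np.correlate(I, fL1, 'same') + np.correlate(Ir, fR1, 'same')
        r2 = np.correlate(I, fL2, 'same') + np.correlate(Ir, fR2, 'same')
        E = r1**2 + r2**2                   # quadrature-pair energy
        un.append(E[c0])                    # single quadrature pair
        po.append(E[pool].mean())           # spatially pooled complex cell
    unpooled.append(un)
    pooled.append(po)
unpooled, pooled = np.array(unpooled), np.array(pooled)
cv_un = (unpooled.std(0) / unpooled.mean(0)).mean()   # trial-to-trial scatter
cv_po = (pooled.std(0) / pooled.mean(0)).mean()
```

With these illustrative parameters, the ensemble-averaged curve peaks near Δφ/ω₀ = 0.5 degree with the side peak nearer zero disparity higher than the far one, and the pooled single-trial curves scatter less across noise patterns than the unpooled ones, consistent with the text.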
The computed disparity maps from random dot stereograms (Qian 1994a) using the two different RF models are very similar to each other, and both agree well with the actual disparity map (results not shown). However, under certain conditions, the computed disparity maps using different RF models may be somewhat different. This is the case for sinusoidal grating stimuli. Our analyses indicate that for sinusoidal gratings of any frequencies, the position-shift-based algorithm should always give the actual disparity value of the stimuli (within one spatial period of the gratings). For the algorithm based on the phase parameters, on the other hand, the disparity of those gratings with high spatial frequencies will be underestimated, while those with low frequencies will be overestimated. This result provides an opportunity for distinguishing the two types of RF models via visual psychophysical experiments. One major limitation with the simple and complex cell models we used is that their responses are quadratic functions of stimulus contrast, and therefore they do not account for the contrast saturation behavior of real cells. This problem, however, can be readily fixed with a normalization procedure (Albrecht and Geisler 1991; Heeger 1992). Indeed, normalization methods have already been used by Fleet et al. (1995) to
account for experimental data on binocular contrast effects. The introduction of normalization will not affect the conclusions of this paper, however, because the normalization factor is a function of contrast but not a function of disparity (because it is obtained by summing over cells with all preferred disparities) and therefore will not change the peak locations of disparity tuning curves. Similarly, the normalization procedure will not affect our recent algorithm (Qian 1994a) for disparity computation either, because the algorithm relies only on the location of peak disparity responses. We have also suggested new methods for distinguishing the phase-parameter-based and the position-shift-based RF models. One method relies on the fact that the CD of the complex cells with the phase-parameter-based RF models can be defined only approximately. The peaks of the sinusoidal tuning curves spread around the main peak of the noise tuning curve in a systematic way. This type of systematic deviation is not predicted by the position-shift-based RF model. The second method observes that for the phase-parameter-based RF model the CD of a complex cell has to occur in a range restricted by the preferred spatial period of the cell while this restriction does not apply to the position-shift-based RF model. The third method compares the side peak heights in the noise tuning curve. We suggest that by applying these methods to a large number of real cells, a better understanding of the binocular RF structure could be obtained. If conflicting results are obtained with these methods, one should then consider the hybrid RF model containing both a position-shift and a phase-parameter difference between the left and right RFs. We showed how to determine the relative contributions of the position-shift and the phase-parameter difference for real complex cells.
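The peak-invariance argument for normalization is easy to see in a toy calculation (all numbers arbitrary): dividing a tuning curve by a factor that does not depend on disparity rescales it without moving its peak.

```python
import numpy as np

disparity = np.linspace(-2.0, 2.0, 81)
# Toy complex-cell disparity tuning curve (arbitrary shape).
response = np.exp(-disparity**2) * (1.0 + np.cos(np.pi * disparity))
# The normalization factor pools over cells of all preferred disparities,
# so it is a single number, identical for every stimulus disparity tested.
norm = 0.5 + response.sum()
normalized = response / norm
```

Because the division is by a disparity-independent scalar, the argmax of the curve, and hence any CD estimate based on peak locations, is unchanged.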
In conclusion, our work provides a thorough characterization of the disparity tuning of simple and complex cells under the two different types of RF descriptions (and their hybrid) suggested by previous physiological experiments. These results not only provide an explanation of many aspects of the recent physiological data of Wagner and Frost (1993) but also generate specific predictions that may help guide future experimental determination of the neural mechanisms of disparity selectivity.
We outline our derivation of the simple and complex cell response functions for the phase-parameter- and the position-shift-based RF models in this Appendix. For an arbitrary stimulus with a disparity D, its left and right retinal images can be written as:

$$I_l(x) = I(x), \qquad I_r(x) = I(x + D). \tag{A.1}$$
According to the Fourier theorem, a function I(x) and its Fourier transform Î(ω) = F(I(x)) are related by:

$$\hat{I}(\omega) = \int_{-\infty}^{+\infty} I(x)\,e^{-i\omega x}\,dx, \qquad I(x) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} \hat{I}(\omega)\,e^{i\omega x}\,d\omega. \tag{A.2}$$
In general, Î(ω) takes a complex value and can be expressed by an amplitude p(ω) and a phase θ(ω),

$$\hat{I}(\omega) = p(\omega)\,e^{i\theta(\omega)}. \tag{A.3}$$
In addition, the Fourier transform of I(x + D) is:

$$\mathcal{F}(I(x+D)) = \hat{I}(\omega)\,e^{i\omega D} \tag{A.4}$$
based on the definition of the Fourier transform. Substituting the above relations into the simple cell response model of equation 2.5, we have:

$$r_s = \int_{-\infty}^{+\infty} d\omega \int_{-\infty}^{+\infty} dx\; p(\omega)\,e^{i[\theta(\omega)+\omega x]}\left[f_l(x) + e^{i\omega D} f_r(x)\right]. \tag{A.5}$$
It should be pointed out that we do not assume that the cortical cells perform Fourier transformations. The technique is used in our calculations merely as a mathematical tool to analyze cells' responses. When the RF profiles f_l(x) and f_r(x) are specified, the spatial dependence of the integrand in equation A.5 is completely known. This allows us to carry out the integration over the variable x. For the position-shift-based RF model, the RF profiles are given by equations 2.3 and 2.4. Substituting them into equation A.5, we found that with the position-shift-based RF model, the simple cell response to an arbitrary stimulus of disparity D is given by:

(A.6)

We have used the following identity in deriving the above equation:

$$\int_{-\infty}^{+\infty} e^{-x^2/(2\sigma^2)}\,e^{i\omega x}\,dx = \sqrt{2\pi\sigma^2}\;e^{-\omega^2\sigma^2/2}. \tag{A.7}$$

Similarly, the simple cell response under the phase-parameter-based RF model can be derived as

(A.8)
Note that both equations A.6 and A.8 are dependent on the Fourier phase θ(ω) of the external stimulus. This means that with a fixed disparity D, any change to the external stimulus that results in variation of its Fourier phase will effectively alter the response of simple cells. We therefore conclude that simple cells do not have reliable disparity tuning. An approximate version of equation A.8 was derived in Qian (1994a). Using the definition of a quadrature pair in equations 2.6 to 2.9 and the above simple cell response expressions, we found that the output of a quadrature pair of simple cells with the position-shift-based RF model is given by
$$
\begin{aligned}
r_q ={}& \left[r_s(\phi_l, \phi_r)\right]^2 + \left[r_s(\phi_l + \pi/2,\ \phi_r + \pi/2)\right]^2 \\
={}& \frac{\pi\sigma^2}{8}\int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} d\omega\, d\omega'\; p(\omega)\,p(\omega')\, e^{-(\omega-\omega_0)^2\sigma^2/2}\, e^{-(\omega'-\omega_0)^2\sigma^2/2} \\
& \times \cos\!\left[\theta(\omega') - \theta(\omega) + \frac{(\omega'-\omega)(D-d)}{2}\right] \cos\!\left[\frac{\omega(D-d)}{2}\right] \cos\!\left[\frac{\omega'(D-d)}{2}\right]
\end{aligned}
\tag{A.9}
$$
where we have converted the squares of the integrals into double integrals. Similarly, the output of a quadrature pair of simple cells with the phase-parameter-based RF model is given by
$$
\begin{aligned}
r_q ={}& \left[r_s(\phi_l, \phi_r)\right]^2 + \left[r_s(\phi_l + \pi/2,\ \phi_r + \pi/2)\right]^2 \\
={}& \frac{\pi\sigma^2}{8}\int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} d\omega\, d\omega'\; p(\omega)\,p(\omega')\, e^{-(\omega-\omega_0)^2\sigma^2/2}\, e^{-(\omega'-\omega_0)^2\sigma^2/2} \\
& \times \cos\!\left[\theta(\omega') - \theta(\omega) + \frac{(\omega'-\omega)D}{2}\right] \cos\!\left[\frac{\Delta\phi-\omega D}{2}\right] \cos\!\left[\frac{\Delta\phi-\omega' D}{2}\right]
\end{aligned}
\tag{A.10}
$$
According to these expressions, the responses of a quadrature pair differ from those of the simple cells in that they depend on the difference of the Fourier phases of the input stimulus measured at two different frequencies (θ(ω') − θ(ω)). Both integrands contain two gaussian factors that are significantly large only when both ω and ω' are approximately equal to ω₀. This effectively makes ω' − ω very small. It also makes θ(ω') − θ(ω) close to zero for the stimuli whose Fourier phases are smooth functions of frequency (such as lines, bars, or gratings). We can therefore neglect the θ dependence in the above two equations for these stimuli by assuming

$$\cos\!\left[\theta(\omega')-\theta(\omega)+\frac{(\omega'-\omega)(D-d)}{2}\right] \approx \cos\!\left[\frac{(\omega'-\omega)(D-d)}{2}\right] \tag{A.11}$$

$$\cos\!\left[\theta(\omega')-\theta(\omega)+\frac{(\omega'-\omega)D}{2}\right] \approx \cos\!\left[\frac{(\omega'-\omega)D}{2}\right] \tag{A.12}$$
However, θ(ω) is not a smooth function of ω for stimuli like the spatial noise patterns, and this is when the final pooling step for computing complex cell responses (see equation 2.10) becomes important. In this pooling step the responses of many quadrature pairs with nearby RFs (and with otherwise identical parameters) are averaged. The response expressions (equation A.9 or A.10) for the different quadrature pairs are identical except for the θ(ω) functions, which are different for different pairs because they are centered on somewhat different parts of the noise stimuli. Therefore, the pooling step simply averages over the θ-dependent cosine terms in equation A.9 or A.10 and makes them approximately constant (as long as the stimulus patches covered by the pooled quadrature pairs contain many independent θ's). The approximations in equations A.11 and A.12 are thus also valid for the noise stimuli after the pooling. Using equations A.11 and A.12, we can now reduce equations A.9 and A.10 to:
$$r_q \approx \frac{\pi\sigma^2}{8}\int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} d\omega\, d\omega'\; p(\omega)\,p(\omega')\, e^{-(\omega-\omega_0)^2\sigma^2/2}\, e^{-(\omega'-\omega_0)^2\sigma^2/2}\, \cos\!\left[\frac{(\omega'-\omega)(D-d)}{2}\right] \cos\!\left[\frac{\omega(D-d)}{2}\right] \cos\!\left[\frac{\omega'(D-d)}{2}\right] \tag{A.13}$$

$$r_q \approx \frac{\pi\sigma^2}{8}\int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} d\omega\, d\omega'\; p(\omega)\,p(\omega')\, e^{-(\omega-\omega_0)^2\sigma^2/2}\, e^{-(\omega'-\omega_0)^2\sigma^2/2}\, \cos\!\left[\frac{(\omega'-\omega)D}{2}\right] \cos\!\left[\frac{\Delta\phi-\omega D}{2}\right] \cos\!\left[\frac{\Delta\phi-\omega' D}{2}\right] \tag{A.14}$$
Equations A.13 and A.14 are the complex cell responses to a stimulus with disparity D, for the position-shift- and the phase-parameter-based RF models, respectively. Unlike the simple cell response functions in equations A.6 and A.8, both of these complex cell response functions are independent of the stimulus Fourier phase. Complex cells should therefore have reliable disparity tuning curves. This conclusion is true for both the position-shift- and the phase-parameter-based RF models. To investigate CDs of complex cells, we need to derive their disparity tuning curves for spatial noise patterns and sine wave gratings. A noise pattern has a broad Fourier spectrum, and its Fourier amplitude p(ω) is a constant p independent of ω. On the other hand, the Fourier transform of a sine wave grating contains only two frequency components. For a grating with frequency Ω, its transform is given by

$$\mathcal{F}(\sin(\Omega x)) = \frac{1}{2}\left(\delta(\Omega + \omega) - \delta(\Omega - \omega)\right) \tag{A.15}$$

where δ() is the Dirac δ-function and is nonzero only when its argument is zero. Using these properties in conjunction with equations A.13 and A.14,
it is easy to derive equations 2.11 and 2.12 for the position-shift-based RF model and equations 2.13 and 2.15 for the phase-parameter-based RF model. It should be pointed out that our analyses include the approximations of equations A.11 and A.12. Their validity has been confirmed by our computer simulations. Actually, these two approximations are not necessary for deriving tuning curves to sinusoidal gratings. The special property of these stimuli, shown in equation A.15, makes it possible to derive their tuning curves (equations 2.12 and 2.15) exactly. This explains why the grating disparity tuning curves predicted by our analyses are almost indistinguishable from our simulation results.
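The exactness for gratings can also be checked numerically: in an energy model with phase-parameter-based Gabor RFs, the tuning to a grating of frequency Ω should peak at Δφ/Ω within one spatial period, independent of the grating's spatial phase. A deterministic sketch of our own with illustrative parameters (σ = 1 degree, ω₀ = π rad/deg, Δφ = π/2):

```python
import numpy as np

dx = 0.01
x = np.arange(-6.0, 6.0 + dx, dx)          # RF support (degrees)
sigma, w0, dphi = 1.0, np.pi, np.pi / 2
env = np.exp(-x**2 / (2 * sigma**2))
fL1, fL2 = env * np.cos(w0 * x), env * np.sin(w0 * x)
fR1, fR2 = env * np.cos(w0 * x + dphi), env * np.sin(w0 * x + dphi)

def peak_disparity(Omega, theta):
    """Peak of the energy-model tuning to sin(Omega*x + theta),
    searched within one spatial period of the grating."""
    disp = np.arange(-np.pi / Omega + dx, np.pi / Omega, dx)
    E = []
    for D in disp:
        IL = np.sin(Omega * x + theta)
        IR = np.sin(Omega * (x + D) + theta)      # right image I(x + D)
        r1 = np.dot(IL, fL1) + np.dot(IR, fR1)
        r2 = np.dot(IL, fL2) + np.dot(IR, fR2)
        E.append(r1**2 + r2**2)                   # quadrature-pair energy
    return disp[int(np.argmax(E))]

p1 = peak_disparity(0.8 * np.pi, theta=0.3)   # expect dphi/Omega = 0.625
p2 = peak_disparity(0.8 * np.pi, theta=2.0)   # same peak, different phase
p3 = peak_disparity(1.25 * np.pi, theta=1.1)  # expect dphi/Omega = 0.4
```

The recovered peaks sit at Δφ/Ω regardless of the grating phase θ, which is the discrete counterpart of the exact grating tuning curves discussed above.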
Acknowledgments
We thank Terry Sejnowski and Alex Pouget for helpful discussions. We are also grateful to the anonymous reviewers for their comments. The work is supported by a research grant from the McDonnell-Pew Program in Cognitive Neuroscience and NIH grant MH54125, both to N.Q.
References
Adelson, E. H., and Bergen, J. R. 1985. Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A2(2), 284-299. Albrecht, D. G., and Geisler, W. S. 1991. Motion sensitivity and the contrast-response function of simple cells in the visual cortex. Visual Neurosci. 7, 531-546. Bishop, P. O., Henry, G. H., and Smith, C. J. 1971. Binocular interaction fields of single units in the cat striate cortex. J. Physiol. 216, 39-68. Bishop, P. O., and Pettigrew, J. D. 1986. Neural mechanisms of binocular vision. Vision Res. 26, 1587-1600. DeAngelis, G. C., Ohzawa, I., and Freeman, R. D. 1991. Depth is encoded in the visual cortex by a specialized receptive field structure. Nature 352, 156-159. Ferster, D. 1981. A comparison of binocular depth mechanisms in areas 17 and 18 of the cat visual cortex. J. Physiol. 311, 623-655. Fleet, D., Heeger, D., and Wagner, H. 1995. Computational model of binocular disparity. Invest. Ophthalmol. Vis. Sci. Suppl. (ARVO) 36(4), 365. Freeman, R. D., and Ohzawa, I. 1990. On the neurophysiological organization of binocular vision. Vision Res. 30, 1661-1676. Grzywacz, N. M., and Yuille, A. L. 1990. A model for the estimate of local image velocity by cells in the visual cortex. Proc. R. Soc. Lond. A239, 129-161. Heeger, D. J. 1992. Normalization of cell responses in cat striate cortex. Visual Neurosci. 9, 181-197. Hubel, D. H., and Wiesel, T. 1962. Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex. J. Physiol. 160, 106-154. Jones, J. P., and Palmer, L. A. 1987. The two-dimensional spatial structure of simple receptive fields in the cat striate cortex. J. Neurophysiol. 58, 1187-1211.
Marr, D., and Poggio, T. 1979. A computational theory of human stereo vision. Proc. R. Soc. Lond. B204, 301-328. Maske, R., Yamane, S., and Bishop, P. O. 1984. Binocular simple cells for local stereopsis: Comparison of receptive field organizations for the two eyes. Vision Res. 24, 1921-1929. Ohzawa, I., DeAngelis, G. C., and Freeman, R. D. 1990. Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science 249, 1037-1041. Pettigrew, J. D. 1993. Two ears and two eyes. Nature 364, 756-757. Poggio, G. F., and Fischer, B. 1977. Binocular interaction and depth sensitivity in striate and prestriate cortex of behaving rhesus monkey. J. Neurophysiol. 40, 1392-1405. Poggio, G. F., and Poggio, T. 1984. The analysis of stereopsis. Ann. Rev. Neurosci. 7, 379-412. Qian, N. 1994a. Computing stereo disparity and motion with known binocular cell properties. Neural Comp. 6, 390-404. Qian, N. 1994b. Stereo model based on phase parameters can explain characteristic disparity. Soc. Neurosci. Abs. 20, 624. Qian, N., and Andersen, R. A. 1996. A physiological model for motion-stereo integration and a unified explanation of the Pulfrich-like phenomena. Vision Res. (In press). Qian, N., and Zhu, Y. 1995. Physiological computation of binocular disparity. Soc. Neurosci. Abs. 21, 1507. Schiller, P. H., Finlay, B. L., and Volman, S. F. 1976. Quantitative studies of single-cell properties in monkey striate cortex: I. Spatiotemporal organization of receptive fields. J. Neurophysiol. 39, 1288-1319. Shapley, R., and Lennie, P. 1985. Spatial frequency analysis in the visual system. Ann. Rev. Neurosci. 8, 547-583. Wagner, H., and Frost, B. 1993. Disparity-sensitive cells in the owl have a characteristic disparity. Nature 364, 796-798. Wagner, H., and Frost, B. 1994. Binocular responses of neurons in the barn owl's visual Wulst. J. Comp. Physiol. A174, 661-670. Watson, A. B., and Ahumada, A. J. 1985. Model of human visual-motion sensing. J. Opt. Soc. Am. A2, 322-342.
Received December 14, 1995; accepted March 27, 1996.
Communicated by Alain Destexhe
Response Characteristics of a Low-Dimensional Model Neuron Bo Cartling Department of Theoretical Physics, The Royal Institute of Technology, S-100 44 Stockholm, Sweden
It is shown that a low-dimensional model neuron with a response time constant smaller than the membrane time constant closely reproduces the activity and excitability behavior of a detailed conductance-based model of Hodgkin-Huxley type. The fast response of the activity variable also makes it possible to reduce the model to a one-dimensional model, in particular for typical conditions. As an example, the reduction to a single-variable model from a multivariable conductance-based model of a neocortical pyramidal cell with somatic input is demonstrated. The conditions for avoiding a spurious damped oscillatory response to a constant input are derived, and it is shown that a limit-cycle response cannot occur. The capability of the low-dimensional model to approximate higher-dimensional models accurately makes it useful for describing complex dynamics of nets of interconnected neurons. The simplicity of the model facilitates analytic studies, elucidation of neurocomputational mechanisms, and applications to large-scale systems. 1 Introduction
A low-dimensional model neuron was recently (Cartling 1995b) derived as intermediate in complexity between the most abstract models of Hopfield type (Hopfield 1982, 1984; Hertz et al. 1991) and the most detailed conductance-based models of Hodgkin-Huxley type (Hodgkin and Huxley 1952; for a recent example see Ekeberg et al. 1991; and Fransén and Lansner 1995). There have been several formulations of lower-dimensional systems of equations to describe the development of action potentials (FitzHugh 1961; Nagumo et al. 1962; Hindmarsh and Rose 1984; Rose and Hindmarsh 1989; Abbott and Kepler 1990; Kepler et al. 1992; Doya and Selverston 1994). Different approaches to reduce detailed to firing-rate models have been discussed in general (Abbott 1990) and for nets of neurons interconnected via slow synapses (Rinzel and Frankel 1992; Ermentrout 1994). The relation of previous approaches to the present model is discussed in Section 4. The reduced model in Cartling (1995b) is based on the averaging of rapidly varying variables Neural Computation 8, 1643-1652 (1996) © 1996 Massachusetts Institute of Technology
over an action potential, resulting in a description in terms of only two variables. One is the activity, that is, the firing rate, and the other corresponds to the excitability. The central role of intracellular calcium concentration in regulating the firing rate (Hille 1992) makes it a useful excitability variable. Intracellular calcium affects the conductance of calcium-sensitive potassium channels, which control the slow afterhyperpolarization of action potentials and thereby the firing rate. The strength of the coupling between neuronal activity and excitability, that is, the neuronal adaptability, may serve as a dynamics control parameter for nets of interconnected neurons (Cartling 1993, 1994, 1995a, 1996). Complex limit cycles or chaotic behavior may result at strong adaptability, and simpler limit-cycle and fixed-point attractors at intermediate and low adaptability, respectively. In the brain, various neuromodulators can regulate the adaptability. In this work, the response characteristics of the reduced model neuron are investigated.

2 Neuronal Response Characteristics
The abstract model neuron is defined in terms of two variables corresponding to activity and excitability. The activity variable is given by the firing rate of action potentials, and intracellular calcium concentration is selected as an excitability variable (Cartling 1995b). This low-dimensional model is described by the following system of equations:

df(t)/dt = [g(i(t), c(t)) - f(t)]/T   (2.1)

dc(t)/dt = q(c(t)) f(t) - c(t)/T_c   (2.2)

where f(t) and c(t) denote firing rate and intracellular calcium concentration, respectively, and i(t) is an input current. g(i(t), c(t)) is a generalized neuronal activation function, defined as the steady-state activity at a given input and intracellular calcium concentration, and T is a response time constant by which a steady state is reached. q(c(t)) is the small increase of intracellular calcium concentration due to the inward calcium current during one action potential, and T_c is the time constant describing the restoration of the resting-state concentration of intracellular calcium by ionic pumps, internal buffering, and diffusion. This resting-state concentration is very low (Yamada et al. 1989) and can be ignored. In Cartling (1995b), a generalized neuronal activation function was derived from a multicompartment conductance-based model with ionic currents described by Hodgkin-Huxley-type equations and parameter values recently determined for a neocortical pyramidal cell of the regularly spiking type (Fransén and Lansner 1995). It is of a dynamical threshold form, with the dynamics of the threshold given by the intracellular
calcium dynamics. In the present work we select the approximate analytical expression

g(i(t), c(t)) = g_0 (i(t) - σ c(t) - ε)^p   (2.3)

for i(t) - σ c(t) - ε ≥ 0 and g(i(t), c(t)) = 0 elsewhere. σ and ε are adaptability and threshold parameters, respectively. The activation function g(i(t), c(t)) numerically obtained for the neocortical pyramidal cell (Cartling 1995b) is well fit using the parameter values g_0 = 254.7, σ = 1.026, ε = 0.12, and p = 0.3, where g is measured in s^-1, i in nA, and c in arbitrary units. By determining q(c(t)) from the steady-state solutions of the full conductance-based model for a range of input currents, the simple approximation
is obtained with parameter values a = 0.11 and k = 0.9. Employing these approximations, the response characteristics of the reduced and conductance-based models are compared in Figure 1 for different types of input current and different response time constants of the reduced model. The stepwise changes of input current simulate current injection experiments, and the smoothly varying input has a frequency of 5 Hz, which is typical of theta oscillations in the brain. In the conductance-based model, input current is applied to the soma compartment as in experimental situations. The firing rate of the conductance-based model is defined at the midpoints of interspike intervals as the inverse of the corresponding interspike intervals. The response time of the conductance-based model is typically, that is, in other than completely resting states, shorter than the membrane time constant τ_m = c_m/g_l, where c_m is the membrane capacitance and g_l is the leak conductance. By definition τ_m is constant, and in the conductance-based model of a neocortical pyramidal cell (Fransén and Lansner 1995), its value is 23 ms. One reason that the response time constant is smaller than the membrane time constant is that it reflects the total conductance rather than the leak conductance, and thus is smaller in an active than in a resting state due to increased membrane conductance when many ion channels are open. Another reason is that even at a low firing rate, the membrane potential in between spikes is much closer to the threshold for firing than it is in the completely resting state. The results in Figure 1 demonstrate that it is possible to reproduce the activity and excitability behavior of the detailed conductance-based model by the reduced model with a response time that is shorter than the membrane time constant. This is particularly true in not completely resting states and for smoothly changing input, that is, for typical neurophysiological conditions.
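The behavior just described can be sketched numerically by direct forward-Euler integration of equations 2.1 and 2.2. The activation-function parameters below are those fitted in the text; the functional form q(c) = a(k - c) (which satisfies dq/dc < 0) and the calcium time constant T_C are illustrative assumptions, since only the fitted constants a and k, not the functional form of q or the value of T_c, are reproduced above.

```python
# Activation-function parameters fitted to the neocortical pyramidal
# cell in the text (g in s^-1, i in nA, c in arbitrary units).
G0, SIGMA, EPSILON, P = 254.7, 1.026, 0.12, 0.3
A, K = 0.11, 0.9   # constants of q(c); the linear form q(c) = A*(K - c) is assumed
T_C = 0.1          # calcium restoration time constant in seconds (assumed)

def g(i, c):
    """Generalized activation function of equation 2.3 (dynamical threshold)."""
    x = i - SIGMA * c - EPSILON
    return G0 * x ** P if x > 0.0 else 0.0

def q(c):
    """Assumed calcium increment per action potential, with dq/dc < 0."""
    return A * (K - c)

def simulate(i_of_t, T=0.005, dt=1e-4, t_end=1.0):
    """Forward-Euler integration of equations 2.1 and 2.2 (times in s)."""
    f, c, traj = 0.0, 0.0, []
    for step in range(int(t_end / dt)):
        i = i_of_t(step * dt)
        if T > 0.0:
            f += dt * (g(i, c) - f) / T   # eq. 2.1: activity relaxes toward g
        else:
            f = g(i, c)                   # T = 0: the one-dimensional limit
        c += dt * (q(c) * f - c / T_C)    # eq. 2.2: calcium buildup and decay
        traj.append((f, c))
    return traj

traj = simulate(lambda t: 0.5)            # constant 0.5-nA current step
f_final, c_final = traj[-1]
f_peak = max(f for f, _ in traj)
```

Under these assumed calcium parameters the firing rate rises quickly, peaks, and then adapts to a lower steady-state value as c accumulates and raises the dynamical threshold, qualitatively reproducing the spike-frequency adaptation seen in Figure 1.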
Except for the details of transient behavior during a time on the order of an interspike interval, the reduced model works well also for an initially resting state and for
rapidly changing input. The reduced model is thus generally applicable at the time resolution for which it is formulated. The limit T = 0 corresponds to the activity instantaneously taking the value of the activation function, which means that the reduced model becomes one-dimensional, as described by equation 2.2 and f(t) = g(i(t), c(t)). The good agreement between the two models in both Figures 1a and 1d thus indicates that the multivariable conductance-based model can be reduced even to a single-variable model. In a model net of interconnected neurons, the input governed by synaptic transmission and dendritic propagation can be included as a second variable for a neuron.

3 Phase-Plane Analysis
A transient oscillatory response to a constant input is seen in Figure 1c, and we investigate the conditions for its appearance. A stability analysis of the reduced model for a constant input i(t) = i_0 is based on a linearization around a fixed-point solution for which f(t) = f_0 and c(t) = c_0. Introducing f'(t) = f(t) - f_0 and c'(t) = c(t) - c_0, the linearized equations are

df'(t)/dt = -(1/T) f'(t) + (1/T) [∂g(i_0, c_0)/∂c] c'(t)   (3.1)

dc'(t)/dt = q(c_0) f'(t) + [f_0 dq(c_0)/dc - 1/T_c] c'(t)   (3.2)

with eigenvalues

λ± = -(α + β)/2 ± [(α - β)²/4 - γ]^{1/2}   (3.3)

where α = 1/T, β = 1/T_c - f_0 dq(c_0)/dc, and γ = -(1/T)[∂g(i_0, c_0)/∂c] q(c_0) are all positive, from which it follows that fixed points can be only stable foci or nodes; that is, oscillatory solutions for γ > (α - β)²/4 are always damped because α + β > 0, and nonoscillatory solutions for γ ≤ (α - β)²/4 can only decay since (α + β)/2 > [(α - β)²/4 - γ]^{1/2}. From equation 3.3, the upper limit T_osc of the response time T of the reduced model, for which a transient oscillatory solution cannot appear, is obtained as

T_osc = [β + 2φ + 2(φ² + βφ)^{1/2}]^{-1}   (3.4)

where φ = γT = -[∂g(i_0, c_0)/∂c] q(c_0). It varies between T_osc = 6 ms for i_0 = 1.0 nA and T_osc = 5 μs for i_0 = 0.2 nA. To prove that a limit cycle cannot exist also beyond the linear regime, Bendixson's negative criterion is employed. Consider a closed trajectory
Figure 1: The response characteristics of the reduced and conductance-based models of a neocortical pyramidal cell with externally applied current as shown in the lower panels of (a)-(d). The upper panels show the firing rate as a function of time by solid curves for the reduced model and by small circles for the conductance-based model. The circles are located at the midpoints of interspike intervals and indicate the inverse of the corresponding interspike intervals. The middle panels show the intracellular calcium concentration as a function of time by dashed curves for the reduced model and by solid curves for the conductance-based model. The response time constant T of the reduced model is (a) 0, (b) 1, (c) 5, and (d) 0 ms. Units are s^-1 for f, nA for i, ms for t, and arbitrary for c.
in the (f, c) plane. The vector field v(f, c) = (F(f, c), G(f, c)), where F(f, c) and G(f, c) are defined by

df(t)/dt = F(f(t), c(t)),  dc(t)/dt = G(f(t), c(t)),   (3.5)

is parallel with the tangent of the trajectory and thus orthogonal to the normal vector n, so that the line integral of v · n along the trajectory vanishes:

∮ v · n dl = 0.   (3.6)

Gauss' theorem in the plane then yields

∮ v · n dl = ∫_Ω [∂F(f, c)/∂f + ∂G(f, c)/∂c] df dc,   (3.7)

where Ω is the region enclosed by the trajectory. Since ∂F(f, c)/∂f = -1/T and ∂G(f, c)/∂c = [dq(c)/dc] f - 1/T_c, with dq(c)/dc < 0 and f ≥ 0, it follows that ∂F(f, c)/∂f + ∂G(f, c)/∂c < 0, and equation 3.7 thus implies that a closed trajectory cannot exist. Gauss' theorem in the plane and ∂F/∂f + ∂G/∂c < 0 also show that the system is everywhere convergent. The analysis above is illustrated by the (f, c) phase-plane diagrams in Figure 2. The nullclines df(t)/dt = 0 and dc(t)/dt = 0 according to equations 2.1 and 2.2 are drawn for different constant values of the input current i(t) in Figures 2a and 2b. The nullclines intersect at the steady-state solution for a given input current and divide the phase plane into four regions characterized by distinct pairs of signs of df(t)/dt and dc(t)/dt. For a zero input current, the steady-state solution is located at (0, 0). Figure 2a also depicts the trajectories corresponding to the first part of Figures 1a-c, where a constant input current of 0.5 nA is applied in an initially resting state. The trajectory for the response time constant T = 0 starts at (0, 0), follows the f-axis to the nullcline df(t)/dt = 0, which it then follows to the steady-state point. At increased T the trajectory successively departs from that for T = 0, and the small damped oscillation for T = 5 ms can also be seen. Figure 2b analogously corresponds to the last part of Figures 1a-c, that is, to a constant input current of 0.8 nA starting from the steady-state solution at i(t) = 0.5 nA. These diagrams visualize the reduction to a one-dimensional model at low values of T as the approach of the trajectory to the nullcline df(t)/dt = 0.

4 Discussion
The present analysis of the response characteristics of a low-dimensional model neuron demonstrates that the model may be a useful approximation of detailed conductance-based models. The reduction of dimension is based on a separation of fast and slow dynamical variables, a procedure of general applicability to dynamical systems described by coupled differential equations (Haken 1983). In a first step, dynamical variables fluctuating rapidly over an action potential are averaged, and only variables
Figure 2: Phase-plane diagrams of the reduced model for constant input currents. The nullclines df(t)/dt = 0 and dc(t)/dt = 0 according to equations 2.1 and 2.2 are drawn as solid and dashed lines, respectively, and their intersection represents a steady-state solution. Trajectories describing the evolution upon application of a constant input current are shown using long-dashed, dot-dashed, and dotted lines for the response time constant T = 0, 1, and 5 ms, respectively. (a) An input current of 0.5 nA and an initially resting state represented by (0, 0). This corresponds to the first part of Figures 1a-c. The trajectory for T = 0 starts at (0, 0), follows the f-axis to the nullcline df(t)/dt = 0, which it then follows to the steady-state point. (b) An input current of 0.8 nA and an initial state given by the steady state in (a), which is marked by a small circle. This corresponds to the last part of Figures 1a-c. Units are s^-1 for f and arbitrary for c.
changing on time scales of interspike intervals and beyond are retained explicitly. Activity and excitability are identified as the most significant variables of the latter type. Among these, activity is the faster one, and excitability can serve as an order parameter in an ultimate reduction to a one-dimensional model. Reduced model neurons go back to the FitzHugh-Nagumo model (FitzHugh 1961; Nagumo et al. 1962), which is phenomenologically formulated to display the basic action potential characteristics of the Hodgkin-Huxley model rather than to describe neurophysiological mechanisms. In developments of this model, a slow adaptation variable is introduced in Hindmarsh and Rose (1984) and replaced by a slow voltage-dependent outward current in Rose and Hindmarsh (1989), but the nature of the latter current is not identified. In Abbott and Kepler (1990) and Kepler et al. (1992), systematic reduction schemes are derived in terms of auxiliary and equivalent potentials, respectively, for systems described by voltage-dependent gating variables. In Doya and Selverston (1994), an artificial neural network is employed for the reduction of a
six-dimensional bursting neuron model, and the slow variable is found to correlate with intracellular calcium concentration. Among reduced models formulated in terms of firing rate, a modified FitzHugh-Nagumo model is adopted for that type of description in Abbott (1990), and a slow component is extracted from the membrane current. For nets of interconnected neurons with slow synaptic conductances, the latter have been selected as slow variables (Rinzel and Frankel 1992; Ermentrout 1994). In the present firing-rate model, excitability is explicitly selected as the slow variable, and a generalized neuronal activation function is used for the faster activity variable. The representation of excitability by intracellular calcium concentration also has the advantage of relating the description to quantities that are directly accessible experimentally. In models of nets of interconnected neurons, the low dimension of the model neuron becomes particularly important. The single variable in the one-dimensional model neuron can be complemented by an input variable governed by synaptic transmission and dendritic propagation, an approach employed in Cartling (1996). Viewed as an extension of the most abstract models of Hopfield type, the present approach allows a description of more of the rich dynamics of nets of interconnected neurons. This is primarily due to the inclusion of the firing-rate adaptation into the neuronal properties (Cartling 1993, 1994, 1995a, 1996). In relation to conductance-based models of Hodgkin-Huxley type, the reduction of the number of dynamical variables is considerable. For the model of a neocortical pyramidal cell studied in this work, it is a reduction from fourteen dynamical variables to one. Such a strong reduction of the number of degrees of freedom facilitates analytic studies, insight into neurocomputational mechanisms, and applicability to large-scale systems.
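As one illustration of the analytic and numerical tractability gained by the reduction, the fixed point of equations 2.1 and 2.2 for a constant input can be located by relaxation and its linear stability checked from the Jacobian, reproducing the focus/node distinction of Section 3. The activation-function parameters are those fitted in the text; the calcium-increment form q(c) = a(k - c) and the calcium time constant are illustrative assumptions.

```python
import numpy as np

# Fitted activation-function parameters; q(c) = A*(K - c) with dq/dc < 0
# and the calcium time constant T_C are assumed stand-ins.
G0, SIGMA, EPSILON, P = 254.7, 1.026, 0.12, 0.3
A, K, T_C = 0.11, 0.9, 0.1

def rhs(f, c, i, T):
    """Right-hand sides F and G of equations 2.1 and 2.2 (times in s)."""
    x = i - SIGMA * c - EPSILON
    g = G0 * max(x, 0.0) ** P
    return (g - f) / T, A * (K - c) * f - c / T_C

def fixed_point_eigs(i, T, dt=1e-4, t_end=2.0, h=1e-6):
    """Relax to the fixed point; return Jacobian eigenvalues and trace."""
    f = c = 0.0
    for _ in range(int(t_end / dt)):          # forward-Euler relaxation
        df, dc = rhs(f, c, i, T)
        f, c = f + dt * df, c + dt * dc
    F0, Q0 = rhs(f, c, i, T)                  # finite-difference Jacobian
    Ff, Qf = rhs(f + h, c, i, T)
    Fc, Qc = rhs(f, c + h, i, T)
    J = np.array([[(Ff - F0) / h, (Fc - F0) / h],
                  [(Qf - Q0) / h, (Qc - Q0) / h]])
    return np.linalg.eigvals(J), np.trace(J)

eigs_slow, tr_slow = fixed_point_eigs(i=0.5, T=0.005)  # T = 5 ms, cf. Fig. 1c
eigs_fast, tr_fast = fixed_point_eigs(i=0.5, T=0.001)  # T = 1 ms, cf. Fig. 1b
```

Both traces, which equal the divergence ∂F/∂f + ∂G/∂c at the fixed point, are negative, consistent with the Bendixson argument, and the eigenvalues come out complex with negative real parts at T = 5 ms (stable focus, damped oscillation) but real and negative at T = 1 ms (stable node).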
The simplicity of the model, in combination with its capability to approximate higher-dimensional models accurately, makes it a promising tool for the exploration of complex dynamics of neural systems. A simple model neuron capable of generating complex network dynamics may also be useful for neural computations in practical applications of artificial neural networks.

Acknowledgments
This work was supported by the Swedish Natural Science Council.

References

Abbott, L. F. 1990. A network of oscillators. J. Phys. A: Math. Gen. 23, 3835-3859.
Abbott, L. F., and Kepler, T. B. 1990. Model neurons: From Hodgkin-Huxley to Hopfield. In Statistical Mechanics of Neural Networks, L. Garrido, ed., pp. 5-18. Springer-Verlag, Barcelona.
Cartling, B. 1993. Control of the complexity of associative memory dynamics by neuronal adaptation. Int. J. Neural Syst. 4, 129-141.
Cartling, B. 1994. Generation of associative processes in a neural network with realistic features of architecture and units. Int. J. Neural Syst. 5, 181-194.
Cartling, B. 1995a. Autonomous neuromodulatory control of associative processes. Network 6, 247-260.
Cartling, B. 1995b. A generalized neuronal activation function derived from ion channel characteristics. Network 6, 389-401.
Cartling, B. 1996. Dynamics control of semantic processes in a hierarchical associative memory. Biol. Cybern. 74, 63-71.
Doya, K., and Selverston, A. I. 1994. Dimension reduction of biological neuron models by artificial neural networks. Neural Comp. 6, 696-717.
Ekeberg, O., Wallen, P., Lansner, A., Travén, H., Brodin, L., and Grillner, S. 1991. A computer based model for realistic simulations of neural networks I: The single neuron and synaptic interaction. Biol. Cybern. 65, 81-90.
Ermentrout, B. 1994. Reduction of conductance-based models with slow synapses to neural nets. Neural Comp. 6, 679-695.
FitzHugh, R. 1961. Impulses and physiological states in theoretical models of nerve membrane. Biophys. J. 1, 445-466.
Fransén, E., and Lansner, A. 1995. Low spiking rates in a population of mutually exciting pyramidal cells. Network 6, 271-288.
Haken, H. 1983. Synergetics: An Introduction. Nonequilibrium Phase Transitions and Self-Organization in Physics, Chemistry, and Biology. Springer-Verlag, Berlin.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Hille, B. 1992. Ionic Channels of Excitable Membranes. Sinauer, Sunderland, MA.
Hindmarsh, J. L., and Rose, R. M. 1984. A model of neuronal bursting using three coupled first order differential equations. Proc. R. Soc. Lond. B 221, 87-102.
Hodgkin, A. L., and Huxley, A. F. 1952.
A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. 117, 500-544.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 79, 2554-2558.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA 81, 3088-3092.
Kepler, T. B., Abbott, L. F., and Marder, E. 1992. Reduction of conductance-based neuron models. Biol. Cybern. 66, 381-387.
Nagumo, J., Arimoto, S., and Yoshizawa, S. 1962. An active pulse transmission line simulating nerve axon. Proc. IRE 50, 2061-2070.
Rinzel, J., and Frankel, P. 1992. Activity patterns of a slow synapse network predicted by explicitly averaging spike dynamics. Neural Comp. 4, 534-545.
Rose, R. M., and Hindmarsh, J. L. 1989. The assembly of ionic currents in a thalamic neuron. I. The three-dimensional model. II. The stability and state diagrams. III. The seven-dimensional model. Proc. R. Soc. Lond. B 237, 267-334.
Yamada, W. M., Koch, C., and Adams, P. R. 1989. Multiple channels and calcium dynamics. In Methods in Neuronal Modeling: From Synapses to Networks, C. Koch and I. Segev, eds., pp. 97-133. MIT Press, Cambridge, MA.
Received May 12, 1995; accepted February 20, 1996.
Communicated by Carl van Vreeswijk
What Matters in Neuronal Locking? Wulfram Gerstner Physik-Department der TU München, D-85747 Garching bei München, Germany
J. Leo van Hemmen Physik-Department der TU München, D-85747 Garching bei München, Germany* Department of Mathematics, University of Chicago, Chicago, IL 60637 USA
Jack D. Cowan Department of Mathematics, University of Chicago, Chicago, IL 60637 USA

Exploiting local stability, we show what neuronal characteristics are essential to ensure that coherent oscillations are asymptotically stable in a spatially homogeneous network of spiking neurons. Under standard conditions, a necessary and, in the limit of a large number of interacting neighbors, also sufficient condition is that the postsynaptic potential is increasing in time as the neurons fire. If the postsynaptic potential is decreasing, oscillations are bound to be unstable. This is a kind of locking theorem and boils down to a subtle interplay of axonal delays, postsynaptic potentials, and refractory behavior. The theorem also allows for mixtures of excitatory and inhibitory interactions. On the basis of the locking theorem, we present a simple geometric method to verify the existence and local stability of a coherent oscillation.

1 Introduction

Coherence may be defined as being "united in relationship"; for most vertebrate neurons this means a temporal relationship in that they fire in unison. As such, it is another way of saying that neurons get locked. Once the proposal appeared that coherent oscillations may exist in biological neural systems (Eckhorn et al. 1988; Gray and Singer 1989; Gray et al. 1989; Engel et al. 1991a, 1991b; Eckhorn et al. 1993; Gray 1994), locking phenomena attracted a considerable amount of interest and spurred quite a few people to explain or disprove the very existence of coherent oscillatory activity. Different authors have used differing models, which vary in several aspects, as do the assumptions and the results. Some models show perfect locking, others partial locking or no locking at all. Some use excitatory interactions, some exploit inhibitory ones, and others use a mixture.
In this paper, we present a unifying framework that allows one to derive exact conditions for the existence and stability of coherent *Permanent address. Neural Computation 8, 1653-1676 (1996) © 1996 Massachusetts Institute of Technology
solutions in a network of spiking neurons and to isolate the neuronal characteristics that are essential to them. The result is surprisingly simple: Perfect locking is possible only if firing occurs while the contribution evoked by incoming pulses (i.e., the postsynaptic potentials) is increasing in time. A more precise formulation is given in the next section, where we show how a subtle interplay of axonal delays, postsynaptic potentials, and refractory behavior can lead to coherence. This result can be applied to excitatory or inhibitory couplings or homogeneous mixtures thereof and solves the often-posed question of whether excitation or inhibition is "more suitable" to support collective oscillations (van Vreeswijk et al. 1994; Lytton and Sejnowski 1991). In fact, for spiking neurons, this kind of collective behavior seems to be generic. Furthermore, we present a purely geometric method to verify whether a coherent oscillation can exist and, if so, whether it is stable. In view of the truly extensive and diverse literature, we think a unifying framework meets an urgent need. In this paper, we concentrate on analytic results for model networks of spiking neurons (Mirollo and Strogatz 1990; Kuramoto 1991; Gerstner and van Hemmen 1992, 1993; Gerstner et al. 1993; Abbott and van Vreeswijk 1993; Bauer and Pawelzik 1993; Tsodyks et al. 1993; Treves 1993; Usher et al. 1993; van Vreeswijk et al. 1994; Gerstner 1995; Ernst et al. 1995; Hansel et al. 1995). We mostly focus on large networks, although our technique can also be applied to small sets of neurons such as central pattern generators (cf. Skinner et al. 1994). We neither consider phase models (Abbott 1990; Schuster and Wagner 1990a; Sompolinsky et al. 1990; Niebur et al. 1991; Golomb et al.
1992) nor analyze simulation studies (Buhmann 1989; Bush and Douglas 1991; Lytton and Sejnowski 1991; Schuster and Wagner 1990b; König and Schillen 1991; Schillen and König 1991; von der Malsburg and Buhmann 1992; Engel et al. 1992; Deppisch et al. 1993; Nischwitz and Glünder 1995; Ritz et al. 1994). Furthermore, we do not comment on the debate concerning the interpretation and potential relevance of coherent states since there are already many papers arguing the issue (Eckhorn et al. 1988; Gray et al. 1989; Engel et al. 1991a; Schuster and Wagner 1990b; König and Schillen 1991; von der Malsburg and Buhmann 1992; Ritz et al. 1994; cf. in particular von der Malsburg 1994; von der Malsburg and Schneider 1986; Singer 1994). In order to prove our locking result, we will use the framework of the spike response model (Gerstner 1991; Gerstner and van Hemmen 1992, 1993; Gerstner 1995; Kistler et al. 1996). In this model, the effects of spike emission and spike reception are described by two response kernels: η, to represent a spike and the resulting refractory behavior, and ε, to take into account the response of a neuron once a spike has arrived at a synapse on its dendritic tree. If a presynaptic neuron j fires at a time t_j^f, a response will be evoked at the soma of a postsynaptic neuron i, which we describe by J_ij ε(t - t_j^f). The synaptic weight J_ij is a measure of the amplitude of the response. Similarly, if the neuron i fires at a
time t_i^f, the repolarization after the pulse usually causes a sharp drop of the membrane potential. This effect is summarized by an additive contribution η(t - t_i^f) ≤ 0 to the membrane potential. Typical examples of ε and η can be found in Figures 1a and 1b, whereas a more elaborate structure is shown in Figures 1c and 1d. A neuron model is said to have a standard dynamics if dη/ds ≥ 0 for all s > 0. This includes integrate-and-fire, fast spiking, and adaptive neurons but excludes intrinsic bursters (cf. Connors and Gutnick 1990 for a classification of neuronal firing patterns). A neuron model with η(s) = ε(s) = 0 for s ≥ 2T will be called a model with short-term memory. Here T is the period of a network oscillation, to be studied below. For the sake of simplicity we will assume throughout this paper that the delay Δ_ij between neuron j and neuron i depends on neither i nor j. Hence Δ_ij = Δ, and the delay can be incorporated in the function ε. The total membrane potential at the soma of neuron i can then be written

h_i(t) = Σ_f η(t - t_i^f) + Σ_j J_ij Σ_f ε(t - t_j^f).

Due to causality, we have η(s) = 0 for s < 0 and ε(s) = 0 for s < Δ (cf. Fig. 1a-c). A neuron fires once its membrane potential h_i(t) reaches a threshold ϑ from below. This condition defines the firing times t_i^f and is at the basis of our formalism. For the moment we do not include noise so as to simplify the ensuing arguments even further. Before turning to the proof of our locking theorem in Section 4, we illustrate its potentialities by presenting a purely geometric method to construct and verify the stability of a coherent oscillation in Section 2. We indicate the relation between the present setup and the usual integrate-and-fire models in Section 3. With respect to locking, it hardly makes any difference whether one uses excitatory or inhibitory couplings. As we will show in Section 2, the geometric method makes such a statement obvious. In Section 5 we return to this fact, which at first sight is surprising, and summarize our findings.

2 Geometric Method
In Section 4 we will prove a locking theorem, which is instrumental to understanding neuronal coherence. In this section we take it as the starting point of a purely geometric method that allows one to construct and directly verify the stability of a coherent oscillation. Here is a theorem that relates neuronal characteristics to asymptotic stability, that is, when perturbations of a limit state decay to zero. Most of the time we will simply say that something is stable, meaning that it is asymptotically stable. Precise conditions and extensions will be spelled out in the next section.
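Before the geometric construction, the membrane potential h_i(t) of the spike response model can be evaluated directly for given spike trains. In the following sketch the response kernel ε is the one plotted in Figure 1b; the exponential refractory kernel η and the numerical values of n, J_0, and ϑ are illustrative assumptions.

```python
import numpy as np

# Kernel constants from Figure 1 (times in ms); the exponential form of
# the refractory kernel and the values of ETA0, n, J0, THETA are assumed.
TAU_M, TAU_S, DELTA = 10.0, 4.0, 2.0
ETA0, TAU_R = 1.0, 20.0
THETA = 1.0

def eps(s):
    """Postsynaptic response kernel of Figure 1b; zero before the delay."""
    u = np.asarray(s, dtype=float) - DELTA
    return np.where(u > 0.0,
                    np.exp(-u / TAU_M) * (1.0 - np.exp(-u / TAU_S)), 0.0)

def eta(s):
    """Refractory kernel: hyperpolarization after an own spike (assumed form)."""
    s = np.asarray(s, dtype=float)
    return np.where(s > 0.0, -ETA0 * np.exp(-s / TAU_R), 0.0)

def membrane_potential(t, own_spikes, presyn_spikes, weights):
    """h_i(t) = sum_f eta(t - t_i^f) + sum_j J_ij sum_f eps(t - t_j^f)."""
    h = sum(eta(t - tf) for tf in own_spikes)
    for J, spikes in zip(weights, presyn_spikes):
        h = h + J * sum(eps(t - tf) for tf in spikes)
    return h

# One own spike at t = 0 and n presynaptic spikes at t = 0, weights J0/n.
n, J0 = 100, 5.0
t = np.linspace(0.0, 30.0, 3001)
h = membrane_potential(t, [0.0], [[0.0]] * n, [J0 / n] * n)
t_fire = t[int(np.argmax(h >= THETA))]   # first threshold crossing from below
```

The potential starts below rest because of the refractory term, is pulled up by the delayed postsynaptic responses, and crosses ϑ while the summed postsynaptic potential is still rising.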
Figure 1: Typical response kernels. (a) Refractory kernel η. The spike generated at time t_i^f is indicated by the arrow. After the spike, there is a period of hyperpolarization that decays over 20 ms. (b) Response kernel ε. The graph with s = t - t_j^f exhibits the typical time course of an excitatory postsynaptic potential that is evoked with a delay Δ = 2 ms after a presynaptic spike of neuron j at time t = t_j^f (arrow). The response has been taken at neuron i. For s > Δ, we have plotted the function ε(s) = exp[-(s - Δ)/τ_m]{1 - exp[-(s - Δ)/τ_syn]}, representing a postsynaptic potential for excitatory synaptic input with synaptic time constant τ_syn = 4 ms and membrane time constant τ_m = 10 ms. (c) A more elaborate refractory kernel (with four different time constants referring to four different ion channels) gives rise to intrinsic bursting (d), which is a direct consequence of the subsequent hyperpolarization and depolarization exhibited by η. In (d), a neuron with threshold ϑ = 0.1 receives a constant input current. The membrane voltage has been given in arbitrary units.
Figure 2: Geometric method: Excitatory couplings. All active neurons have fired at t = 0. The next spike occurs if J_0 ε(t) (solid line) crosses the decreasing threshold ϑ - η(t) (dashed). We have sketched two situations: short (Δ_1) and long delay (Δ_2 > Δ_1). The coherent oscillation is stable for excitatory couplings with relatively long delays but not for short delays; stable and unstable have been denoted by (s) and (u), respectively.
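The graphic construction of Figure 2 can be reproduced numerically: solve J_0 ε(T) = ϑ - η(T) for the period T and inspect the slope of ε at the intersection. The kernel time constants below follow Figure 1, while J_0, ϑ, and the exponential refractory decay are illustrative choices, not values from the text.

```python
import numpy as np

# Kernel time constants follow Figure 1; J0, THETA, and the exponential
# refractory decay TAU_R are illustrative assumptions (all times in ms).
TAU_M, TAU_S = 10.0, 4.0
TAU_R, THETA, J0 = 4.0, 0.25, 1.0

def eps(t, delta):
    """Excitatory postsynaptic kernel, shifted by the axonal delay."""
    u = np.asarray(t, dtype=float) - delta
    return np.where(u > 0.0,
                    np.exp(-u / TAU_M) * (1.0 - np.exp(-u / TAU_S)), 0.0)

def oscillation_period(delta, dt=1e-3, t_max=40.0):
    """Solve J0*eps(T) = THETA - eta(T), with eta(t) = -exp(-t/TAU_R).

    Returns the period T and whether the intersection lies on the
    ascending branch of eps (the stability condition of the plot)."""
    t = np.arange(dt, t_max, dt)
    gap = J0 * eps(t, delta) - (THETA + np.exp(-t / TAU_R))
    i = int(np.argmax(gap >= 0.0))        # first crossing of the threshold
    T = float(t[i])
    ascending = bool(eps(T + dt, delta) > eps(T - dt, delta))
    return T, ascending

T_short, stable_short = oscillation_period(delta=1.0)  # short axonal delay
T_long, stable_long = oscillation_period(delta=6.0)    # longer axonal delay
```

With the short delay the curves intersect on the descending branch of ε, so no stable coherent oscillation arises, while the longer delay moves the intersection onto the ascending branch, exactly as sketched in the figure.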
Locking theorem. In a spatially homogeneous network of spiking neurons with standard dynamics, a necessary and, in the limit of a large number n of presynaptic neurons (n → ∞), also sufficient condition for a coherent oscillation to be asymptotically stable is that firing occurs when the postsynaptic potential arising from all previous spikes is increasing in time.

Let us now turn to Figure 2. The horizontal axis is the time axis, and the vertical axis displays the response of a "typical" neuron. The network under consideration has excitatory interactions only. Each neuron has short-term memory and receives input from n >> 1 other neurons through synaptic weights J_0/n; the normalization by 1/n is just convenient. We suppose that all neurons fire at time t = 0. Each neuron then feels its refractory field η. The action potentials have disappeared into the axons, but after a delay of Δ ms they reappear at the dendritic trees and induce a response at the soma, which is described by the function ε. If the postsynaptic potential at the soma reaches the threshold ϑ of the neuron, so that (J_0/n) n ε(s) + η(s) = ϑ or, equivalently, J_0 ε(s) = ϑ - η(s), then all the neurons will fire again. This leads to a simple graphic solution for T. As is evident from the plot, in firing again, a neuron still feels its refractory field. If the delay Δ is too short, the point of intersection of ε(s) and ϑ - η(s) is in the descending part of ε, and no stable oscillation can arise. If, however, Δ is a bit longer, then the point of intersection of the two curves is in the ascending part of ε, and a coherent oscillation
Wulfram Gerstner, J. Leo van Hemmen, and Jack D. Cowan
is stable. Once we know the locking theorem, existence and stability can indeed be verified geometrically. The inhibitory case of Figure 3 does not provide any additional difficulty. It is plain that, to get a response from this purely inhibitory system, we need a stimulus I0 > 0. Again we suppose that all (possibly selected) neurons fire at time t = 0. Of course, each neuron feels its refractory field η. The action potentials disappear into the axons, but after a delay of Δ ms they reappear at the dendritic trees and induce a response at the soma via the function ε^inh, which is now negative. The neurons will fire again, provided J0 ε^inh(s) + I0 = ϑ − η(s). For small Δ's or short-lived inhibitory potentials, the neuron still notices its refractory past and the point of intersection is in the ascending part of ε^inh (Fig. 3a). If the delay lasts long enough, then η plays no role any more (Fig. 3b), and we are left with the condition I0 + J0 ε^inh(s) = ϑ and, hence, stability. In the presence of mere inhibition, the oscillation is stable for a wide range of delays Δ, in contrast to the excitatory case, where the stability depends critically on Δ. Systems with both excitatory and inhibitory interactions are in general more interesting from a neurobiological point of view and will be treated in Section 5. Though it is a simple matter to play around with delays and parameters, we will not pursue this issue here and turn instead to the mathematics of our locking argument. Before delving into the details of the proof, whose geometric essence can be found in Figure 4, we quickly indicate the relation between the usual integrate-and-fire models and the spike response model as it is employed in this paper.

3 Relation to Integrate-and-Fire Models
In integrate-and-fire models, firing leads to an immediate reset of the membrane potential. We denote the membrane potential of an integrate-and-fire neuron by h̃_i(t) and its threshold by ϑ̃. Firing occurs if h̃(t) = ϑ̃. This defines a firing time t_i^f, and the reset requirement is

lim_{δ→0+} h̃(t_i^f + δ) = 0.
Between two firings, the change of the membrane potential is given by the equation of a simple RC circuit charged by a current I0 + I_i(t),

(d/dt) h̃_i(t) = −h̃_i(t)/τ + I0 + I_i(t).  (3.2)

Here I0 is a constant external current that is identical for all neurons. The time-dependent contribution is due to the input from other neurons,

I_i(t) = Σ_j J_ij Σ_f α(t − t_j^f).  (3.3)

As before, J_ij is the synaptic weight representing the input amplitude. The function α(s) is the typical input current caused by a presynaptic
What Matters in Neuronal Locking?
Figure 3: Geometric method: Weak (a) and strong (b) inhibitory couplings. All neurons have fired at t = 0. The next spike occurs if I0 + J0 ε(t) (solid line) crosses the decreasing effective threshold ϑ − η(t) (dashed line). In the case of strong and long-lasting inhibition, refractoriness has disappeared and, thus, η already vanishes before the next spike is generated. The coherent oscillation is stable in both (a) and (b).

spike. Choices of the function α include α(s) = δ(s), where δ is the Dirac delta function; α(s) = δ(s − Δ), where Δ is a delay; α(s) = s0^{−1} Θ(s) Θ(s0 − s), for a short square pulse, where Θ(s) is the Heaviside unit step function; or α(s) = (s/τ²) exp(−s/τ), for a more realistic description of the synaptic input current that also obeys the pleasant normalization ∫0^∞ α(s) ds = 1. We note that the reset condition is equivalent to a current pulse −ϑ̃ δ(s) in equation 3.2. Since equation 3.2 is a linear differential equation, it can be integrated and yields
h̃_i(t) = Σ_f η̃(t − t_i^f) + Σ_j J_ij Σ_f ε̃(t − t_j^f) + τ I0 [1 − exp(−t/τ)]  (3.4)
with (a prime always denoting a derivative)

η̃(s) = −ϑ̃ exp(−s/τ) Θ(s)

and

ε̃(s) = ∫0^s exp[−(s − s′)/τ] α(s′) ds′.

The last term in equation 3.4 was adjusted so that the initial value of h̃_i is h̃_i(0) = 0. We note that for t ≫ τ the initial condition does not play any role, and the last term approaches the constant value τ I0. If we define h_i(t) = h̃_i(t) − τ I0 and ϑ = ϑ̃ − τ I0, we are back at equation 1.1. We would like to emphasize that the spike response model (equation 1.1) is more general than the integrate-and-fire model (equation 3.2) in that we can use arbitrary response kernels ε and η. A typical example of these response kernels has been presented in Figure 1.

4 Locking
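Before turning to locking, the integrate-and-fire filtering of Section 3 can be checked numerically. A minimal sketch, under one consistent reading of the kernels: an alpha-shaped input current α(s) = (s/τ_s²) exp(−s/τ_s), normalized to 1, is low-pass filtered by the membrane into ε̃(s) = ∫0^s exp[−(s − s′)/τ] α(s′) ds′. All parameters are illustrative.

```python
import numpy as np

tau = 4.0    # membrane time constant (illustrative)
tau_s = 2.0  # synaptic time constant of alpha(s) (illustrative)
dt = 1e-2
s = np.arange(0.0, 80.0, dt)

# Normalized input current alpha(s) = (s/tau_s^2) exp(-s/tau_s)
alpha = (s / tau_s**2) * np.exp(-s / tau_s)
norm = np.sum(alpha) * dt          # approximates int_0^inf alpha(s) ds = 1
print("normalization:", norm)

# Membrane filtering: eps_tilde(s) = int_0^s exp(-(s-s')/tau) alpha(s') ds'
decay = np.exp(-s / tau)
eps_tilde = dt * np.convolve(alpha, decay)[: len(s)]

# eps_tilde solves d/ds eps = -eps/tau + alpha; check the residual on the grid
residual = np.max(np.abs(np.gradient(eps_tilde, dt) + eps_tilde / tau - alpha)[1:-1])
print("max ODE residual:", residual)
print("alpha peaks at s =", s[np.argmax(alpha)],
      "; eps_tilde peaks at s =", s[np.argmax(eps_tilde)])
```

The filtered kernel peaks later than the input current, which is the qualitative point: the membrane turns a brief input current into a delayed, smooth postsynaptic potential.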
In the following subsections, we study a coherent state of a spatially homogeneous network of N neurons labeled by 1 ≤ i ≤ N and construct this network state self-consistently in such a way that the period T follows directly. We first handle the existence and then turn to the stability of a coherent oscillation. The word coherent should be constantly borne in mind because it plays a key role in both the existence and the stability proof. Once a homogeneous system of spiking neurons with short-term memory behaves coherently, it cannot but oscillate. As such, oscillations are not a deep network property but simply a consequence of the connectivity and the spike dynamics of neurons. In the present context, spatial homogeneity means that all neurons are of the same type; they have identical ε and η kernels, and have the same "gross" synaptic input: Σ_j J_ij = J0 for all 1 ≤ i ≤ N.
4.1 Existence of Coherent Solutions. In a coherent state, all neurons of the network fire synchronously and with the same period T. For the sake of convenience we adjust the origin t = 0 so that regular firing occurs at ℓT with integer ℓ. Let us assume that neurons have fired regularly in the past t ≤ 0. More precisely, we assume that synchronous firing has occurred at t = ℓT with ℓ = 0, −1, −2, …. For 0 < t < T the membrane potential of neuron i is then given by

h_i(t) = Σ_{ℓ=0}^{∞} [ η(t + ℓT) + J0 ε(t + ℓT) ].  (4.1)
The next coherent firing should occur at time t = T. This means that h_i(t) reaches the threshold ϑ at time t = T and, hence, yields a self-consistency requirement for T,

ϑ = Σ_{ℓ=1}^{∞} [ η(ℓT) + J0 ε(ℓT) ].  (4.2)
More precisely, T = inf{t > 0 | h_i(t) = ϑ}. Since we have h_i(t) < ϑ for t < T, the membrane potential h_i(t) reaches ϑ from below, and thus h_i′(T) > 0. Usually the term ℓ = 1 dominates the sum in equation 4.2, and we end up with the simple equation

J0 ε(s) = ϑ − η(s),  (4.3)

which allows a straightforward graphic interpretation (cf. Figs. 2 and 3). Note that a delay Δ has been incorporated in ε. An oscillatory solution exists if the two functions J0 ε(s) and ϑ − η(s) cross at some point s′. If there are several crossing points, the first one (smallest s′) gives the oscillation period T = s′. For neurons with short-term memory, that is, with η(s) = ε(s) = 0 for s ≥ 2T, equation 4.3 is exact. For a general neuron model with adaptation, however, memory lasts longer and we have to use equation 4.2 instead of 4.3.

4.2 Asymptotic Stability of Coherent Solutions. So far we have concentrated on the existence of coherent solutions. In the following we check whether the solutions are stable with respect to small perturbations; that is, we perform a linear stability analysis. To be specific, we consider a perturbation of the neuronal firing pattern as it occurred in the past t ≤ 0. In the unperturbed situation, all neurons would have fired synchronously up to t = 0, but now they do so at times {ℓT + δ_i(ℓ); ℓ = 0, −1, −2, … and 1 ≤ i ≤ N}. We assume |δ_i(ℓ)| ≪ T since we perform a linear stability analysis. For t > 0, the membrane potential is no longer given by equation 4.1 but by
h_i(t) = Σ_{ℓ=0}^{∞} η(t + ℓT − δ_i(−ℓ)) + Σ_j J_ij Σ_{ℓ=0}^{∞} ε(t + ℓT − δ_j(−ℓ)).  (4.4)

At time t = T the actual firing is, in general, either slightly earlier or later, and neuron i fires at T + δ_i(1) instead of T. The time shift δ_i(1) can be found from the threshold condition h_i(T + δ_i(1)) = ϑ, given the past. We use equation 4.4, linearize with respect to all the δ_j(ℓ) in sight, and take advantage of the unperturbed threshold condition (equation 4.2). In order to simplify the ensuing notation, we introduce the abbreviations
η′_ℓ = η′(ℓT)  and  ε′_ℓ = ε′(ℓT).  (4.5)
After a bit of algebra we then find

δ_i(1) = [ Σ_{ℓ=0}^{∞} η′_{ℓ+1} δ_i(−ℓ) + Σ_j J_ij Σ_{ℓ=0}^{∞} ε′_{ℓ+1} δ_j(−ℓ) ] / [ Σ_{ℓ=1}^{∞} η′_ℓ + J0 Σ_{ℓ=1}^{∞} ε′_ℓ ].  (4.6)
Here 𝔽 is a linear map from the past δ onto the present, that is, {δ_i(1); 1 ≤ i ≤ N} = δ(1) = 𝔽δ. Doing linear perturbation theory, we simply iterate 𝔽. Proving asymptotic stability of a coherent oscillation means showing that lim_{k→∞} 𝔽^k(δ) = 0 for an arbitrary but fixed δ. We will verify below whether δ can be truly arbitrary. Equation 4.6 is a key result of our stability analysis. Before proceeding we consider a special solution: δ_i(−ℓ) = α for all i and ℓ. It is an easy task to verify that δ_i(1) = α as well. That is, a uniform shift in time cannot be corrected. This is not too surprising since a system of integrate-and-fire or Hodgkin-Huxley neurons or anything else that is described by a system of ordinary differential equations is unable to correct a uniform shift in time either. Mathematically, our perturbations δ therefore have to exclude a uniform time shift. Physically, the class of perturbations induced by internal "noise" or some additional stochastic input is much more restricted. Time shifts seem to be random. More precisely, we expect them to be independent, identically distributed random variables with mean zero and finite variance. If n with n ≫ 1 denotes the number of neighbors j of neuron i, then n^{−1} Σ_j δ_j(−ℓ) ≈ 0, whatever ℓ ≥ 0 and whatever the neuron i and its surroundings we consider. In passing, we note that n is typically of the order of a thousand or more in a vertebrate brain. Random perturbations occur all the time, but the ones stemming from the past should not blow up in the future; rather they should decay. That is why we have to iterate 𝔽 for a fixed argument δ and show that the result approaches zero. The condition 𝔽^k δ → 0 as k → ∞ means that the matrix should have all its eigenvalues in the open unit disc {λ; |λ| < 1}. The above eigenvector (1, 1, …, 1) with eigenvalue 1 contradicts this condition.
We therefore have to require that it be the only one; that is, 1 is nondegenerate (simple), its eigenvector is to be excluded, and all the other eigenvalues of 𝔽 are less than 1 in absolute value. In passing we note that, in mathematical terms, plain instead of asymptotic stability, that is, when perturbations do not blow up but need not decay, is much cheaper. We only have to require that |λ| ≤ 1 and need not worry about any further condition. In order to interpret equation 4.6, we assume a network where each neuron receives input from n neighbors¹ (n ≫ 1) through homogeneous couplings J_ij = J(i − j), where i and j are vectors on a two-dimensional lattice and J(i) is absolutely summable, that is, Σ_i |J(i)| < ∞. There is no
¹One can, but need not, think of the set of "neighbors" as a local ensemble. In the present context, it simply means the collection of presynaptic neurons.
harm in assuming Σ_j J_ij = J0, whatever i. Equation 4.6 is now rewritten

δ_i(1) = (1/h′) [ Σ_{ℓ=0}^{∞} η′_{ℓ+1} δ_i(−ℓ) + J0 Σ_{ℓ=0}^{∞} ε′_{ℓ+1} ⟨δ(−ℓ)⟩ ],  (4.7)
where h′, the denominator of equation 4.6, is the derivative of h in equation 4.1 taken at time T. It is bound to be positive as the membrane potential approaches the threshold from below. Furthermore, we have introduced the mean shift J0 ⟨δ(−ℓ)⟩ = Σ_j J_ij δ_j(−ℓ), with j ranging through the set of n neighbors of i. Let us assume that the mean shift ⟨δ(−ℓ)⟩ vanishes for all ℓ ≥ 0. If the number of neighbors n is large and perturbations are random, then ⟨δ(−ℓ)⟩ ≈ 0 is a quite natural assumption. It is a simple consequence of the strong law of large numbers (Lamperti 1966; Breiman 1968). Given that ⟨δ(−ℓ)⟩ vanishes for all ℓ, ⟨δ(1)⟩ vanishes as well, a direct mathematical consequence of equation 4.7. Vanishing mean time shifts characterize a class of perturbations and thus lead to a necessary condition for a coherent oscillation to be stable. If the above argument applies, which seems fair, then this condition is also sufficient. For the moment we simply set ⟨δ(−ℓ)⟩ = 0 and obtain from equation 4.7

δ_i(1) = Σ_{ℓ=0}^{∞} η′_{ℓ+1} δ_i(−ℓ) / h′.  (4.8)
This becomes truly simple for models with short-term memory where ε(s) = η(s) = 0 for s ≥ 2T, so that the contributions ε′_ℓ and η′_ℓ can be neglected for ℓ beyond 1, and equation 4.8 reduces to
δ_i(1) = [ η′_1 / (η′_1 + J0 ε′_1) ] δ_i(0).  (4.9)

This is what we have used to obtain the geometric construction of Section 2. Equation 4.9 tells us two things. First, if J0 ε′_1 > 0, then the fraction on the right is less than one, and a perturbation is bound to decrease after each spike. On the other hand, once J0 ε′_1 < 0 and not too large in absolute value, a perturbation has to increase in time and the oscillation is unstable. The denominator in equation 4.9 is h′, that is, the derivative of equation 4.1 evaluated at time T. Since T as given by equation 4.2 determines the firing time and, on firing, the membrane potential approaches the threshold ϑ from below, h′ is always positive. We end up with a dichotomy: the oscillation is stable if J0 ε′_1 > 0 and unstable for J0 ε′_1 < 0. Three final remarks concerning equation 4.9 are in order. First, J0 ε′_1 > 0 means that firing occurs while the postsynaptic potential is increasing. Second, if the neuron has forgotten its past before the next firing so that η′_1 vanishes, then it is bound to reappear "in phase," and the oscillation is asymptotically stable. Finally, a simple geometric illustration of the stability proof can be found in Figure 4.
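The dichotomy of equation 4.9 can be seen by iterating the one-step map directly. The slope values η′_1 and J0 ε′_1 below are illustrative numbers, not derived from any particular kernel:

```python
# Iterate delta(k+1) = eta1p/(eta1p + J0_eps1p) * delta(k), the linearized
# one-step map of equation 4.9. The slopes are illustrative assumptions;
# in both cases the denominator h'(T) = eta1p + J0_eps1p stays positive.

def iterate_shift(eta1p, J0_eps1p, delta0=0.1, n=50):
    """Return the time shift of a single neuron after n periods."""
    factor = eta1p / (eta1p + J0_eps1p)
    delta = delta0
    for _ in range(n):
        delta *= factor
    return delta

stable = iterate_shift(eta1p=0.5, J0_eps1p=0.3)    # J0*eps' > 0: rising flank
unstable = iterate_shift(eta1p=0.5, J0_eps1p=-0.2) # J0*eps' < 0: falling flank
print(stable, unstable)
```

For J0 ε′_1 > 0 the shift contracts toward zero after every spike; for J0 ε′_1 < 0 (with h′ still positive) the very same map expands it.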
Figure 4: Geometric illustration of the locking argument. All neurons have fired at t = 0 except for a single neuron, which is late by an amount δ⁰ > 0. It fires again if I0 + J0 ε(t) (solid line) crosses the decreasing effective threshold ϑ − η(t − δ⁰) (dashed). The neuron is now late by an amount δ¹ < δ⁰ as long as the dashed lines cross the rising part of ε. One "sees" this explicitly by comparing the projection δ¹, indicated by an arrow, with δ⁰; both appear in the lower left-hand corner. If the dashed lines have intersections with the falling part of the response function ε, then δ¹ > δ⁰ and the coherent oscillation is bound to be unstable.
What happens if we relax the condition of short-term memory? Neurons with a standard dynamics such as integrate-and-fire units have η′(s) ≥ 0 for all s (cf. Fig. 1a). As shown in the Appendix, stability then leads to the requirement

J0 Σ_{ℓ=1}^{∞} ε′_ℓ > 0.  (4.10)

In other words, also in the general case asymptotic stability of the locked state requires that the total synaptic input be increasing at the moment when the neurons fire. This proves the necessary condition mentioned in the locking theorem. In general, one or several terms in the sum (equation 4.10) may be negative as long as the sum of all terms is positive. In fact, under the side condition of vanishing mean time shift (n → ∞), the condition (equation 4.10) is also sufficient to guarantee asymptotic stability. The reader may wonder whether one can do without the side condition of vanishing mean shifts completely. The answer is yes, if we impose an additional constraint. We assume a standard dynamics and, in addition, require J0 ε′_ℓ ≥ 0 for all ℓ ≥ 1. In other words, we have a network of inhibitory neurons whose postsynaptic potentials decay monotonically
or excitatory neurons whose potentials increase monotonically. Then the general stability matrix 𝔽 as described by equation A.2 in the Appendix is a stochastic one. That is, its entries are nonnegative, and all row sums equal 1. The eigenvalues are in absolute value less than or equal to 1; the matrix is indecomposable because of its special form (equation A.2); the eigenvalue λ = 1 is nondegenerate; the corresponding eigenvector (1, 1, …, 1) is to be excluded; and there is no way to reduce 𝔽 to "cyclic form," so that all the other eigenvalues are in the open unit disc {λ; |λ| < 1} (Horn and Johnson 1985; Gantmacher 1959). We decompose the initial vector δ with respect to the eigenvectors of 𝔽 (Jordan decomposition) and iterate. Since there is no eigenvalue with |λ| = 1 present in the decomposition, all the λ^k converge to zero as k → ∞. So we are done. This applies in particular to a system of leaky integrate-and-fire neurons with purely inhibitory interactions.
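The stochastic-matrix reasoning is easy to check numerically. A minimal sketch with illustrative first-row entries: a companion matrix whose nonnegative first row sums to 1 has the simple eigenvalue 1 and all other eigenvalues strictly inside the unit disc; for contrast, a first row with a negative entry outside [−1, 1] (the situation analyzed in Section 4.3) pushes a second eigenvalue out of the disc.

```python
import numpy as np

def companion(first_row):
    """Stability matrix of the Appendix: given first row, shifted identity below."""
    n = len(first_row)
    F = np.zeros((n, n))
    F[0, :] = first_row
    F[1:, :-1] = np.eye(n - 1)   # after one period the present joins the past
    return F

F = companion([0.5, 0.3, 0.2])   # stochastic case: nonnegative, row sum 1
lam = np.linalg.eigvals(F)
print("stochastic spectrum (moduli):", sorted(abs(lam)))

F_bad = companion([2.5, -1.5])   # l_max = 2 with F01 = -1.5 outside [-1, 1]
lam_bad = np.linalg.eigvals(F_bad)
print("counterexample spectrum (moduli):", sorted(abs(lam_bad)))
```

In the stochastic case every eigenvalue except λ = 1 has modulus below 1, so any perturbation orthogonal to the uniform shift decays; in the second case the eigenvalue −F₀₁ = 1.5 signals escape from the T-periodic state.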
(4.11) The corresponding matrix F (cf. the Appendix) now has the entries Foe = (V;+I + ~ O & ~ + I ) / ( C+ Il ~ o ~Oh V + ~for ;) + ~0 I e I P, - 1 in the first row and F,,, = hP,,,+1 for ,u 2 1. Because all row sums equal 1, there is an eigenvalue X1 = 1 corresponding to the eigenvector (1, 1, 1. . . .), a uniform time shift. We ask whether all other eigenvalues are less than 1 in absolute value. First we study a special case. Let us assume that 7/;+1 JooE;+l 2 0 for all e 2 0. We then arrive at a stochastic matrix and can repeat the arguments of the previous paragraph so as to conclude that all the other eigenvalues are in absolute value less than unity. Thus the neurons relax to the T-periodic state. In general, the situation is more complicated since v; + lo&; can be negative for some e. Take, for instance, em,, = 2. Then the eigenvalues are 1 (always present) and - F O ~ . Thus, stability requires -1 < 501 < 1. We have the boundary condition 500 Fol = 1. If 501 is outside the interval [--l.l], then the neurons can remain coherent but escape from the T-periodic state. The state that evolves out of such an instability can be a collective bursting with the intervals between the coherent spiking of the neurons varying systematically, for example, a limit cycle of period TI + TI where the collective interspike intervals alternate between TI and T2 (cf. the Appendix, nonvanishing mean time shifts). In contrast to
the intrinsic burster of Figure 1d, this would be a network effect. The example shows that the condition of the locking theorem is necessary but need not be sufficient as soon as the side condition of vanishing mean time shift is to be dropped, for instance, because n is too small. Then additional requirements may, but need not, apply. Stepping back for an overview, we want to isolate what requirements guarantee that equation 4.10 is both a necessary and a sufficient condition for a coherent excitation to be stable in a spatially homogeneous network of spiking neurons. There are two conditions. First, we have to restrict the network structure and require full or, at least, high connectivity. In this case, any perturbation can be separated into a uniform time shift of all neurons and a set of single-neuron time shifts with vanishing mean. We have argued that both a vanishing mean and the absence of uniform time shifts are quite natural for system-inherent perturbations of a biological network where the number of neighbors n is large, the more so since coherent oscillations in the brain will last for only a finite amount of time. Second, to eliminate the (we admit, rather academic) possibility that different uniform time shifts δ(−ℓ) lead to an "exploding" coherent oscillation, we would have to require, say, short-term memory with ε(s) = η(s) = 0 for s ≥ 2T. Additional, especially experimental, work is needed to explore whether this requirement is really necessary or just academic. Our results also hold in randomly diluted systems and can be extended to include variations of the parameters such as the delays (Gerstner et al. 1993). A similar analysis can be used to study semicollective oscillations where the neurons spontaneously divide themselves into two or more groups of synchronous units (Gerstner and van Hemmen 1993; Gerstner 1995).

5 Discussion and Summary
It is time to harvest some corollaries. Before doing so we discuss the essentials of our approach. We finish the paper with a summary.

5.1 Discussion. What is the gist of what we have done? We have seen that (axonal) delays in the millisecond range are quite important. The mathematics of standard stability theory for systems with delays is very intricate (Hale 1977), not to say nasty, and the upshot, an entire function with infinitely many zeros, which all have to be located and proved to possess a negative real part, is hardly accessible to immediate analysis, if any. We have therefore proposed a more biophysical approach that directly tackles the time evolution of a perturbation: a collection of time shifts.

Figure 5: Geometric method: Combination of excitatory and inhibitory couplings. All neurons have fired at t = 0. The next spike occurs once I0 + J0 ε(t) (solid line) crosses the decreasing effective threshold ϑ − η(t) (dashed line). We assume short-range inhibition (short delay) and long-range excitation (long delay). The excitatory and inhibitory contributions are indicated by dotted lines. The sum of both yields the postsynaptic potential J0 ε(t). The oscillation with period T is stable (s) since η = 0. A similar construction applies to the case of excitation with short delay and inhibition with long delay.

In Section 2, Figures 2 and 3, we have shown that coherent oscillations can exist in a system with purely excitatory interactions provided the delays are long enough, that is, exceed a lower bound. On the other hand, in networks with purely inhibitory interactions, coherent oscillations are always stable, provided the delay is less than some upper bound. Most neurobiologically relevant systems, however, consist of a mixture of both excitatory and inhibitory interactions. Here we consider two models, which are, in a sense, each other's opposite. First, the inhibitory interaction is assumed to be short range and, hence, is to be associated with short delays. On the other hand, the excitatory interaction is long range and thus equipped with long delays. As is exemplified by Figure 5, here too a collective oscillation is stable. A companion model is the one with short-range excitation and long-range inhibition. One easily verifies that a similar construction shows that this setup also allows for stable coherent excitations. It is fair to summarize these results by saying that stability is determined by a subtle interplay between axonal delays, postsynaptic potentials, and refractory behavior. Gerstner et al. (1993) and Ritz et al. (1994) have extensively studied a system with medium- or long-range excitatory interactions and a strictly local inhibition so as to represent a local but finite-range inhibitory interaction in a simplified way.
"Strictly local" means that each neuron has a self-inhibitory loop with delay Δ. The analytical and computational advantages are evident, but one may wonder whether this setup can be integrated into the present formalism. The answer is in the affirmative
as one sees most easily by noticing that a self-inhibitory loop is nothing but a kind of refractory behavior and thus can be incorporated in η.
5.2 Integrate-and-Fire Neurons Revisited. Finally, it may be worthwhile to discuss a subtler, though truly academic, case that has excitatory couplings with zero delay and postsynaptic potentials with a very short rise time. Most of the integrate-and-fire models studied so far belong to this class (Mirollo and Strogatz 1990; Abbott and van Vreeswijk 1993; Tsodyks et al. 1993; Treves 1993; Usher et al. 1993). Because interactions are now instantaneous, neurons receive an excitatory postsynaptic potential as soon as one of the presynaptic neurons fires. In particular, a neuron that is late as compared to a collective oscillation experiences an extra contribution to its membrane potential (equation 4.4) of the form J0 ε(t). In other words, we have to include the ℓ = −1 term in equation 4.4. If we start linearizing in the shifts δ_i(ℓ), we have to take care of an extra term ε′(0). More precisely, let us assume that lim_{s→0+} dε(s)/ds ≫ 0. Admittedly, this is somewhat academic but illustrates the underlying locking principle quite nicely. The function ε(s) is not differentiable at s = 0 since ε(s) = 0 for s < 0, so lim_{s→0−} dε(s)/ds = 0. Hence a straightforward linearization at s = 0 is not possible. Nevertheless, we can derive analytical results if we work out the cases of positive (δ⁰ > 0) and negative shifts (δ⁰ < 0) separately. Let us focus on the situation where a single neuron i is too early (δ_i⁰ < 0) and all other neurons are firing too late by a small amount δ⁰ > 0 so that ⟨δ⁰⟩ = 0. In this case, we can use equation 4.9 with (formally) ε′ < 0. Thus, |δ_i¹| > |δ_i⁰| and the shift increases. On the other hand, a neuron that is late by an amount δ⁰ > 0 will experience an input due not only to the firings of previous cycles but also to the spikes of the very same cycle. Thus, we have to include a contribution ∝ lim_{s→0+} dε(s)/ds ≫ 0. This gives a large, positive contribution and results in a new effective ε′ ≫ 0.
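The one-sided behavior just described can be caricatured by a piecewise-linear map for a single neuron's time shift. The contraction and expansion factors below are invented for illustration and are not derived from any kernel:

```python
# Toy illustration of the zero-delay asymmetry: a late neuron (delta > 0)
# receives same-cycle input (large effective eps') and is strongly pulled
# back, while an early neuron (delta < 0, formally eps' < 0) drifts away.
# The factors 0.1 and 1.3 are illustrative assumptions.

def one_step(delta):
    """One-sided linearized map for the time shift of a single neuron."""
    if delta > 0:
        return 0.1 * delta   # late: strong locking signal, shift collapses
    return 1.3 * delta       # early: shift grows in magnitude

late, early = 0.05, -0.05
for _ in range(10):
    late, early = one_step(late), one_step(early)
print(late, early)
```

After a few cycles the late neuron has essentially rejoined the oscillation while the early one has drifted substantially, mirroring the two bars of Figure 6.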
Thus a neuron that is late with respect to a collective oscillation receives a strong locking signal and is immediately pulled back into synchronous firing. A neuron that fires too early, however, will fire even earlier during the next cycle (cf. Fig. 6). In principle it may happen that after several cycles, the neuron is early by nearly a full period. In this case we can consider it as being late as compared to the previous cycle, and, thus, it will be pulled into the collective oscillation. In the long run, it may happen that a collective oscillation rebuilds itself even though it is locally unstable. Since our mathematical argument is a local one and the above considerations are global, we cannot predict whether this actually happens. Mirollo and Strogatz (1990) have shown that for some models with delayless interactions, a collective oscillation is indeed the only solution. A different form of a global argument has been put forward by Herz and Hopfield (1995; Hopfield and Herz 1995), who analyze a system
Figure 6: Excitation with zero delay. (a) In a coherent oscillation, neurons would fire with a period T given by the intersection of the decreasing effective threshold ϑ − η (dashed) and the excitation J0 ε. The whole pattern is repeated with period T. (b) If one of the neurons fires too early, at time t = T + δ⁰ with δ⁰ < 0, or too late, if δ⁰ > 0, the decreasing threshold is shifted to the left or to the right, respectively (dotted lines). A shift to the left is increased after another period; a shift to the right is decreased. Thus, a neuron that has fired too late will be pulled back into the collective oscillation (short bar to the right of 2T), whereas a neuron that has fired too early drifts away (long bar to the left of 2T).
of nonleaky integrate-and-fire neurons with excitatory nearest-neighbor couplings J_ij ≥ 0 and indicate a Lyapunov function under the conditions Σ_j J_ij = J and Σ_i J_ij = J. Their "ingoing" condition Σ_j J_ij = J, whatever i, is directly understood once we invoke the geometric method so as to construct the solution self-consistently. As we have seen, local stability with four nearest neighbors is easily obtained, but it is hard to prove global stability. It is exactly here that a Lyapunov function pays off. It can be shown that for their nonleaky system with excitatory interaction, a whole family of solutions exists, including the fully coherent state, partially synchronized states, and asynchronous firing (Herz and Hopfield 1995).
5.3 Summary. In summary, being very conservative and, thus, dropping all side conditions, we have proved that a collective oscillation in a fully connected network of spiking neurons with standard dynamics and short-term memory [η(s) = ε(s) = 0 for s ≥ 2T, where T is the oscillation period] is an asymptotically stable solution if firing occurs while the response due to the input from other neurons (i.e., the postsynaptic potential) is increasing. More generally, if neuronal memory lasts longer and/or if the neurons receive input from n < N presynaptic neurons, then an increasing postsynaptic potential is necessary but need not be sufficient for coherent spiking. The condition is the more stringent the larger the number n of interacting neighbors. In fact, we have argued that in a spatially homogeneous network with n of the order of one thousand or more, stability is guaranteed under the single condition of an increasing postsynaptic potential as the neurons fire. As a consequence of our locking theorem, one can analyze existence and stability of a coherent oscillation through a purely geometric method, as sketched in Section 2. Stability holds for purely inhibitory interactions with practically arbitrary delays less than a large upper bound Δ < Δ_max^inh and for purely excitatory input with delays exceeding a positive lower bound Δ_min^exc, which depends on the network parameters. Delayless excitatory interactions are locally unstable, and all neurons that fire too early will drift away from the collective oscillation. We have also studied the case with both short-range inhibitory and long-range excitatory interaction, or the other way around, and found that coherent oscillations are abundantly present. This observation is also supported by a stability analysis of incoherent firing states.
It can be shown that incoherent states are almost always unstable, and low-amplitude oscillations can form spontaneously (Abbott and van Vreeswijk 1993; Gerstner and van Hemmen 1993; Gerstner 1995). In other words, oscillations in a network of spiking neurons seem to be omnipresent, and one has to explain why they are not found that abundantly in nature. That, maybe, is an interesting problem, which so far has not been faced.
Appendix

In this Appendix we exhibit the full mathematical structure associated with the stability matrix 𝔽 as defined in equation 4.6. First, we discuss the general mathematical framework; then we perform the stability analysis for equation 4.8.
General Formalism. Because of spatial homogeneity, there was no harm in assuming I_i = I0, whatever i. We define h′ to be the denominator of equation 4.6, denote by J the matrix (J_ij) and by 1 the unit matrix,
and rewrite the equation as

δ(1) = (h′)^{−1} Σ_{ℓ=0}^{ℓ_max−1} [ η′_{ℓ+1} 1 + ε′_{ℓ+1} J ] δ(−ℓ) =: Σ_{ℓ=0}^{ℓ_max−1} A(ℓ + 1) δ(−ℓ).  (A.1)
During the next time step, δ_i(1) also belongs to the past. So we are working in the Hilbert space H, which is a direct sum of copies of ℝ^N with the usual inner product, labeled by ℓ running from 0 to ℓ_max − 1. Both η′_ℓ and ε′_ℓ vanish for ℓ beyond ℓ_max, the minimal one that does this job. In H we define 𝔽 by a matrix whose elements are operators. Its first row stems from equation A.1, whose left-hand side is now called δ(0), and the other rows follow from the observation that, after one period, the present has been shifted into the past, and so on. That is, (𝔽δ)(−1) = δ(0), (𝔽δ)(−2) = δ(−1), …, so that row μ is of the form δ_{μ,ν+1} 1. Thus we obtain the matrix
𝔽 = ( A(1)  A(2)  A(3)  ⋯  A(ℓ_max − 1)  A(ℓ_max)
       1     0     0    ⋯  0             0
       0     1     0    ⋯  0             0
       ⋮                                 ⋮
       0     0     0    ⋯  1             0 ).  (A.2)
Proving asymptotic stability of a coherent oscillation means showing that lim_{k→∞} 𝔽^k(δ) = 0 for fixed δ. It is the matrix (A.2) that has to be iterated.
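A minimal numerical sketch of this iteration, for the reduced (vanishing-mean) form of the matrix discussed in the next subsection; the kernel slopes are illustrative, chosen nonnegative with J0 Σ ε′ > 0 so that the locking condition holds:

```python
import numpy as np

# Build the lmax x lmax reduced stability matrix: first row eta'_{l+1}/h',
# shifted identity below. The slope values are illustrative assumptions.

eta_p = np.array([0.5, 0.2, 0.1])      # eta'_1, eta'_2, eta'_3   (assumed)
J0_eps_p = np.array([0.3, 0.1, 0.05])  # J0*eps'_1, ...           (assumed)
h_prime = eta_p.sum() + J0_eps_p.sum() # denominator h'(T) > 0

lmax = len(eta_p)
F = np.zeros((lmax, lmax))
F[0, :] = eta_p / h_prime              # first row of the reduced matrix
F[1:, :-1] = np.eye(lmax - 1)          # the present is shifted into the past

delta = np.array([0.1, -0.05, 0.02])   # perturbation of the past firing times
for _ in range(200):
    delta = F @ delta                  # iterate F^k(delta)
print("perturbation after 200 periods:", delta)
```

Because the first row sums to less than 1 here, the spectral radius lies inside the unit disc and the iterated perturbation decays to zero, as the proof demands.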
Stability for Vanishing Mean Time Shifts. Here we study equation 4.8. The summations on the right-hand side have 0 ≤ ℓ ≤ ℓ_max − 1. The mean time shifts vanishing, the problem becomes local, restricted to i; its dimension is reduced, as compared to A.2, by a factor 1/N to ℓ_max, and we are left with a matrix whose first row has the entries F_{0ℓ} = A(ℓ + 1) = η′_{ℓ+1}/(Σ_ν η′_{ν+1} + J0 Σ_ν ε′_{ν+1}), the other entries being F_{μν} = δ_{μ,ν+1} once μ ≥ 1, and 0 ≤ μ, ν ≤ ℓ_max − 1. That is, the dimension of the problem equals ℓ_max. We have to estimate the eigenvalues of F. In the case of short-term memory, we are left with a 1 × 1 matrix, the fraction in equation 4.9. In the case of a standard dynamics, all the η′_ℓ are nonnegative. Furthermore, Σ_ν η′_{ν+1} + J0 Σ_ν ε′_{ν+1} = h′(T) > 0 tells us that the threshold in equation 4.2 is reached from below. Hence all the entries of F are nonnegative. That is, F is a "positive" matrix. Positive matrices have remarkable properties (Horn and Johnson 1985; Gantmacher 1959). We list a few of them. They have a natural order: A ≥ 0 if and only if A_ij ≥ 0 for all entries of the matrix A, and A ≥ B if and only if A − B ≥ 0. Let ρ(A) denote the maximal |λ| of the eigenvalues λ of the matrix A. For good reason ρ(A) is called the spectral radius. For A ≤ B, one has ρ(A) ≤ ρ(B). Adopting for vectors x the convention x > 0 once x_i > 0 for all i, one can show that Ax = λx with A ≥ 0 and x > 0 implies λ = ρ(A). Moreover, if A^m > 0 for some m (i.e., A is irreducible), then this eigenvalue is nondegenerate (simple) by a classical theorem of Perron and Frobenius, x > 0, and, for the (noncyclic) matrix
Wulfram Gerstner, J. Leo van Hemmen, and Jack D. Cowan
under consideration, it is the only eigenvalue λ with |λ| = ρ(A). The other eigenvalues are smaller in absolute value. As long as all the row sums are ≤ 1, so are all the |λ| (by the Gershgorin circle theorem [Bellman 1970], say). We now return to our problem. The sum Σ_ℓ A(ℓ) equals 1 if and only if Σ_ℓ ε'_ℓ = 0. Then F is a stochastic matrix and its eigenvector x = (1, 1, 1, ...) > 0 belongs to the eigenvalue ρ(F) = 1. In passing we note that the characteristic polynomial of F equals
$$
\det(\lambda\,\mathbb{1} - F) = \lambda^{\ell_{\max}} - \sum_{\ell=1}^{\ell_{\max}} A(\ell)\,\lambda^{\ell_{\max}-\ell}
$$
so that λ = 1 is evidently an eigenvalue. Let F̂ be a matrix with Σ_ℓ ε'_ℓ < 0 or, equivalently, Σ_ℓ Â(ℓ) > 1. We now allow the A(ℓ) ≥ 0 to increase from their old values belonging to F to their new ones associated with F̂. That is, we decrease some of the ε'_{ℓ+1} and in so doing increase some of the A(ℓ). We would like to stress that we can always arrange the transformation from F to F̂ this way. Let us start with A(ℓ_0) and write F(κ) = F + κX, where X has a single 1 in the first row at ℓ = ℓ_0 and zeros everywhere else. By increasing κ through κ = 0 we push the eigenvalue corresponding to ρ(F) = 1 through 1 at a positive rate, since by perturbation theory (Kato 1966) for κ = 0
$$
\rho(F(\kappa)) = \rho(F) + \kappa\,(y, Xx)
\tag{A.3}
$$
Here y, with F*y = y, is the eigenvector of the Hermitean adjoint matrix F* belonging to the eigenvalue ρ(F*) = 1; this matrix is also positive. The inner product (y, Xx) := Σ_ℓ y_ℓ (Xx)_ℓ is strictly positive since y > 0, either by direct computation or from general considerations. Thus for κ > 0 we find ρ(F(κ)) > 1, whereas for κ < 0 we obtain ρ(F(κ)) < 1 as a consequence of A ≤ B implying ρ(A) ≤ ρ(B). Increasing the entries A(ℓ) one after the other, we arrive at the full matrix F̂ with ρ(F̂) > 1. The corresponding eigenvector is not the uniform shift (1, ..., 1) and therefore cannot be excluded. This finishes the proof that Σ_ℓ ε'_ℓ > 0 is necessary and sufficient so as to guarantee that a coherent oscillation is asymptotically stable under perturbations with vanishing mean time shift.
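Two of the facts used above are easy to check numerically: a companion matrix F whose first-row entries A(ℓ) sum to 1 is stochastic with the uniform vector as eigenvector to ρ(F) = 1, and perturbing one first-row entry moves this eigenvalue at the first-order rate (y, Xx) of equation A.3. The coefficients below are illustrative assumptions, not values from the paper.

```python
import numpy as np

A = [0.6, 0.3, 0.1]            # illustrative first-row entries, sum to 1
F = np.zeros((3, 3))
F[0, :] = A
F[1:, :-1] = np.eye(2)         # shifted identity, as in (A.2)

x = np.ones(3)                 # uniform vector: right eigenvector for 1
assert np.allclose(F @ x, x)   # row sums are 1, so F is stochastic

X = np.zeros((3, 3))
X[0, 0] = 1.0                  # perturb the single entry A(1)

w, V = np.linalg.eig(F.T)      # left eigenvectors of F
y = np.real(V[:, np.argmax(np.abs(w))])
y = y / (y @ x)                # normalize so that (y, x) = 1

predicted = y @ (X @ x)        # first-order rate (y, Xx) from (A.3)
eps = 1e-6
rho = lambda M: max(abs(np.linalg.eigvals(M)))
measured = (rho(F + eps * X) - rho(F)) / eps
print(predicted, measured)
```

The predicted rate is strictly positive, which is the mechanism by which the eigenvalue is pushed through 1.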
Stability for Nonvanishing Mean Time Shifts. We now study a situation where all neurons have a common, nonzero, time shift δ(ℓ). The evolution of the time shift is given by equation 4.11, which reduces in the case ℓ_max = 2 to

$$
\begin{pmatrix} \delta(1) \\ \delta(0) \end{pmatrix}
=
\begin{pmatrix} 1 - F_{01} & F_{01} \\ 1 & 0 \end{pmatrix}
\begin{pmatrix} \delta(0) \\ \delta(-1) \end{pmatrix}
\tag{A.4}
$$

with eigenvalues λ_0 = 1 and λ_1 = −F01. The eigenvector to λ_1 is (−F01, 1). Let us assume F01 > 1 and consider a perturbation along the eigenvector corresponding to the eigenvalue λ_1. Specifically, we take δ(−1) = δ
What Matters in Neuronal Locking?
(that is, the second-last firing has been delayed by a small amount δ) and δ(0) = −F01 δ (that is, the last firing was too early by F01 δ). An application of equation A.4 yields that the next firing is too late by δ(1) = F01² δ, the following firing is again too early by δ(2) = −F01³ δ, and so on. It follows that, for F01 > 1, the system evolves toward a bursting state where long and short intervals alternate. For F01 < −1, the delay increases monotonically as time proceeds. The present argument is a linear stability analysis and holds in the neighborhood of the oscillatory state only. It cannot predict the new limit state that the system approaches.
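Along the eigenvector to λ_1 each new shift is −F01 times the previous one, so for F01 > 1 the shifts alternate in sign with growing amplitude. A small sketch with the illustrative (assumed) value F01 = 1.2:

```python
# Linearized evolution of the common time shift for ell_max = 2:
# along the eigenvector (-F01, 1) each firing shift is -F01 times
# the previous one, so for F01 > 1 long and short intervals alternate
# with growing amplitude: the onset of a bursting state.
F01 = 1.2            # illustrative value > 1
delta = 0.01         # small initial delay of the second-last firing
shifts = []
for _ in range(6):
    delta = -F01 * delta
    shifts.append(delta)
print(shifts)
```

The printed shifts alternate in sign and grow in magnitude by a factor F01 per firing, which is the signature of the alternating long/short intervals described above.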
Acknowledgments It is a great pleasure for Leo van Hemmen to thank Jack Cowan and the Department of Mathematics at the University of Chicago for the hospitality extended to him during his stay there, when this paper was conceived. We thank Carl van Vreeswijk (Jerusalem) for a careful reading of the manuscript and his constructive criticism, which greatly improved it.
References
Abbott, L. F. 1990. A network of oscillators. J. Phys. A: Math. Gen. 23, 3835-3859.
Abbott, L. F., and van Vreeswijk, C. 1993. Asynchronous states in a network of pulse-coupled oscillators. Phys. Rev. E 48, 1483-1490.
Bauer, H. U., and Pawelzik, K. 1993. Alternating oscillatory and stochastic dynamics in a model for a neuronal assembly. Physica D 69, 380-393.
Bellman, R. 1970. Introduction to Matrix Analysis. 2d ed. McGraw-Hill, New York.
Breiman, L. 1968. Probability. Addison-Wesley, Reading, MA.
Buhmann, J. 1989. Oscillations and low firing rates in associative memory neural networks. Phys. Rev. A 40, 4145-4148.
Bush, P. C., and Douglas, R. J. 1991. Synchronization of bursting action potential discharge in a model network of neocortical neurons. Neural Comput. 3, 19-30.
Connors, B. W., and Gutnick, M. J. 1990. Intrinsic firing patterns of diverse cortical neurons. Trends Neurosci. 13, 99-104.
Deppisch, J., Bauer, H. U., Schillen, T., König, P., Pawelzik, K., and Geisel, T. 1993. Alternating oscillatory and stochastic states in a network of spiking neurons. Network 4, 243-257.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121-130.
Eckhorn, R., Frien, A., Bauer, R., Woelbern, T., and Kehr, H. 1993. High frequency (60-90 Hz) oscillations in primary visual cortex of awake monkey. NeuroReport 4, 243-246.
Engel, A. K., König, P., and Singer, W. 1991a. Direct physiological evidence for scene segmentation by temporal coding. Proc. Natl. Acad. Sci. USA 88, 9136-9140.
Engel, A. K., König, P., Kreiter, A. K., and Singer, W. 1991b. Interhemispheric synchronization of oscillatory neural responses in cat visual cortex. Science 252, 1177-1179.
Engel, A. K., König, P., Kreiter, A. K., Schillen, T. B., and Singer, W. 1992. Temporal encoding in the visual cortex: New vistas on integration in the nervous system. Trends Neurosci. 15, 218-226.
Ernst, U., Pawelzik, K., and Geisel, T. 1995. Synchronization induced by temporal delays in pulse-coupled oscillators. Phys. Rev. Lett. 74, 1570-1573.
Gantmacher, F. R. 1959. Matrix Theory, Vol. 2. Chelsea, New York.
Gerstner, W. 1991. Associative memory in a network of "biological" neurons. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 84-90. Morgan Kaufmann, San Mateo, CA.
Gerstner, W. 1995. Time structure of the activity in neural network models. Phys. Rev. E 51, 738-758.
Gerstner, W., and van Hemmen, J. L. 1992. Associative memory in a network of "spiking" neurons. Network 3, 139-163.
Gerstner, W., and van Hemmen, J. L. 1993. Coherence and incoherence in a globally coupled ensemble of pulse-emitting units. Phys. Rev. Lett. 71, 312-315.
Gerstner, W., Ritz, R., and van Hemmen, J. L. 1993. A biologically motivated and analytically soluble model of collective oscillations in the cortex: I. Theory of weak locking. Biol. Cybern. 68, 363-371.
Golomb, D., Hansel, D., Shraiman, B., and Sompolinsky, H. 1992. Clustering in globally coupled phase oscillators. Phys. Rev. A 45, 3516-3530.
Gray, C. M. 1994. Synchronous oscillations in neuronal systems: Mechanisms and functions. J. Comput. Neurosci. 1, 11-38.
Gray, C. M., and Singer, W. 1989. Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. USA 86, 1698-1702.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature 338, 334-337.
Hale, J. K. 1977. Theory of Functional Differential Equations. Springer, New York.
Hansel, D., Mato, G., and Meunier, C. 1995. Synchronization in excitatory neural networks. Neural Comput. 7, 307-337.
Herz, A. V. M., and Hopfield, J. J. 1995. Earthquake cycles and neural reverberations: Collective oscillations in systems with pulse-coupled threshold elements. Phys. Rev. Lett. 75, 1222-1225.
Hopfield, J. J., and Herz, A. V. M. 1995. Rapid local synchronization of action potentials: Towards computation with coupled integrate-and-fire neurons. Proc. Natl. Acad. Sci. USA 92, 6655-6659.
Horn, R. A., and Johnson, C. R. 1985. Matrix Analysis. Cambridge University Press, Cambridge.
Kato, T. 1966. Perturbation Theory for Linear Operators. Springer, New York.
Kistler, W., Gerstner, W., and van Hemmen, J. L. 1996. Reduction of the Hodgkin-Huxley equations to an optimized threshold model. Submitted to Neural Comput.
König, P., and Schillen, T. B. 1991. Stimulus-dependent assembly formation of oscillatory responses: I. Synchronization. Neural Comput. 3, 155-166.
Kuramoto, Y. 1991. Collective synchronization of pulse-coupled oscillators and excitable units. Physica D 50, 15-30.
Lamperti, J. 1966. Probability. Benjamin, New York.
Lytton, W. W., and Sejnowski, T. J. 1991. Simulations of cortical pyramidal neurons synchronized by inhibitory interneurons. J. Neurophysiol. 66, 1059-1079.
von der Malsburg, C. 1994. The correlation theory of brain function. In Models of Neural Networks II, E. Domany, J. L. van Hemmen, and K. Schulten, eds., pp. 95-119. Springer, New York (reprint of the unpublished 1981 paper).
von der Malsburg, C., and Buhmann, J. 1992. Sensory segmentation with coupled neural oscillators. Biol. Cybern. 67, 233-242.
von der Malsburg, C., and Schneider, W. 1986. A neural cocktail-party processor. Biol. Cybern. 54, 29-40.
Mirollo, R. E., and Strogatz, S. H. 1990. Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math. 50, 1645-1662.
Niebur, E., Kammen, D. M., Koch, C., Rudermann, D., and Schuster, H. G. 1991. Phase-coupling in two dimensional networks of interacting oscillators. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 123-127. Morgan Kaufmann, San Mateo, CA.
Nischwitz, A., and Glünder, H. 1995. Local lateral inhibition: A key to spike synchronization? Biol. Cybern. 73, 389-400.
Ritz, R., Gerstner, W., and van Hemmen, J. L. 1994. Associative binding and segregation in a network of spiking neurons. In Models of Neural Networks II, E. Domany, J. L. van Hemmen, and K. Schulten, eds., pp. 175-219. Springer, New York.
Schillen, T. B., and König, P. 1991. Stimulus-dependent assembly formation of oscillatory responses: II. Desynchronization. Neural Comput. 3, 167-178.
Schuster, H. G., and Wagner, P. 1990a. A model for neuronal oscillations in the visual cortex: 1. Mean-field theory and derivation of the phase equations. Biol. Cybern. 64, 77-82.
Schuster, H. G., and Wagner, P. 1990b. A model for neuronal oscillations in the visual cortex: 2. Phase description and feature dependent synchronization. Biol. Cybern. 64, 83-85.
Singer, W. 1994. The role of synchrony in neocortical processing and synaptic plasticity. In Models of Neural Networks II, E. Domany, J. L. van Hemmen, and K. Schulten, eds., pp. 141-173. Springer, New York.
Skinner, F. K., Kopell, N., and Marder, E. 1994. Mechanisms for oscillation and frequency control in reciprocally inhibitory model neural networks. J. Comput. Neurosci. 1, 69-87.
Sompolinsky, H., Golomb, D., and Kleinfeld, D. 1990. Global processing of visual stimuli in a neural network of coupled oscillators. Proc. Natl. Acad. Sci. USA 87, 7200-7204.
Treves, A. 1993. Mean-field analysis of neuronal spike dynamics. Network 4, 259-284.
Tsodyks, M., Mitkov, I., and Sompolinsky, H. 1993. Patterns of synchrony in inhomogeneous networks of oscillators with pulse interaction. Phys. Rev. Lett. 71, 1281-1283.
Usher, M., Schuster, H. G., and Niebur, E. 1993. Dynamics of populations of integrate-and-fire neurons, partial synchronization, and memory. Neural Comput. 5, 570-586.
van Vreeswijk, C., Abbott, L. F., and Ermentrout, G. B. 1994. When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci. 1, 313-321.
Received August 31, 1995; accepted April 9, 1996.
Communicated by Mikhail Tsodyks
Neural Correlation via Random Connections Joshua Chover Department of Mathematics, University of Wisconsin-Madison, Madison, Wisconsin 53706 USA
A simple neural network is studied, which has sparse, random, plastic,
excitatory connections and also feedback loops between sensory cells and correlator cells. Time is limited to several discrete instants, where firing is synchronous. For parameter values within biological ranges, the system exhibits a capacity for associative recall, with a controlled amount of extraneous firing, following Hebb-like synaptic changes. 1 Introduction It is an old question: How do two sensory inputs, say B (for bell) and F (for food), become correlated with each other through repetition, so that a later appearance of one (e.g., B) will trigger physiological consequences due to the other? One way involves direct synaptic connections between the respective cells representing B and F (see, e.g., Willshaw et al. 1969). Taking into account many different possible pairs, B and F, an enormous number of direct connections would be required. A second (complementary) scheme is to have B cells and F cells synapse together onto third-party cells (C cells) where repeated simultaneous B and F inputs mutually enhance their respective synapses (see, e.g., Levy and Steward 1983; McNaughton and Morris 1987; and references therein). This scheme raises further problems: If an individual C cell correlates only one pair (e.g., B and F) of particular sensory inputs, then to accommodate many differing pairs, an enormous number of C cells would be necessary. On the other hand, if an individual C cell receives inputs from many sensory sources, when that cell does fire, how would a downstream cell recognize which sensory source (e.g., B) caused the C cell to fire; and if it did, how would it "know" to which other sensory-motor area (e.g., involving F) to carry the information? For the second scheme to work, it seems clear that many C cells must cooperate and that fairly specific feedback pathways from C cells to sensorimotor areas must exist.
Even so, can the job be done (as it evidently is) without the number of correlating C cells being of higher order of magnitude than the total number of sensory cells? The present model explores these questions in a simple setting of random (low-probability) connections between sensory cells and correlator cells. The sparsity of the connections allows an escape from otherwise
Neural Computation 8, 1711-1729 (1996) © 1996 Massachusetts Institute of Technology
combinatorial correlation requirements, while still allowing B to recall a large part of F. In the notation to be used, B and F, during their repeated joint presentation, will appear as subsets of a set A of all sensory cells then firing. The rephrased question will be: When B fires alone at a later time (as a prompt), what portion of A can be recalled? The recall will include some extraneous firing. To what extent can that be controlled? A second goal of the present model (continuing work in Chover 1994) is to explore the idea of recall as provided by a transient response to a stimulus (say, during a 20-40 msec period). This approach seems worthwhile since inputs to biological neural networks change rapidly (within 100 msec intervals). The study here thus centers on a few successive spatially distributed (synchronous) cell firing patterns, rather than on firing rates (as in the vast literature following Hopfield 1984). Transient recall differs fundamentally from the idea of memories as "attractors" to which a dynamic system converges through time (see discussion and references in Treves and Rolls 1991). By contrast, the dynamical aspect of the present model consists merely of one round trip through three "instants" of synchronous firing: from a sensory cell prompting pattern, to a pattern of firing among correlator cells, and back to a response pattern among sensory cells. The round trip is through synapses whose weights are supposed to have been altered by earlier exposure to the stimulus. No supervised learning is assumed for such synaptic changes, only a (modified Hebb-like) condition on synaptic activity. Synchronous firing (during instants) is assumed both for simplicity in analysis and in view of empirical evidence of near-synchronous volleys (see, e.g., Engel et al. 1992). With its focus on network connections, the model uses the simplest sort of "cell": sum-and-threshold (McCulloch-Pitts) devices.
The roles of complicated intracellular mechanisms and of inhibitory interneurons remain implicit. In a biological system, even transient recall will not be limited to three volleys of firing after a prompt. However, as a crude measure of success for such recall, two system performance characteristics are proposed here that involve only these early volleys and can be estimated in terms of system parameters. These characteristics are ratios, which essentially compare the amount of overlap between (sensory) stimulus and response firing patterns to the amount of original stimulus firing. Formulas that estimate the characteristics are presented in the Results section (Section 3), together with evaluations for a specific case of parameter values that have biological plausibility (Section 3.1). Although a search through parameter space is beyond the scope of this paper, the evaluated case suggests that good transient recall is possible if the prompt represents at least 45 to 50% of the original stimulus. A toy simulation (Fig. 4) is suggestive also. In summary, the present model looks through a brief time window at synchronous spatial firing patterns of sum-and-threshold cells whose
Neural Correlation via Random Connections
sparse connections have been previously altered by unsupervised (Hebb-like) learning.

2 The Model
The formal model consists of two sets of cells: a set S of M sensory cells s and a set C of N correlator cells c, related as follows:

1. Initially, connections (s→c) and feedback connections (c→s) are established, with probability p for any (s, c) pair, independently between pairs. These then stay fixed for the rest of the analysis, except for a final averaging over possible connection patterns. (Physiologically, the direct connections might be considered as having been established during early development, with the feedback connections following as a result of network activity. See Sections 4.1 and 4.7 for discussion.) When a cell fires (see below), it sends an action potential (AP) along its outward-bound connections.

2. Time proceeds in several discrete instants, t1, t2, and t3. This is for simplicity and also because of observed near-synchronicity in firing. Physiologically, an instant may also be interpreted as a short time window. For example, after synchronous firing of cells in the olfactory bulb, action potentials travel along the lateral olfactory tract, causing a rostral-to-caudal wave (or volley) of firings by pyramidal cells in the pyriform cortex. This volley lasts about 10 to 15 msec, before further pyramidal cells' firing due to feedback connections, and in the model would appear as a single instant of synchronous firing. (See Haberly 1984; Ketchum and Haberly 1993.)

3. Potentials: When s sends an AP to c through a connection s→c, a unit minipotential (represented by a 1) is contributed to the potential V_c of c for the next instant, V_c being the sum of such minipotentials. Similarly, when c sends an AP to s through c→s, a unit minipotential is contributed to the potential V_s of s for the next instant. (V_s may also receive contributions from outside the system via afferent connections.)

4. Firing: If firing in S at the preceding instant causes V_c ≥ θ (for c in C), then c will fire at the given instant.
If firing in C at the preceding instant causes V_s ≥ θ′ (for s in S), then s will fire at the given instant. (θ and θ′ are the firing thresholds.) Once a cell has fired, it can fire again during the next instant. (The modeling assumes that refractory and feedback inhibitory effects subside in the meantime. Long-latency feedforward inhibition is not incorporated here, although see Section 4.3 for discussion.)

5. A stimulus will be represented by a subset A of S, composed of cells that fire synchronously at a given instant (as a result of afferent input). Such firing subsets will be the focus of analysis here.
Joshua Chover
6. A "modified Hebb" learning condition: For purposes of the present model, the training details whereby a stimulus A is learned need not be specific. It is simply assumed that if a connection s→c transmitted an AP repeatedly during the learning process and if the corresponding postsynaptic potentials V_c were suitably elevated, V_c ≥ η (a plasticity threshold), then this connection became enhanced. Enhancement means that the connection now contributes minipotentials of size γ (γ > 1) to V_c in response to APs. (Notation: s →γ c, where γ is the forward gain.) The corresponding feedback connection also became enhanced, contributing γ′ minipotentials to V_s during further transmissions of APs. (Notation: c →γ′ s, where γ′ is the backward gain.)

7. It will be assumed that prior to the presentation (for learning) of a stimulus A to be studied, some of the connections in the network will already have become enhanced by the learning of previous stimuli. In particular, any given connection s→c will have become enhanced with probability ρ, independently between connections and independently of the newly firing subset A. (p and ρ are the only probability parameters of the model.)
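As an illustrative aside, the one-round-trip dynamics of items 1-7 can be sketched in a few lines of code. The sketch below is a toy, with made-up sizes, thresholds, and gains rather than the parameter values studied later; it only shows how sequel sets arise from the sum-and-threshold rule.

```python
import random

# Toy sketch of the model's one round trip S -> C -> S.
# All sizes, thresholds, and gains here are illustrative, not the paper's.
random.seed(0)
M, N = 200, 400            # sensory cells s and correlator cells c
p = 0.1                    # connection probability for each (s, c) pair
rho = 0.2                  # fraction of connections already enhanced
gamma = gamma_p = 3.0      # forward gain (s->c) and backward gain (c->s)
theta, theta_p = 8, 12     # firing thresholds for c cells and s cells

# Item 1: random connections; each s->c connection has a mirror c->s feedback.
conn = [[random.random() < p for _ in range(N)] for _ in range(M)]
# Item 7: some connections are already enhanced, independently with prob. rho.
enh = [[conn[s][c] and random.random() < rho for c in range(N)] for s in range(M)]

def forward(A):
    """Sequel set cA: c cells whose potential V_c reaches theta (item 4)."""
    return {c for c in range(N)
            if sum(gamma if enh[s][c] else 1 for s in A if conn[s][c]) >= theta}

def backward(D):
    """Sequel set sD: s cells whose potential V_s reaches theta_p."""
    return {s for s in range(M)
            if sum(gamma_p if enh[s][c] else 1 for c in D if conn[s][c]) >= theta_p}

A = set(random.sample(range(M), 40))   # a stimulus firing set (item 5)
scA = backward(forward(A))             # cells firing in S two instants later
print(len(A & scA), len(scA - A))      # overlap and extraneous firing
```

Enhanced connections contribute γ (or γ′) minipotentials and unenhanced ones contribute 1, exactly as in item 6.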
The following notation is used to help describe the simple three-step transient dynamics of the system: For a subset A of S, considered as the firing set at a given instant, cA will denote the sequel set of all cells c in C that will fire at the next instant because V_c ≥ θ. Similarly, in the feedback direction, for a firing subset D of C at a given instant, sD will denote the sequel set of all cells s in S that will fire at the next instant because V_s ≥ θ′. Thus for the round trip S → C → S, the set scA = s(cA) denotes the set of all cells in S that will fire two instants after the firing of the set A (assuming no new afferent inputs). (See Fig. 1.) After all possible synaptic enhancements due to learning of A, the sequel set of A will be denoted by c̃A. (If the plasticity threshold η < θ, c̃A will be larger than cA.) sc̃A will then denote the set of all cells firing in S two instants after A fires.

Picture an enormous population of mice, with individual mice representing various possible S-C configurations, with frequencies dictated by the randomness postulated in part 1 of the model. Given an outside stimulus that creates the same sensory firing set A in each mouse, the goal is to determine the consequences in a typical mouse (a concept inherent in the law of large numbers). For example, what percentage of A will be represented by the overlap A ∩ sc̃B, and how does the size of the extraneous set sc̃A − A compare with that of A?

Given that the set A has size m, the sizes |cA|, |scA|, and so on, of the sequel sets will vary from mouse to mouse and are represented for the typical mouse by their expected values E|cA|, E|scA|, and so on, their averages across the set of all mice according to the appropriate frequencies. Because of the randomness assumed in the model, the variables |cA|, |scA|, and so on have binomial distributions (when suitably conditioned); and when the parameters M, N, and m are large, normal approximations can be used to calculate the expected values. (For details see the Appendix.)

Figure 1: A firing set A and its sequels cA in C and scA in S again. Overlap set A ∩ scA, darkest shading; extraneous set scA − A, lightest shading.

Suppose that A, of size m, has been learned, so that synapses between A and c̃A are all enhanced with gain γ, and that a prompting subset B of A, of size δm (0 < δ ≤ 1), now fires. Two instants later the sequel sc̃B will overlap the omitted subset A − B by the fraction (1/(1 − δ)m)E|(A − B) ∩ sc̃B|, which provides a crude measure of the amount by which B recalls A and will be denoted RR[δ], the recall rate. There will also be an extraneous set whose expected size (1/m)E|sc̃A − A|, relative to A, will be denoted XR, the extraneous rate. This represents the maximum possible extraneous firing for all subsets B ⊂ A. Although crude, the ratios RR and XR are relatively easy to estimate in terms of parameters and are proposed here as tentative performance characteristics for the system. See Figure 2 for a flow diagram of the model.
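For explicit firing sets, the two ratios can be computed directly; the following minimal sketch uses hypothetical sets (not model output) purely to fix the definitions.

```python
# The two proposed performance characteristics, computed from explicit sets.
# The sets below are hypothetical; in the text RR and XR are expectations
# over the random-connection ensemble ("mice").
def recall_rate(A, B, s_cB):
    """RR: fraction of the omitted part A - B recovered by the sequel sc~B."""
    omitted = A - B
    return len(omitted & s_cB) / len(omitted)

def extraneous_rate(A, s_cA):
    """XR: extraneous firing sc~A - A, relative to the stimulus size |A|."""
    return len(s_cA - A) / len(A)

A = set(range(100))                     # stimulus of size m = 100
B = set(range(50))                      # a 50% prompt
s_cB = set(range(25, 95)) | {150, 151}  # hypothetical sequel of B
print(recall_rate(A, B, s_cB))          # -> 0.9 (45 of the 50 omitted cells)
print(extraneous_rate(A, s_cB))         # -> 0.02 (cells 150 and 151)
```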
3 Results
Estimates for quantities of interest are developed in the Appendix and listed for reference in Section 3.7. A general reader may skip the formulas. The estimates yield the following results.
Figure 2: A flow diagram for the model. The first three boxes refer to background events. The second three boxes (instants t1, t2, and t3) refer to a transient response to a prompt, the object of study. (For parameter definitions, see the text.)
3.1 An Illustrative Case. Suppose that M ≈ 10^6 and N ≈ 8 × 10^6 (e.g., S and part of C could be overlapping subsets of piriform cortex). Take p ≈ .005, giving about 5 × 10^3 possible s-c synapses onto an average c cell (consistent with piriform cortex; see Haberly 1985). Suppose that θ ≈ 118 and θ′ ≈ 100, measured in terms of unit minipotentials averaging ≈ 0.2 mV each (consistent with hippocampal measurements; see Traub and Miles 1991). Take η = θ; that is, the plasticity threshold is the same as the firing threshold (according to Hebb 1949); and suppose that γ = γ′ = 3.7, a gain consistent with hippocampal long-term potentiation (LTP) results involving already mature synapses (see Nicoll et al. 1988). Finally, take m ≈ 10^4, representing a 1% firing rate in S due to afferent stimuli; and let ρ = 0.2, so that 20% of synapses are already enhanced. Now suppose that a stimulus set A of size m fires repeatedly so that all connections from A to c̃A are enhanced and that subsequently a prompting subset B of size δm is presented. For the above parameter values, the right-hand curve in Figure 3 indicates the dependence of the recall rate RR[δ] on δ. There is a steep falloff around δ = 0.5, from nearly perfect recall for δ > 0.5 to no recall for low δ values. (The steepness is due to the relatively small variances of the size distributions involved.) The excess firing rate is XR = 0.11.
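As a consistency check, the parameter values above can be plugged into the normal-approximation formulas of Section 3.7 and the Appendix. The sketch below does only that; the standard normal tail ψ is computed from the complementary error function, and small discrepancies from the quoted figures are expected (see Section 4.10 on approximation errors).

```python
import math

def psi(x):
    """Upper tail of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2))

# Section 3.1 parameter values.
N = 8e6
p, rho = 0.005, 0.2
gamma = 3.7
theta = eta = 118.0
m = 1e4

mu = (gamma * rho + 1 - rho) * p * m                   # mean of V_c (= 77.0)
sigma = math.sqrt((gamma**2 * rho + 1 - rho) * p * m)  # s.d. of V_c (~ 13.3)
beta_t = (eta - mu) / sigma                            # threshold in s.d. units
EcA = N * psi(beta_t)                                  # expected |c~A| (~ 8000 cells)
print(mu, sigma, beta_t, EcA)
```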
Neural Correlation via Random Connections
1717
Figure 3: Recall rate RR[δ] = (1/(1 − δ)m)E|(A − B) ∩ sc̃B| versus the fraction δ = (1/m)|B| of the original set A presented. Left-hand data set for enhanced synapse density ρ = 0.15 (and thresholds θ = η = 107). Right-hand data set for ρ = 0.2 (and θ = η = 118). Other parameters as in the text (Section 3.1).

For comparison, suppose that the enhancement fraction is reduced to ρ = 0.15. This necessitates lower C thresholds, say, θ = η = 107; otherwise C cells would hardly fire at all (see Section 4.3 for discussion). With these changes the dependence of RR on δ is indicated by the left-hand curve in Figure 3, with falloff near a lower value, δ = 0.45. The corresponding XR is 0.10. The number k of independent, previously learned stimuli necessary to have enhanced synapses to the level ρ = 0.2 exceeds 21,750; for the level ρ = 0.15, k > 14,208. (See Section 3.4.)

3.2 Signal versus Noise. In general, parameter values can be chosen to make the recall rate RR[δ] close to one and simultaneously to have XR close to zero. An algorithm for determining such parameter configurations is indicated in Section 3.6. (As an extreme example, in the case in Section 3.1 (with θ = 118 and ρ = 0.2) the value of RR[0.45] could be raised from 0.015 to 0.897 and XR could be lowered from 0.11 to 0.076 by increasing N to 3.05 × 10^9 and θ′ to a nonphysiological value of 28,814.) The recall curve RR[δ] will still have a steep falloff as in Figure 3.
3.3 Sensitivity to Parameters. The expected sizes and recall rates are sensitive to small percentage changes in parameter values because most of the set sizes have standard deviations (from mouse to mouse) that are very small compared to their means. For example, a 5% increase in m = |A| in the case in Section 3.1 will lead to a sequel with 94% of the cells outside A firing extraneously. However, such sensitivity must be distinguished from the robustness of the basic setup. See the discussion in Section 4.2.
3.4 Load. The precise nature of the recalled sequel A ∩ sc̃B for a set A prompted by a subset B of A is, of course, determined by the basic synaptic connections and limited by the correlation set c̃A, even when A-c̃A synapses are all enhanced by previous learning. In such recall, the "spurious" effects caused by feedback from c̃A to s-cells of all other learned sets (A_i) are limited by the basic configuration of connections (e.g., to at most 11% beyond the size of A in the case in Section 3.1). What maximal number (k) of learned sets A_i can thus be registered in the system to produce a density ρ of enhanced connections? For given ρ and given tolerances on RR[δ] and XR, k will be a measure of the network's (present) load. Supposing that the k sets A_i were chosen independently of each other and randomly from among all subsets of size m in S, one gets a lower bound estimate for the load (equation 3.4 in Section 3.7). The estimate varies directly with ρ and inversely with the firing rate (m/M) for A in S and the firing rate for c̃A in C (which latter also depends on ρ). (In the case in Section 3.1 with θ = η = 118 and ρ = .2, k > 21,750.) (See equations 3.4 and A.5 for a more exact relationship.)

For given k, how much overlap will there be among the independently chosen A_i's? Choose a subset size b = |B|, with 3m²/M ≤ b < m; and consider the event that some subset B in S, of size b, will be contained in two or more of the A_i's. This event will have a probability Q[k, b], which is negligible when k < (Mb/3m²)^(b/2). The result comes from an upper bound for Q[k, b], which is developed in the Appendix (see equation A.6). For parameters as in Section 3.1, Q[k, b] is negligible even if B, considered as a possible prompting set, is so small (say, b = (0.1)m) as to have negligible sequel sets. Thus, the A_i's will scarcely overlap if chosen independently. Moreover, their sequel sets sc̃A_i will have small overlaps also (as can be seen by choosing B = A_i ∩ A_j and considering A_i − B and A_j − B as prompting sets).
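For the Section 3.1 case the load bound can be evaluated numerically. The sketch below uses the normal-tail form of the bound (equation 3.4); the result lands near the quoted k > 21,750, with the exact figure depending on the precision of the tail evaluation.

```python
import math

def psi(x):
    """Upper tail of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2))

# Section 3.1 parameters.
M, N, m = 1e6, 8e6, 1e4
p, rho, gamma, eta = 0.005, 0.2, 3.7, 118.0

mu = (gamma * rho + 1 - rho) * p * m
sigma = math.sqrt((gamma**2 * rho + 1 - rho) * p * m)
rate_S = m / M                        # firing rate of A in S
rate_C = psi((eta - mu) / sigma)      # firing rate of c~A in C
k_bound = -math.log(1 - rho) / (rate_S * rate_C)
print(round(k_bound))                 # on the order of 21,000-22,000
```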
3.5 Variability of Performance. If each mouse processes all k of the sets A_i chosen in Section 3.4, how will it perform on average (over the A_i)? Suppose that each A_i induces gains γ and later is prompted (for each mouse) by a subset B_i of size δm. Then, for each mouse, there will be a corresponding recall rate (1/m(1 − δ))|(A_i − B_i) ∩ sc̃B_i|, and these can be averaged as

AVER = (1/k) Σ_{i=1}^{k} (1/m(1 − δ))|(A_i − B_i) ∩ sc̃B_i|.
When k is large, this average can have a small (mouse-to-mouse) variance and will thus be close to its mean, the typical value RR[δ]. For example, with parameters as in the case in Section 3.1, AVER will differ from the RR[δ] value given in Figure 3 by no more than 0.03, in at least 91% of the mice. (See equation A.8 in the Appendix, with d = (0.1)m.) That is, most mice do well with many (independently chosen) A_i's and B_i's (provided δ exceeds the falloff value). A similar result holds for the corresponding average (over A_i) of extraneous firing fractions (1/m)|sc̃A_i − A_i|,

AVEX = (1/k) Σ_{i=1}^{k} (1/m)|sc̃A_i − A_i|.
With parameters again as in the case in Section 3.1, AVEX differs from the typical XR value by no more than 0.03 for at least 97% of the mice (see equation A.9). The above estimates of the averages AVER and AVEX by their corresponding mean values RR and XR hold simultaneously in at least 88% of the mice.

3.6 A General Rule. In order for the system to be neither overloaded nor dead, the distance between average potential and threshold (in terms of standard deviations) must be within close bounds (e.g., ≤ 3). From this it follows that the population firing rate (m/M) in S must differ by only a moderate percentage (e.g., 17% for the case in Section 3.1) from the proportion of synapses needed to fire a receiving cell in C. (See equation 3.5.)
3.7 Technical Estimates. The approximations (≈) are calculated in terms of the function ψ[x], which gives the area under the graph of the standard normal density function φ to the right of the value x (see the Appendix). When M, N, and m = |A| are large and p is small, the expected size

E|cA| ≈ Nψ[β], where β = (θ − (γρ + 1 − ρ)pm) / ((γ²ρ + 1 − ρ)pm)^(1/2).

If A has been learned, so that synapses s →γ c between A and c̃A are enhanced with gain γ > 1, and if η ≤ θ, then

E|c̃A| ≈ Nψ[β̃], where β̃ = (η − (γρ + 1 − ρ)pm) / ((γ²ρ + 1 − ρ)pm)^(1/2),   (3.1)
and the extraneous rate XR := (1/m)E|sc̃A − A| can be approximated as

XR ≈ (M/m − 1) × ψ[(θ′ − (γ′ρ + 1 − ρ)pNψ[β̃]) / (((γ′)²ρ + 1 − ρ)pNψ[β̃])^(1/2)]   (3.2)

(see Appendix equation A.2). If a subset B ⊂ A of size δm (0 < δ < 1) now fires, then its sequel c̃B of firing c cells one instant later will have expected size Nq[δ], where q[δ], the firing probability for a single c cell, is a complicated quantity given by integrals in equation A.3. The recall rate RR[δ] := (1/(1 − δ)m)E|(A − B) ∩ sc̃B| can be approximated as

RR[δ] ≈ ψ[((θ′/γ′) − p̃q[δ]N) / (p̃q[δ]N)^(1/2)]   (3.3)

(see equation A.4), where p̃ is now the conditional probability for there to be a connection s→c (where s is in A − B), given that c is in c̃A; p̃ is evaluated in the Appendix (equation A.1). A lower bound for k in terms of ρ is given by

k > −ln[1 − ρ] / ((m/M)(E|c̃A|/N))   (3.4)

(see equation A.5). Note that to avoid over- or underload, the quantity β̃ in equation 3.1 must be bounded, say |β̃| ≤ 3, which yields

|η/((γρ + 1 − ρ)pM) − (m/M)| ≤ 3((γ²ρ + 1 − ρ)pm)^(1/2) / ((γρ + 1 − ρ)pM),   (3.5)

the result in Section 3.6.

3.8 Regions in Parameter Space. For small ε, ε′ > 0, one can set the recall rate RR[δ] at 1 − ε and the extraneous firing rate XR at ε′ in equations 3.3 and 3.2 to get relationships

(θ′ − γ′p̃q[δ]N) / (((γ′)²p̃q[δ]N)^(1/2)) = ψ⁻¹[1 − ε]   (3.6)

and

(θ′ − (γ′ρ + 1 − ρ)pNψ[β̃]) / (((γ′)²ρ + 1 − ρ)pNψ[β̃])^(1/2) = ψ⁻¹[ε′/((M/m) − 1)],   (3.7)

which can then be solved for several of the parameters in terms of the others.
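Equation 3.2 is easy to package as a function. Evaluated at the Section 3.1 parameters it returns an extraneous rate close to the quoted XR = 0.11; exact agreement is not expected, given the tail-approximation caveats of Section 4.10.

```python
import math

def psi(x):
    """Upper tail of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def xr_estimate(M, N, m, p, rho, gamma, gamma_p, eta, theta_p):
    """Extraneous rate XR of equation 3.2 (normal approximation)."""
    mu = (gamma * rho + 1 - rho) * p * m
    sigma = math.sqrt((gamma**2 * rho + 1 - rho) * p * m)
    EcA = N * psi((eta - mu) / sigma)              # expected |c~A|
    mean_Vs = (gamma_p * rho + 1 - rho) * p * EcA  # mean of V_s for s not in A
    sd_Vs = math.sqrt((gamma_p**2 * rho + 1 - rho) * p * EcA)
    return (M / m - 1) * psi((theta_p - mean_Vs) / sd_Vs)

# Section 3.1 case: M, N, m, p, rho, gamma, gamma', eta, theta'.
xr = xr_estimate(1e6, 8e6, 1e4, 0.005, 0.2, 3.7, 3.7, 118.0, 100.0)
print(xr)   # roughly 0.1
```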
Also, given a range of allowed RR and XR values (i.e., of allowed ε, ε′), equations 3.6 and 3.7, together with A.1 and A.3, then define corresponding regions in parameter space. A detailed analysis of such regions is beyond the scope of this article, but various features are evident. For example, an increase in s cell firing threshold θ′ can be compensated by an increase in feedback gain γ′, or by an increase in the size (N) of C, or by an increase in the average number (pN) of synapses per c cell. Similarly, an increase in stimulus size m can be compensated by an increase in the c cell firing threshold θ, or by a decrease in feedforward gain γ, or by a decrease in pN.
3.9 Further Stochastic Interpretations. For simplicity the stimulus size in the model has been fixed at m. Alternatively, it could be supposed that stimuli A are chosen from an ensemble created by having each s cell fire with probability α independently of other s cells, so that m = αM now represents an average stimulus size. The resulting change in the membrane potential V_c and subsequent quantities leading to RR and XR can be shown to be negligible when p << 1 (see Chover 1994, eq. A3.3). Similarly, the parameter p has been defined as the probability of a connection s→c; but with a negligible change in results, it can be reinterpreted as the probability of a "currently working" connection. Thus, for example, the average number of actual synapses could be 25% greater than in the case in Section 3.1, with each synapse having a 25% failure probability; and the values in Figure 3 would remain essentially unchanged.

3.10 Simulation. Several simulations were done for a toy network, with parameters M = 300, N = 1700, m = 101, p = 0.1, θ = η = 28, θ′ = 20, γ = γ′ = 4, ρ = 0.2, and δ = 0.5, with stimuli chosen randomly and independently of choices for connections and enhancements. Figure 4 displays one such set A and a sequel sc̃B. F's indicate firing cells and o's indicate silent cells; the left half of each column indicates F's in A; and the right half indicates F's in sc̃B. The prompting set B was obtained by replacing all F's by o's in the lower half of the A pattern. The actual recall and extraneous rates here are |(A − B) ∩ sc̃B|/|A − B| = 0.92 (much better than predicted) and |sc̃A − A|/|A| = .21; k > 15.
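The toy network is small enough to rerun directly. The sketch below idealizes learning as enhancing every existing connection from A to its correlation set c̃A (the training details are left unspecified in the model) and then measures the actual recall and extraneous rates for one random configuration; as with the reported simulations, results vary from run to run.

```python
import random

# Monte Carlo version of the Section 3.10 toy network (one "mouse").
random.seed(1)
M, N, m = 300, 1700, 101
p, rho, delta = 0.1, 0.2, 0.5
theta = eta = 28
theta_p = 20
gamma = gamma_p = 4

conn = [[random.random() < p for _ in range(N)] for _ in range(M)]
enh = [[conn[s][c] and random.random() < rho for c in range(N)] for s in range(M)]

def fire_C(A):
    return {c for c in range(N)
            if sum(gamma if enh[s][c] else 1 for s in A if conn[s][c]) >= theta}

def fire_S(D):
    return {s for s in range(M)
            if sum(gamma_p if enh[s][c] else 1 for c in D if conn[s][c]) >= theta_p}

A = set(random.sample(range(M), m))
cA_t = fire_C(A)                  # correlation set c~A (here eta == theta)
for s in A:                       # idealized learning: enhance all A -> c~A
    for c in cA_t:
        if conn[s][c]:
            enh[s][c] = True

B = set(random.sample(sorted(A), int(delta * m)))   # the prompting subset
scB = fire_S(fire_C(B))
rr = len((A - B) & scB) / len(A - B)        # actual recall rate
xr = len(fire_S(fire_C(A)) - A) / len(A)    # actual extraneous rate
print(rr, xr)
```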
4 Discussion and Conjectures

4.1 Overall. The results give evidence that a network (e.g., piriform cortex) with sparse divergent and convergent connections between stimulus (sensory) cells and correlating cells can provide associative recall from prompting subsets of moderate size and can do so in short time (two waves of activity). The sparse connections provide sufficient partial correlations between different parts of the stimulus while avoiding the
huge number of cells needed for complete correlations. Crucial to the success of the recall is the existence of feedback pathways from the correlator cells to the sensory cells. Without them, the stored correlations would be lost. The feedback pathways need not be monosynaptic. For simplicity in the model, they were assumed to be one-to-one and established together with the direct connections. In a more general model, the feedback pathways could connect groups of correlator cells with groups of sensory cells. Such feedback architecture has been observed in many parts of the brain (see, e.g., Goldman-Rakic 1988).

Figure 4: A simulation result. F's represent firing cells; o's represent silent cells; the left-hand side of each column indicates F's in A; the right-hand side indicates F's in the sequel sc̃B. The prompting set B was obtained by replacing all F's by o's in the lower half of the A pattern. Parameters: M = 300, N = 1700, m = 101, p = 0.1, θ = η = 28, θ′ = 20, γ = γ′ = 4, ρ = 0.2, δ = 0.5.

4.2 Robustness. For simplicity in the model, many parameters were taken as basic. In a more detailed version, they themselves would appear as averages of varying quantities, with little change in the calculated results. Thus stimulus sets A could vary in size with average m, thresholds could vary from cell to cell with average value θ, and synapse enhancements could have a considerable distribution but with average γ. Since each such average extends over a huge number of instances, moderate numbers of changes in individual cells or synapses would scarcely affect the averages. In this sense the model is robust. As noted in Section 3.3, the results do depend sensitively on the parameters, that is, on these averages.
The conjecture here is that in actual networks many complex intracellular mechanisms, as well as interneurons (see Section 4.3), serve to keep key averages fairly constant.
4.3 Inhibition. Although not appearing explicitly in the model, inhibitory interneurons would play important roles in any realization. In the long term, they could serve to regulate the virtual value of firing thresholds, as needed for changing network characteristics. For example, an increase in the number (k) of learned stimuli increases the density (ρ) of enhanced synapses, which in turn will raise potentials (V_c, V_s). This requires a compensating (virtual) rise of firing thresholds (θ, θ′) or, equivalently, a graded introduction of inhibition, in order to avoid overfiring in the network (e.g., the case in Section 3.1). Interneurons can provide such graded inhibition if they themselves receive a sampling of enhanced and unenhanced inputs and if among the interneurons there is a distribution of firing thresholds. Thus, an increased amount of input to interneurons will cause a greater proportion of them to fire and proportionately increase inhibition in correlator (C) or sensory (S) cells. In the short term, inhibitory interneurons can help respond to changing afferent inputs. At any instant, S cells may be receiving new or repeated afferent inputs plus feedback from correlator cells due to previous activity in S. All of these inputs contribute to the size m of the set A firing in S at the next instant. If the latter increases by more than a few percent over its favorable value with respect to the other parameters, S will be overloaded several instants later. Hence feedforward interneurons may be required to dampen afferent inputs or correlator inputs arriving to S cells, or feedback interneurons may be required to dampen the overall activity in S before much of it can arrive to C (see Section 4.4). Such dampings may, of course, affect the fidelity of a sequel sc̃B within a set A, as well as diminishing extraneous firing.
4.4 Recognition. How well an arriving stimulus correlates with some previously trained stimuli can be monitored at several levels of the system. If the new presentation induces the firing of a subset B of S, then, most grossly, recognition can be measured by cells that monitor the firing level in cB, such as suitably thresholded feedback interneurons interspersed among C cells and connected with them locally. More finely, recognition can be monitored by interneurons dispersed among S cells. These would measure the extent to which the sequel scB, when it arrives at S, is complemented by a repeated stimulation of the B cells (repeated but perhaps diminished by feedforward inhibition; see Section 4.3). Such monitoring interneurons can then either inhibit further reaction to the stimulus (e.g., if it is boring) or turn up neuromodulators to "focus" on the stimulus if it contains S cells representing importance.
4.5 Load, Capacity, and Iteration. The lower bound for load values (k) given in Section 3.4 depended on system parameter values and sizes of prompting subsets B and on a single S → C → S round trip. Those k bounds are not as high as the number of "memories" storable in an abstract Hopfield network (Hopfield 1984) of the same size. In the latter setup, however, memories are patterns approached through many iterations, as against a single iteration here. No attempt is made here to calculate "capacity" (maximal load) for the present system since that would involve allowable threshold changes (see Section 4.3) not explicit in the model. A second iteration can sometimes help climb the RR[δ] curves (see Fig. 3) in the present model (provided δ ≤ RR[δ]). (For example, if B is 49% of A and its sequel sc̃B intersects 75% of A, then the sequel of sc̃B will intersect almost 100% of A.) Thus good recall may occur within 100 msec. Although recall cannot be initiated by very small prompting sets B, note that B itself may refer to a multisensory stimulus.
4.6 The Plasticity Threshold. Examples calculated here have been for η = θ (Hebb's rule). Note, however, that taking η < θ would increase the size of c̃B, the number of cells firing to correlate B with A − B, and this would decrease the number N of C cells needed by the network to produce given recall rates. The condition η < θ requires that synaptic enhancements can take place without actual firing of the postsynaptic cell (see, e.g., Alonso et al. 1990). The penalty for an increase in |c̃B|, however, is a decrease in the capacity k.

4.7 Timing and Structure. A more realistic model would take account of a distribution of axon velocities, and internal cell time constants, so that the c̃B and sc̃B firing might be spread out or multiplexed among more time instants. More structure would appear spatially also, with S and C subdivided into components between which random connections might prevail and with plasticity thresholds governing local groups of synapses. (Also, a full treatment of the dynamics of the model would have to consider decay of synaptic enhancements together with reinforcements.)

4.8 Typical Behavior. The results in Section 3 are for a typical configuration of S-C connections (a typical mouse) or an average over randomly chosen firing sets A_i. It remains true, however, that for a given network one can find firing sets A for which recall will be difficult. (For example, if S = C with geographically local connections and if B is a narrowly confined portion of a widely scattered A, then recall could spread from B only through many iterations, if at all.) Natural systems may respond well only to stimuli that can be preprocessed to "match" the existing S-C configurations.
4.9 Extraneous Firing. Rather than being considered spurious, extraneous firing may be viewed as the key agent for associative mentation.

4.10 Approximation Errors. The calculations in the Appendix are based on normal approximations to binomial distributions, and the errors so incurred tend to zero as the basic size parameters M, N, and m increase. Computations by Jon Kane (not shown here) indicate that for suitable values of the parameters (satisfied by the case in Section 3.1), the convergence of right-hand binomial tails (such as appear in V_c and V_s) is monotone downward, even under certain conditioning. The finite tail sums, and ratios thereof, may thus exceed their corresponding normal approximations by 6 to 23% (see the comments on approximation in Chover 1994, Appendix). This may explain some especially good recall results in toy simulations such as that of Figure 4. In any event, the quantitative relationships developed here are primarily meant to be suggestive for further explorations.
Appendix: Calculations

For a fixed subset A of size m and a typical cell c, the potential V_c = V_c^u + V_c^e, where V_c^u sums contributions through unenhanced synapses and V_c^e through enhanced synapses. Since all summands are independent, V_c^u has a binomial distribution with mean μ1 = (1 − ρ)pm and variance σ1² ≈ (1 − ρ)pm (here, and below, terms in p² are omitted since p << 1). Similarly, γ⁻¹V_c^e is binomially distributed with mean μ2 = ρpm and variance σ2² ≈ ρpm, and is independent of V_c^u. Thus V_c has mean μ = (γρ + 1 − ρ)pm and variance σ² ≈ (γ²ρ + 1 − ρ)pm, and P[V_c ≥ θ] = ψ[β], where β := (θ − μ)/σ, ψ[β] := ∫_β^∞ φ(s)ds and φ(s) := (2π)^(−1/2) exp[−(1/2)s²]. The expected size E|cA| = ψ[β]N. Similarly (considering only the case η ≤ θ), the expected size E|c̃A| = ψ[β̃]N, where β̃ := (η − μ)/σ, since the condition for c to be in c̃A is that V_c ≥ η.

Regarding C → S feedback, for fixed s in A and c in c̃A, the conditional probabilities p̂ := P[s→c | V_c ≥ η] and p̃ := P[s→γc | s→c, V_c ≥ η] may be estimated as follows. Let e₁ := E[V_c^u | V_c ≥ η] and

e := E[V_c | V_c ≥ η] = (ψ[β̃])⁻¹ ∫_η^∞ s φ[(s − μ)/σ] σ⁻¹ ds
  = pm[γρ + (1 − ρ) + (φ[β̃]/ψ[β̃])((γ²ρ + 1 − ρ)/pm)^(1/2)].
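The truncated-mean step just used (E[V_c | V_c ≥ η] = μ + σφ[β̃]/ψ[β̃] under the normal approximation) can be checked by simulation; the parameter values below are illustrative only.

```python
import math
import random

# Monte Carlo check of the truncated normal mean used for e.
random.seed(2)
mu, sigma, eta = 77.0, 13.3, 100.0    # illustrative values

beta_t = (eta - mu) / sigma
phi = math.exp(-0.5 * beta_t ** 2) / math.sqrt(2 * math.pi)  # normal density
psi = 0.5 * math.erfc(beta_t / math.sqrt(2))                 # normal upper tail
predicted = mu + sigma * phi / psi    # closed-form conditional mean

samples = (random.gauss(mu, sigma) for _ in range(400000))
tail = [v for v in samples if v >= eta]
empirical = sum(tail) / len(tail)     # sample mean of the upper tail
print(predicted, empirical)
```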
Now mp̂ = mP[s→c | V_c ≥ η] = e₁ + γ⁻¹(e − e₁) and e − e₁ = γmP[s→c, s→γc | V_c ≥ η] for s in A. Hence

p̂ = P[s→c | V_c ≥ η] = m⁻¹(γ⁻¹e + (1 − γ⁻¹)e₁).   (A.1)

Also

p̃ = P[s→γc | s→c, V_c ≥ η] = γ⁻¹(e − e₁)/(e₁ + γ⁻¹(e − e₁)) = (e − e₁)/(e + (γ − 1)e₁),

the number of inputs to c through enhanced connections divided by the total number of inputs. Note that for a naive network (ρ = 0), mp̂ = mp(1 + (φ[β]/ψ[β])(pm)^(−1/2)), so that p̂ > p. However, when ρ > 0 it can happen that p̂ > p but p̃ < ρ, a situation that requires a feedback gain γ′ > 1 to favor signal (in RR) over noise (in XR) in the expressions below.

Suppose now that repeated presentation of A has enhanced all connections between A and c̃A. Because connections s→c are independent, Z := |c̃A| is binomially distributed with mean μ_Z = ψ[β̃]N and variance σ_Z² ≈ ψ[β̃]N (since ψ[β̃] << 1). For s in A, the potential V_s at s due to firing in c̃A is the sum of contributions through such (enhanced) connections c →γ′ s as exist between s and c in c̃A. Thus V_s has mean μ_s ≈ γ′p̃Z and variance σ_s² ≈ (γ′)²p̃Z (since p̃ << 1 also); and
P[V_s ≥ θ′] ≈ ∫ ψ[((θ′/γ′) − p̃z)/(p̃z)^(1/2)] φ[(z − ψ[β̃]N)/(ψ[β̃]N)^(1/2)] (ψ[β̃]N)^(−1/2) dz.

When σ_Z << μ_Z, the mass of the integral is concentrated near z = ψ[β̃]N, and

P[V_s ≥ θ′] ≈ ψ[((θ′/γ′) − p̃ψ[β̃]N)/(p̃ψ[β̃]N)^(1/2)],

which estimates (1/m)E|A ∩ sc̃A|. For fixed s not in A, the connections between c̃A and s are independent of the V_c ≥ η conditioning. Hence, the original p and ρ (rather than p̂ and p̃) apply to calculations for V_s = V_s^u + V_s^e, where now the superscripts indicate contributions to V_s through unenhanced and enhanced connections. V_s^u and (1/γ′)V_s^e are binomially distributed, each with random length Z and with respective parameters (1 − ρ)p and ρp. Approximating Z again by its mean,
XR
=
-
(l/T?i)E/s?A- A / = ( ( M - tt~),’tt~)P[V, 2 H’] (Mtti I
-
1)
x ( s [ ( H ‘ - (*,’/)
-1
-
(A4 ,,IF’I.[~]N)!’(((-.’)~,, + 1 - /,)~c’[,>]N)I’*].
Consider a possible firing set B ⊂ A of size |B| = δm. Then for fixed c in C, V_c = V_c^B + V_c^{A−B}, where superscripts indicate the sources of APs arriving at c due to firing in A at a given instant. The potential V_c^B can be decomposed further as V_c^B = V_c^{B1} + V_c^{B2}, where, as above, superscripts
Neural Correlation via Random Connections
1727
now distinguish transmission through unenhanced and enhanced connections, respectively. Suppose again that A has fired repeatedly, so that all connections between A and ΣA are enhanced. Suppose further that a new stimulus induces the set B as a firing set. For fixed c in C, the new potential at c will be V_c = γV_c^{B1} + V_c^{B2}. The criterion that c now be in the sequel ΣB is that (a) V_c^{B1} + V_c^{B2} + V_c^{A−B} ≥ η, the plasticity requirement for the change from V_c^{B1} to γV_c^{B1}, and (b) γV_c^{B1} + V_c^{B2} ≥ θ. (For η ≤ θ, (a) implies that ΣB ⊂ ΣA.) The variables V_c^{B1}, V_c^{B2}, and V_c^{A−B} have respective means μ_{B1} = (1 − β)pδm, μ_{B2} = γβpδm, μ_{A−B} = (γβ + 1 − β)p(1 − δ)m, and variances σ_{B1}² ≈ (1 − β)pδm, σ_{B2}² ≈ γ²βpδm, σ_{A−B}² ≈ (γ²β + 1 − β)p(1 − δ)m; and they are mutually independent. Hence the probability q[δ] that c be in ΣB is

q[δ] = P[V_c^{B1} + V_c^{B2} + V_c^{A−B} ≥ η, γV_c^{B1} + V_c^{B2} ≥ θ]
     = P[V_c^{B2} ≥ max(η − V_c^{B1} − V_c^{A−B}, θ − γV_c^{B1})]
     ≈ ∫ ds φ[s] Φ[h[s]] Φ[(θ − γμ_{B1} − γσ_{B1}s − μ_{B2})/σ_{B2}],   (A.3)
where h[s] = ((η − θ) + (γ − 1)(μ_{B1} + σ_{B1}s) − μ_{A−B})/σ_{A−B}. Then E|ΣB| ≈ q[δ]N, and the variable Z_B := |ΣB| is binomially distributed with mean μ_Z′ ≈ q[δ]N and variance σ_Z′² ≈ q[δ]N (when q << 1). The estimate for E|(A − B) ∩ ΣB| is similar to that for E|A ∩ ΣA|. The connections between A − B and ΣB are independent of those between B and ΣB, but are part of those between A and ΣA; thus the parameters p̂ and γ′ apply. Since all connections between A and ΣB ⊂ ΣA are enhanced, the potential V_s′ at a fixed s in A − B due to firing in ΣB will have mean μ_s′ = γ′p̂Z_B and variance (σ_s′)² ≈ (γ′)²p̂Z_B (when p̂ << 1). Again, when σ_Z′ << μ_Z′,

P[V_s′ ≥ θ′] ≈ Φ[((θ′/γ′) − p̂ q[δ]N)/(p̂ q[δ]N)^{1/2}],   (A.4)
which estimates RR[δ] = (1/(1 − δ)m)E|(A − B) ∩ ΣB|. Regarding the effects of multiple A_i, suppose that k stimulus subsets A_i (1 ≤ i ≤ k) were chosen randomly (with possible repetitions) from among the (M choose m) subsets of size m in S, and that connections were enhanced between A_i and ΣA_i for each i. The sizes |ΣA_i| would increase for later choices because of earlier enhancements. To get a conservative estimate k₀ of how large k must be to produce a probability β̄ that a given connection s→c be enhanced after all A_i are chosen, suppose that each ΣA_i is induced through connections that already have β̄ as their enhancement probability, for 1 ≤ i < k₀. Then the probability for s to be in A_i and c to be in ΣA_i is (m/M)ψ[β̄]. The connection s→c then has probability (1 − (m/M)ψ[β̄])^{k₀} = 1 − β̄ of remaining unenhanced through all k₀ choices. Since ln(1 − (m/M)ψ[β̄]) ≈ −(m/M)ψ[β̄],

k ≥ k₀ ≈ −ln(1 − β̄)/((m/M)ψ[β̄]).   (A.5)
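The approximation in A.5 can be compared with the exact value k₀ = ln(1 − β̄)/ln(1 − (m/M)ψ[β̄]) obtained without expanding the logarithm. A small sketch, with hypothetical values for m/M and ψ[β̄]:

```python
import math

def k0_exact(beta_bar, per_choice):
    # per_choice = (m/M) * psi[beta_bar]: probability that one randomly chosen
    # stimulus subset enhances a given connection s -> c.
    return math.log(1.0 - beta_bar) / math.log(1.0 - per_choice)

def k0_approx(beta_bar, per_choice):
    # Equation A.5, using ln(1 - x) ~ -x for small x.
    return -math.log(1.0 - beta_bar) / per_choice

per_choice = (50 / 5000) * 0.2   # hypothetical m/M = 0.01 and psi[beta_bar] = 0.2
print(k0_exact(0.5, per_choice), k0_approx(0.5, per_choice))
```

For per-choice probabilities of this order the two values differ by well under one percent, so the linearized form A.5 is an adequate (and slightly conservative) estimate.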
For a fixed subset B ⊂ S of size b < m and for fixed i < j, the probability P[B ⊂ A_i ∩ A_j] = ((M−b choose m−b)/(M choose m))². Since there are (M choose b) choices of B and (k choose 2) choices of i < j, the probability

P[some B with |B| = b lies in at least two A_i] < (k choose 2)(M choose b)((M−b choose m−b)/(M choose m))² < k²(m²/M)^b/(2(2πb)^{1/2}(b/e)^b)   (A.6)
(using Stirling's formula). Now choose a set B_i ⊂ A_i with |B_i| = δm for each i. In order to calculate variances such as Var[k⁻¹ Σᵢ ((1 − δ)m)⁻¹ |(A_i − B_i) ∩ ΣB_i|], let V_s^i (for s in A_i − B_i) be the potential at s due to firing in ΣB_i. For i ≠ j and s ≠ s′, V_s^i and V_{s′}^j are independent. Hence, for i < j,

E[|(A_i − B_i) ∩ ΣB_i| · |(A_j − B_j) ∩ ΣB_j|] = Σ_{s in A_i−B_i} Σ_{s′ in A_j−B_j} P[V_s^i ≥ θ′, V_{s′}^j ≥ θ′]
  ≈ Σ_{s} Σ_{s′} P[V_s^i ≥ θ′] P[V_{s′}^j ≥ θ′].   (A.7)

Since |A_i − B_i| = (1 − δ)m for each i, P[V_s^i ≥ θ′] ≈ RR[δ] for s in A_i − B_i. Suppose now that |A_i ∩ A_j| ≤ d for all i < j. Then subtraction of ((1 − δ)m RR[δ])² from the cross-moment in equation A.7 gives a bound on the covariance, whence

Var[k⁻¹ Σᵢ ((1 − δ)m)⁻¹ |(A_i − B_i) ∩ ΣB_i|] < (k⁻¹ + 2d RR[δ]) ((1 − δ)m)⁻² […].

Chebyshev's inequality now gives the corresponding probability bound on the deviation of the average from its mean. Similar arguments give the analogous bounds for the other averages.
References

Alonso, A., de Curtis, M., and Llinás, R. 1990. Postsynaptic Hebbian and non-Hebbian long-term potentiation of the synaptic efficacy in the entorhinal cortex in slices and in the isolated adult guinea pig brain. Proc. Natl. Acad. Sci. U.S.A. 87, 9280-9284.
Chover, J. 1994. Recall via transient neuronal firing. Neural Networks 7, 233-250.
Engel, A. K., König, P., Kreiter, A. K., Schillen, T. B., and Singer, W. 1992. Temporal coding in the visual cortex: New vistas on integration in the nervous system. TINS 15(6), 218-226.
Goldman-Rakic, P. S. 1988. Topography of cognition: Parallel distributed networks in primate association cortex. Annu. Rev. Neurosci. 11, 137-156.
Haberly, L. B. 1985. Neuronal circuitry in olfactory cortex: Anatomy and functional implications. Chem. Senses 10, 219-238.
Hebb, D. O. 1949. The Organization of Behavior. John Wiley, New York.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Ketchum, K. L., and Haberly, L. B. 1993. Membrane currents evoked by afferent fiber stimulation in rat piriform cortex. II. Analysis with a system model. J. Neurophysiol. 69(1), 261-281.
Levy, W. B., and Steward, O. 1983. Temporal contiguity requirements for long-term associative potentiation/depression in the hippocampus. Neuroscience 8(4), 791-797.
McNaughton, B. L., and Morris, R. G. M. 1987. Hippocampal synaptic enhancement and information storage within a distributed memory system. TINS 10(10), 408-415.
Nicoll, R. A., Kauer, J. A., and Malenka, R. C. 1988. The current excitement in long-term potentiation. Neuron 1, 97-103.
Traub, R. D., and Miles, R. 1991. Neuronal Networks of the Hippocampus. Cambridge University Press, Cambridge.
Treves, A., and Rolls, E. T. 1991. What determines the capacity of autoassociative memories in the brain? Network 2, 371-397.
Willshaw, D. J., Buneman, O. P., and Longuet-Higgins, H. C. 1969. Non-holographic associative memory. Nature 222, 960-962.
Received October 2, 1995; accepted March 14, 1996.
Communicated by Mikhail Tsodyks
Hebbian Learning of Context in Recurrent Neural Networks

Nicolas Brunel*
INFN, Sezione di Roma, Istituto di Fisica, Università di Roma I 'La Sapienza', P.le Aldo Moro 2, 00185 Rome, Italy

Single electrode recordings in the inferotemporal cortex of monkeys during delayed visual memory tasks provide evidence for attractor dynamics in the observed region. The persistent elevated delay activities could be internal representations of features of the learned visual stimuli shown to the monkey during training. When uncorrelated stimuli are presented during training in a fixed sequence, these experiments display significant correlations between the internal representations. Recently a simple model of attractor neural network has reproduced quantitatively the measured correlations. An underlying assumption of the model is that the synaptic matrix formed during the training phase contains in its efficacies information about the contiguity of persistent stimuli in the training sequence. We present here a simple unsupervised learning dynamics that produces such a synaptic matrix if sequences of stimuli are repeatedly presented to the network in a fixed order. The resulting matrix is then shown to convert temporal correlations during training into spatial correlations between attractors. The scenario is that, in the presence of selective delay activity, at the presentation of each stimulus, the activity distribution in the neural assembly contains information of both the current stimulus and the previous one (carried by the attractor). Thus the recurrent synaptic matrix can code not only for each of the stimuli presented to the network but also for their context. We combine the idea that for learning to be effective, synaptic modification should be stochastic, with the fact that attractors provide learnable information about two consecutive stimuli.
We calculate explicitly the probability distribution of synaptic efficacies as a function of training protocol, that is, the order in which stimuli are presented to the network. We then solve for the dynamics of a network composed of integrate-and-fire excitatory and inhibitory neurons with a matrix of synaptic collaterals resulting from the learning dynamics. The network has a stable spontaneous activity, and stable delay activity develops after a critical learning stage. The availability of a learning dynamics makes possible a number of experimental predictions for the dependence of the delay activity distributions and the

*Current address: LPS, Ecole Normale Supérieure, 24 rue Lhomond, 75231 Paris Cedex 05, France
Neural Computation 8,1677-1710 (1996) @ 1996 Massachusetts Institute of Technology
correlations between them, on the learning stage and the learning protocol. In particular it makes specific predictions for pair-associates delay experiments.
1 Introduction

1.1 Correlated Delay Activities: Experiment and Theory. In the past 20 years there has been a wealth of evidence for the existence of local reverberations of cell assemblies in the inferotemporal cortex (Fuster and Jervey 1981; Miyashita and Chang 1988; Miyashita 1988; Sakai and Miyashita 1991; Tanaka 1992), prefrontal cortex (Fuster 1973; Niki 1974; Goldman-Rakic 1987; Wilson et al. 1993), and other areas of primates during delayed visual memory tasks (for a review see Fuster 1995). Together with experimental data, models have been proposed to account for the persistent delay activities (Dehaene and Changeux 1989; Zipser et al. 1993; Griniasty et al. 1993), in which excitatory recurrent synapses store the information about the visual stimuli. The experiments of Miyashita (1988) on the activity in the inferotemporal (IT) cortex of monkeys trained to perform a delayed matching to sample task have disclosed significant correlations in the persistent activation of cells in the delay period following the presentation of uncorrelated stimuli, when they are presented during training in a fixed sequence. Theoretical studies (Griniasty et al. 1993; Amit et al. 1994; Brunel 1994) have demonstrated that attractor neural networks, which embed in their synaptic structure information about contiguous stimuli learned in a sequence, have correlated delay activities even though the learned stimuli are uncorrelated.¹ It may be worth pointing out that when stimuli arrive at IT, they may be uncorrelated because they have been so prepared or because they have been decorrelated on the way (Barlow 1961; Linsker 1989; Atick 1992). In the model networks, the delay activity provoked in the neural assembly by the presentation of a given learned stimulus is correlated with the delay activity corresponding to other stimuli until a separation of several stimuli in the training sequence, despite the fact that the synaptic matrix connects only consecutive stimuli in the sequence.
The appearance of such correlations between the different delay activities is a transcription, during the learning process, of temporal correlations in the training information into spatial (activity distribution) correlations of the internal representations of the different stimuli. The network therefore has a memory of the context of the presented stimuli. Some cognitive implications of this context sensitivity have been outlined in Amit (1995).

¹In these models an attractor is defined as the stable state reached by the system after the presentation of a stimulus, that is, the ensemble of persistent delay activities in the network.
The model simulated by Amit et al. (1994) consists of a network of integrate-and-fire neurons represented by their current-to-spike rate transduction function (Amit and Tsodyks 1991). Such neurons are taken to represent the excitatory neurons of the network, the pyramidal cells. It is in the synaptic matrix connecting these neurons that learning is manifested. The synaptic matrix, representing the training process, is constructed to represent the inclusion of the information about the contiguity of patterns in the training sequence, as in Griniasty et al. (1993). Inhibition is taken to have fixed synapses, and its role is to react in proportion to the mean level of activity in the excitatory network, so as to control the overall activity in the network. The delay activities are investigated by presenting to the neural module one of the uncorrelated stimuli as a set of afferent currents into a subset of the excitatory neurons. These currents are removed after a short time, and the network is allowed to follow the dynamics as governed by the feedback represented in the matrix of synaptic collaterals. Eventually the network arrives at a stationary distribution of spike rates, that is, an attractor. This is the delay activity distribution corresponding to the stimulus that excited the network. Simulations of the model (Amit et al. 1994) are in quantitative agreement with the experimental data of Miyashita (1988). The dynamics of the model has been solved analytically in simplified conditions (Brunel 1994). This makes possible the explicit calculation of the correlations between the internal representations as a function of the parameters of the model. The main parameters controlling these correlations are the strength of the inclusion of the contiguity between stimuli in the synaptic matrix, relative to the strength of the inclusion of the stimuli themselves, and the balance between recurrent excitatory and inhibitory synaptic efficacies. 
The analysis deduces the mean fraction of neurons activated by a given stimulus (coding level, or sparseness) in the observed region, from the experimental data of Miyashita (1988). This makes possible the calculation of the correlation coefficients, which are again in quantitative agreement with all the available experimental data (see Fig. 9 of Brunel 1994) and the simulations of Amit et al. (1994). These previous studies used a fixed, prearranged synaptic matrix. In Amit et al. (1994) and Brunel (1994), the matrix was chosen to be similar to the Willshaw matrix (Willshaw et al. 1969), with a limited number of synaptic states. Memory is coded exclusively in the excitatory-to-excitatory synapses. An important result (Amit et al. 1994) is that the correlations are rather insensitive to the particular matrix chosen, provided it is Hebbian and that it includes the memory of the contiguity between stimuli. What is missing in these studies is a plausible dynamic learning process leading to a synaptic matrix that incorporates information of the temporal context of the stimuli shown to the network. A simple Hebbian stochastic learning process has been discussed in Amit and Fusi (1994) and Amit and Brunel (1995), but such a learning process does not
lead to temporal correlations in the formed attractors. The problem of learning the temporal associations of stimuli is the issue addressed in the present study.

1.2 The Present Work. In this paper we discuss a possible scenario for learning in the presence of delay activity, which naturally leads to the inclusion of temporal correlations between stimuli in the synaptic matrix. The scenario is that first uncorrelated attractors are formed. An attractor then carries information from the stimulus that provoked it until the presentation of the next stimulus. This information allows for a simple synaptic mechanism to store the memory of the context of any stimulus. We study the case of a finite set of stimuli that are repeatedly shown to the network. In the simplified case in which every excitatory neuron in the network is activated by at most one stimulus (Brunel 1994), it is possible to calculate explicitly the probability distribution of every synaptic efficacy as a function of the learning procedure. If stimuli are shown repeatedly in a fixed order during learning, the resulting synaptic matrix is similar to the fixed matrix used in Amit et al. (1994) and Brunel (1994). Given the synaptic matrix, we solve for the neural dynamics of the attractor network as in Brunel (1994), when one of the stimuli is presented. The network we study is composed of a large number of excitatory and inhibitory integrate-and-fire neurons, described by the statistics of their afferent currents and their spike emission rates. The network represents a local module, similar to a cortical column, embedded in a much larger sea of neurons (the entire cortex). The module can be distinguished from the global network by two features: the high local excitatory connectivity and the range of inhibitory interactions (Braitenberg and Schüz 1991).
Such a network has a stable state of low activity in which all neurons have a spontaneous activity of the order of one to five spikes per second in a plausible region of parameters (Amit and Brunel 1996). Furthermore, when learning occurs in the local module, and the synaptic modifications are strong enough, a set of attractors correlated with the stimuli presented to the network develops. In each attractor a small subset of the excitatory neurons—the neurons activated by a particular stimulus—have elevated delay activities, on the order of 20 to 40 spikes per second. We chose to study both learning and retrieval dynamics in this network since the activity in its attractors is roughly in agreement with recorded data during visual memory experiments in both the inferotemporal and prefrontal cortex. When learning occurs in the present network, on repeated presentation of stimuli, uncorrelated attractors are initially formed. These attractors make possible the inclusion of temporal correlations between stimuli in the synaptic matrix. This in turn provokes significant correlations in the delay activities corresponding to stimuli that have been shown repeatedly contiguously to the network. Therefore the correlations between the internal representations of different stimuli reflect their context.
Using a plausible learning process, one reproduces the results found in the previous studies, which are in good agreement with experimental data (Miyashita 1988). This is not surprising since the synaptic matrix resulting from many presentations of the stimuli is quite similar to the matrix that was postulated in Amit et al. (1994) and Brunel (1994). One essential novelty is that the entire phenomenon takes place in the presence of stable spontaneous activity. The advantage of using a more realistic neural model is that neurons have both spontaneous and selective activity roughly in the range of the recorded data. The analysis of the learning dynamics made in this paper allows the prediction of:

• The evolution of the delay activities and the correlations between the internal representations during training, for a fixed training procedure.

• The dependence of the correlations on the training procedure.
The predictions of the theory are accessible to experiments as in Miyashita and Chang (1988), Miyashita (1988), and Sakai and Miyashita (1991). We focus the analysis on two particular cases:

1. Training with stimuli in a fixed sequence (as in Miyashita 1988).

2. Training with associated pairs (as in Sakai and Miyashita 1991). A set of stimuli is divided into pairs; stimuli in each pair are presented in fixed order; and pairs are presented at random.

We also show how it is possible to deal with intermediate cases, as when the sequence of stimuli is interspersed with random items. The paper is organized as follows. In Section 2 we present in detail the model network and its elements. In Section 3 we present a simple scenario of synaptic dynamics that incorporates both associative long-term potentiation (LTP) and long-term depression (LTD). Then we describe a typical protocol of a visual memory experiment in which a delay period always follows the presentation of a stimulus. We show that in this situation, the analog synaptic dynamics reduces to a stochastic process acting on a two-state synapse. We then study in detail which kind of synaptic transitions may occur, depending on whether there is selective delay activity following the presentation of a stimulus. In Section 4 we study the situation of a small set of stimuli repeatedly shown to the network. In this case we calculate explicitly the probability distribution of the synaptic efficacies of the network as a function of the learning stage and of the learning protocol. Then, in Section 5, we study the network dynamics and show the influence of the synaptic dynamics on the delay activity, which is stabilized by the network after the presentation of a learned stimulus. This allows the study of the structure of the delay activity distributions as a function of the learning stage and the learning protocol.
2 The Model Neurons
Each neuron in the network receives three types of inputs: from recurrent (collateral) excitatory connections from other neurons in the same network; from inhibitory neurons inside the network; and from excitatory neurons in other, unspecified, areas. The collateral connectivity in the network has no geometric structure; a neuron has equal probability (about 0.1) of having a synapse on any other neuron. Both excitatory and inhibitory neurons are leaky integrate-and-fire neurons described by the statistics of their input currents, which determine their firing rates (Amit and Brunel 1996). Each type of neuron is characterized by a threshold θ_α, a postspike hyperpolarization H_α, and an integration time constant τ_α, with α = E, I indicating whether the neuron is excitatory or inhibitory, respectively. A neuron i of type α receives a large number of afferent spikes per integration time (Amit and Brunel 1996), and hence a gaussian white noise input current of mean μ_i^α and standard deviation σ_i^α, through C_α synaptic contacts, which are divided into C_αE excitatory synapses and C_αI inhibitory ones. The synapses in the network are of four types, depending on all the possible types of pre- and postsynaptic neurons. For each synaptic type, the efficacies J_ij (i and j denote the post- and presynaptic neuron, respectively) are drawn randomly from the distribution P_αβ(J) (α and β denote the type of post- and presynaptic neuron, respectively). P_αβ has mean J_αβ and standard deviation J_αβΔ, where Δ represents the variability in the synaptic amplitude. A fraction x_α of the excitatory connections on a neuron of type α arrive from outside the network. The excitatory-to-excitatory connections are plastic: the distribution P_EE(J) specifies the distribution of excitatory-to-excitatory links before the learning stage. As we will see, learning will modify this synaptic distribution. The spike rate of excitatory neuron i is ν_i^E. The rate of inhibitory neuron i is ν_i^I.
The input currents from outside the column are described by a white noise with mean μ_α^ext and standard deviation σ_α^ext. The input currents are provoked, in the absence of a stimulus, by the background activity outside the network. In the presence of a stimulus, the input currents are the sum of the background input and the input provoked by that stimulus. We assume that the correlations between the spike emission times of different neurons in the network do not affect their spike rates significantly. Thus we consider the spike emission processes of different neurons in the network as uncorrelated. In this case the mean and variance of the input current to a neuron in the module are the sum of three independent contributions, coming from external excitatory, recurrent excitatory, and inhibitory currents (see Amit and Brunel 1996).
Figure 1: Current-to-frequency transduction function ν = φ(I, σ) for θ = 20 mV, H = 0, τ = 10 ms, τ₀ = 2 ms, and three values of the amplitude of the fluctuations of the currents: σ = 0 (full line), 2 mV (dashed line), and 5 mV (dotted line).
These currents are integrated by the membrane depolarization at the soma with a time constant τ_α. The firing rate of neuron i of type α is given by

ν_i^α = φ_α(μ_i^α, σ^α),

where

φ_α(μ, σ) = [τ₀ + τ_α √π ∫_{(H_α−μ)/σ}^{(θ_α−μ)/σ} e^{u²}(1 + erf(u)) du]⁻¹   (2.3)

is the transduction function (Ricciardi 1977), which depends on the absolute refractory period τ₀, the threshold θ_α, and the postspike hyperpolarization, or reset potential, H_α. The function φ is plotted as a function of I for three different values of σ in Figure 1. It shows that the fluctuations of the currents have a significant effect on the spike rates when the average current depolarizes the neuron below threshold. Note that the precise form of the transduction function, equation 2.3, is not necessary for the qualitative features of the behavior of the network. In the following we take θ_E = θ_I = 20 mV above the resting potential; H_E = H_I = 0; τ_E = 10 ms; τ_I = 2 ms; and τ₀ = 2 ms.
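A direct numerical transcription of this transduction function (our reconstruction; the trapezoidal rule and step count are arbitrary choices), using erfc(−u) = 1 + erf(u) so that the integrand stays representable at large |u|:

```python
import math

def phi(mu, sigma, theta=20.0, H=0.0, tau=0.010, tau0=0.002, n=2000):
    """Firing rate (1/s) for mean input mu and fluctuation amplitude sigma
    (both in mV), via the transduction integral of equation 2.3."""
    a = (H - mu) / sigma        # lower limit of integration
    b = (theta - mu) / sigma    # upper limit of integration
    # integrand exp(u**2) * (1 + erf(u)), written with erfc(-u) so that the
    # product remains finite when u is large and negative
    f = lambda u: math.exp(u * u) * math.erfc(-u)
    h = (b - a) / n
    integral = h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))
    return 1.0 / (tau0 + tau * math.sqrt(math.pi) * integral)
```

With the parameters above, phi(40.0, 2.0) lands near 112 s⁻¹, consistent with the zero-noise limit 1/(τ₀ + τ ln((μ − H)/(μ − θ))) and with the curves of Figure 1; below threshold, increasing σ raises the rate.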
The connectivity parameters are x_E = x_I = 0.5; C_EE = C_IE = 20000; and C_EI = C_II = 2000. The average synaptic efficacies are expressed by the amplitude of the (excitatory or inhibitory) postsynaptic potential provoked by a spike, and thus in units of the potential: J_EE = 0.04 mV; J_EI = J_II = 0.14 mV; J_IE = 0.05 mV. The synaptic variability is taken to be Δ = 1. The synaptic external input has mean μ^ext = 11 mV and RMS σ^ext = 0.9 mV into excitatory neurons, and μ^ext = 8.6 mV and RMS σ^ext = 1.6 mV into inhibitory neurons. These currents correspond to the activation of all the excitatory synapses coming from outside the network at a background rate of 3 s⁻¹. For these parameters, the network has a stable state of spontaneous activity in which excitatory neurons emit about 3 spikes per second, while inhibitory ones emit 4.2 spikes per second. Note that this set of parameters is in a biologically plausible region (Braitenberg and Schüz 1991; Komatsu et al. 1988; Mason et al. 1991). The excitatory-to-excitatory synaptic efficacy is slightly smaller than the reported range of unitary excitatory postsynaptic potentials (EPSPs) in the neocortex and hippocampus, but we have here a neuron that sums its inputs linearly. When the input is nonlinear, a larger number of EPSPs is necessary to reach threshold than for a linear input, so the effective synaptic efficacy would be smaller than the reported values in the case of a large number of inputs. In fact, the qualitative features to be discussed are fairly robust to small changes in the synaptic efficacies. If the inhibitory efficacies are weakened too much relative to the excitatory efficacies, the spontaneous activity state becomes unstable (Amit and Brunel 1996).

3 Learning Dynamics

3.1 Analog Short-Term Synaptic Dynamics. Excitatory-to-excitatory synapses in the network are plastic. Hebbian learning is modeled by a synaptic dynamics that incorporates both associative LTP and LTD (Amit and Brunel 1995):

τ_J dJ_ij(t)/dt = −J_ij(t) + J₀ + c_ij(t) + (J₁ − J₀)Θ(J_ij(t) − w_ij(t)),   (3.1)

where Θ is the Heaviside step function and w_ij(t) is a fluctuating threshold discussed below. It is basically an integrator with a time constant τ_J. The integrator has a structured source c_ij(t), representing Hebbian learning. This source is given in terms of the neural rates ν_i(t) and ν_j(t) of the two neurons connected by this synapse as

c_ij(t) = λ₊ ν_i(t)ν_j(t) − λ₋ [ν_i(t) + ν_j(t)].   (3.2)

λ± are positive parameters separating potentiation from depression. Their values are chosen so that when the rates of both neurons are high, c_ij > 0; if one is high and one is low, c_ij < 0; and if both are very low, c_ij is negligible.
The last term on the right-hand side of equation 3.1 is the "refresh" mechanism discussed in detail in Badoni et al. (1995). It represents one way of preventing the loss of memory due to the decay of the integrator when no source is present. If at any given moment the synaptic value J_ij(t) exceeds the fluctuating threshold w_ij(t), a refresh source turns on to drive the synapse to the high value J₁. If later the source vanishes, this synaptic value will remain above its threshold, and the efficacy J₁ will be stable indefinitely. On the other hand, if the instantaneous synaptic value is low, either because it started low or because it was high and the learning source was negative enough, the refresh source turns off, and in the absence of a source, that synapse decays to J₀. This is the other long-term, stable state of a synapse. The transition of a synapse from the lower stable state to the upper one is identified with LTP. The opposite transition is LTD. This type of learning is effective in the sense that it can be (and has been) implemented in a material device (Badoni et al. 1995). It also incorporates the experimentally characterized distinction between short-term synaptic plasticity, represented by the analog dynamics driven by the source c_ij in equation 3.1, and long-term changes, represented by the stable synaptic states J₁ and J₀ separated by the threshold (see, e.g., Bliss and Collingridge 1993). The threshold is taken to be fluctuating to make the learning process more realistic. Here we have chosen to put noise on the threshold, but we could have chosen a fluctuating source c_ij, whose average would be the right-hand side of equation 3.2. Interestingly enough, it has been shown that when synaptic transitions are stochastic, the capacity of the network is enhanced with respect to deterministic transitions (Amit and Fusi 1994), though learning will be slower. As a consequence, in the absence of the source term, each synapse has two asymptotically stable values, J₀ and J₁. We further assume that the fluctuations of the threshold are limited to an interval [J₀ + θ₊, J₁ − θ₋]. The fluctuating threshold therefore defines a potentiation threshold θ₊ such that if the synaptic value is initially low, there is a finite transition probability J₀ → J₁ when the source c_ij > θ₊, and a depression threshold θ₋ such that if the synaptic value is initially high, there is a finite transition probability J₁ → J₀ when c_ij < −θ₋. These thresholds are such that J₀ < J₀ + θ₊ < J₁ − θ₋ < J₁. In Figure 2 we illustrate two examples of the evolution of the synaptic efficacy on presentation of a stimulus. In both cases the synaptic efficacy is initially at J₀ and the source term c_ij is higher than the threshold θ₊. In the upper part of the figure, the synaptic efficacy does not cross the fluctuating threshold and decays to its low stable value after the stimulus is removed. In the lower part of the figure, the synaptic efficacy crosses the threshold and is driven to the high state J₁, which is stable in the absence of a stimulus.
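A minimal Euler sketch of the two-state behavior of equation 3.1 (hypothetical time constant and timing; efficacy values from Figure 2), with the source held constant during a stimulus window instead of the rate-dependent c_ij of equation 3.2, and the threshold w(t) redrawn uniformly in [J₀ + θ₊, J₁ − θ₋] at every step:

```python
import random

def run_synapse(c_stim, J0=0.04, J1=0.15, theta_plus=0.04, theta_minus=0.04,
                tau_J=0.020, dt=0.001, t_on=0.05, t_off=0.15, t_end=0.40,
                seed=0):
    """Euler integration of eq. 3.1 for a single synapse: the Hebbian source
    equals c_stim inside the stimulus window [t_on, t_off) and 0 outside;
    the refresh term (J1 - J0) is active whenever J exceeds the fluctuating
    threshold w(t)."""
    random.seed(seed)
    J = J0
    for k in range(int(t_end / dt)):
        t = k * dt
        c = c_stim if t_on <= t < t_off else 0.0
        w = random.uniform(J0 + theta_plus, J1 - theta_minus)  # fluctuating threshold
        refresh = (J1 - J0) if J > w else 0.0
        J += (dt / tau_J) * (-(J - J0) + c + refresh)
    return J

ltp = run_synapse(0.12)   # strong source: J crosses w and is captured by J1
flat = run_synapse(0.0)   # no source: J never leaves its low stable value
print(ltp, flat)
```

With a strong source the efficacy crosses the fluctuating threshold and relaxes to the high stable value J₁ after the stimulus is removed; with no source it stays at J₀, reproducing the two outcomes illustrated in Figure 2.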
3.2 Learning Protocol and External Currents. The schematic learning protocol we model is as follows. The stimuli shown to the network
Figure 2: Analog synaptic dynamics. Synaptic efficacy (solid line) initially at J₀. An external stimulus imposes c_ij > θ₊ during the interval 50 < t < 150. In the upper part of the figure, the synapse does not cross the fluctuating threshold (long-dashed line) and remains in its low state J₀. In the lower part of the figure, the synapse crosses the fluctuating threshold and makes a transition toward the high state J₁. Parameters: J₀ = 0.04 mV; J₁ = 0.15 mV; θ₊ = 0.04 mV; θ₋ = 0.04 mV (short-dashed lines).
are labeled by μ = 1, . . ., p. During the presentation of stimulus μ, the mean external current received by an excitatory neuron i is incremented selectively, by an amount proportional to ξ_i^μ, where ξ_i^μ = 1, 0 symbolically indicates whether cell i is activated by stimulus μ. In the absence of a stimulus, the excitatory afferent is just the spontaneous noise. Inhibitory neurons are not activated by the stimulus. The presentation of a stimulus is followed by a delay period of length t_d, in which the selective part of the current is removed. Therefore, a typical experiment can be schematized by Figure 3, in which presentation and delay intervals are kept fixed. The duration of each presentation, t_p, is taken to be much longer than the neuronal time constants τ_E,I; thus t_p >> 10 ms. Note that in a delayed matching-to-sample (DMS) experiment, the sequence of stimuli is an alternating sequence of sample and match stimuli. The match stimulus is typically taken to be equal to the sample stimulus with 50% probability, and to another randomly chosen stimulus otherwise. The learning protocol specifies how the sequence of sample stimuli is presented (see below). To simplify the discussion, we suppose that when stimulus μ is shown, the activated excitatory neurons go rapidly to a steady-state rate

ν_i = (V − ν_s) ξ_i^μ + ν_s,

Hebbian Learning of Context in Recurrent Neural Networks

Figure 3: Typical learning protocol in a visual memory experiment. Stimuli are presented in a sequence, with a delay between two successive presentations. The line represents schematically the level of external currents to the local network.
where ν_s is the spontaneous rate of excitatory neurons, during presentation of the stimulus. When neuron i is activated by a stimulus, it goes to a high activity state V >> ν_s; if it is not activated, it stays at spontaneous activity levels. When the stimulus is removed, two possibilities may occur (Amit and Brunel 1995, 1996):
1. The stimulus is unfamiliar. The network goes rapidly into its uniform, unstructured, spontaneous activity state, ν_i = ν_s.

2. The stimulus is familiar. The activity of the neurons activated during the presentation of the stimulus persists during the delay period, but with lower rates than during the presentation:

ν_i = (ν̄ − ν_s) ξ_i^μ + ν_s,   where V > ν̄ > ν_s.
Following the delay period, when the next stimulus is presented, there is a short interval in which both the neurons active in the delay period and the neurons activated by the next stimulus are active. Later, inhibition turns off the activity of the neurons that participated in the attractor during the delay period, leaving active only those neurons tagged by the new stimulus (Amit and Brunel 1995). This transient interval is assumed to be short compared to the presentation time; it will typically be of the order of the integration time τ_E of an excitatory neuron. We further assume that the delay period is much longer than the synaptic integration time constant τ_c. In this case, in the absence of delay activity, all synapses in the network will have decayed by the end of the delay period to their asymptotic values, J_0 or J_1.

3.3 Synaptic Transitions: No Delay Activity Prior to Presentation. We first consider the case in which there was no delay activity before the presentation of the stimulus. When a stimulus is presented, one of eight situations may occur at a given synaptic site (i, j). For each of the two possible stable values of the synapse (J_0, J_1) there are four pairs of activation states of the pre- and postsynaptic neurons by the stimulus: (V, V), (V, 0), (0, V), and (0, 0) (where the low spontaneous rate is represented by 0). Because we assume a symmetric role for pre- and postsynaptic neurons, cases (V, 0) and (0, V) are equivalent, and we consider only the case (V, 0). The number of situations is thus reduced to six:
• For J_ij = J_0 and (ν_i, ν_j) = (V, V): If the integrated synaptic source (equation 3.1) over the duration of the presentation t_p reaches the potentiation threshold, there is a probability p_+ of activation of the refresh source, causing a transition of the synaptic value to J_1 in the delay period. LTP has occurred. This probability depends on c_+ = λ_+ V² − 2 λ_− V, on θ_+, and on the ratio t_p/τ_c.

• For J_ij = J_1 and (ν_i, ν_j) = (V, 0) or (0, V): If the integrated synaptic source over the duration of the presentation reaches the depression threshold, the refresh source will be turned off with probability p_−, and J_ij goes to J_0, its low value, in the subsequent delay period. This transition represents LTD. p_− depends on c_− = λ_−(V + ν_s) − λ_+ V ν_s, on θ_−, and on the ratio t_p/τ_c.

In all other cases, no transitions can occur.
Therefore, in the absence of delay activity and when the presentation duration is kept fixed, we can represent the synaptic dynamics by a discrete stochastic process: a random walk between the two synaptic stable states J_0 and J_1. This is a familiar situation (Amit and Fusi 1994; Amit and Brunel 1995), in which uncorrelated stimuli lead to uncorrelated attractors.

3.4 Synaptic Transitions: Delay Activity Prior to the Presentation. In contrast, when neural activity persists during the delay period, the synaptic dynamics depends not only on the activation of the pre- and postsynaptic neurons by the stimulus but also on the activation of these neurons during the previous delay period. There are now 32 possible situations, depending on whether J_ij is above or below threshold before the presentation and on the pair (ν_i, ν_j) during both the stimulus presentation and the previous delay period. Since the transient interval during which both the old delay activity and the new stimulus-related activity are present is short compared to the presentation interval, the probabilities p_+ and p_− will not be much affected by the previous delay activity in the situations described in Section 3.3, where LTP or LTD occurs due to stimulus presentation alone. A new LTP transition might occur, however. If before the presentation J_ij = J_0, and during the transient interval τ_E

(ν_i, ν_j) = (V, 0) during the delay period, (0, V) during the stimulus presentation,   (3.3)

or

(ν_i, ν_j) = (0, V) during the delay period, (V, 0) during the stimulus presentation,   (3.4)
and if the integrated source of the synaptic dynamics over τ_E crosses the potentiation threshold,

V (λ_+ ν − λ_−) (1 − exp(−τ_E/τ_c)) − λ_− ν > θ_+,
there is a probability αp_+ of activation of the refresh source, which will drive the synaptic efficacy to J_1 in the subsequent delay period. α is a function of the ratio τ_E/t_p and of ν/V. Typically, if the presentation duration is much longer than τ_E, α << 1. A similar situation would occur also if (ν_i, ν_j) = (ν, ν) in the delay. However, in this case the probability of LTP during the previous stimulus presentation is much larger than the one during the short transient period, and the latter can be neglected. The only new situation leading to LTP in the presence of delay activity is the one described in equations 3.3 and 3.4. We will see that this has important consequences for the synaptic matrix
Figure 4: Regions where synaptic transitions occur in the (ν_i, ν_j) plane. Frequencies are indicated in spikes per second. Above the dashed line, LTP transitions occur due to presynaptic delay activity and postsynaptic activation by the new stimulus. In this case ν_i is the delay activity prior to presentation of the stimulus.
in the case of significant temporal correlations in the training sequence of stimuli, which in turn will significantly affect the neural dynamics. To conclude, we give a numerical example to illustrate the possible scenarios. We take the background synaptic efficacy J_0 = 0.04 mV and J_1 = 0.15 mV. The threshold for potentiation is θ_+ = 0.04 mV above J_0, and for depression it is θ_− = 0.04 mV below J_1. The neuronal time constant is τ_E = 10 ms. The analog synaptic time constant is taken to be equal to the neuronal time constant, τ_c = 10 ms. This is consistent with the fact that stimuli shown for times of the order of 100 ms can be learned, which implies that τ_c has to be shorter than 100 ms; otherwise, the analog synaptic value would not have time to reach the threshold w_ij. Note also that the results are not very sensitive to the precise value of τ_c, as long as it does not become too long compared to the neuronal time constant. The presentation duration is t_p = 200 ms. For λ_+ = 5.10 mV s² and λ_− = 4 · 10⁻³ mV s, Figure 4 shows, in the (ν_i, ν_j) plane, the regions where potentiation or depression are possible.
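Abstracted from the analog dynamics, the transition rules of Sections 3.3 and 3.4 amount to a two-state stochastic update per presentation. A minimal sketch follows; the pairing of activity patterns to transitions follows the text, while the particular encoding of states and rates is an illustrative choice:

```python
import random

# Two-state Markov sketch of the per-presentation synaptic transitions of
# Sections 3.3-3.4. Activities: "V" = driven by the stimulus, "nu" = delay
# activity overlapping the transient, "0" = spontaneous.
def update(state, pre, post, p_plus=0.2, p_minus=0.2, alpha=0.05):
    """Return the synapse's stable state ("J0" or "J1") after one presentation."""
    pair = {pre, post}
    if state == "J0":
        if pre == "V" and post == "V" and random.random() < p_plus:
            return "J1"                   # LTP (Section 3.3)
        if pair == {"V", "nu"} and random.random() < alpha * p_plus:
            return "J1"                   # LTP via delay activity (Section 3.4)
    elif state == "J1":
        if pair == {"V", "0"} and random.random() < p_minus:
            return "J0"                   # LTD (Section 3.3)
    return state                          # all other cases: no transition

random.seed(1)
# Repeated coactivation (V, V) eventually potentiates the synapse:
s = "J0"
for _ in range(100):
    s = update(s, "V", "V")
```

The symmetry between pre- and postsynaptic roles assumed in the text is reflected in the use of the unordered pair of activities.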
Figure 5: Schematic illustration of synaptic transitions in three situations: time evolution of the synaptic efficacy J_ij (lower curves), of the presynaptic activity ν_i, and of the postsynaptic activity ν_j. (a) Pre- and postsynaptic neurons activated by the stimulus, synapse initially low. (b) Presynaptic neuron silent during the stimulus, postsynaptic neuron activated, synapse initially high. (c) Presynaptic neuron activated during the stimulus, postsynaptic neuron active in the delay, synapse initially low. In all cases one can permute pre- and postsynaptic neurons, owing to the symmetry of the short-term analog learning dynamics.

The three situations leading to possible transitions are schematized in Figure 5. To conclude this section, we emphasize that one can imagine different scenarios for the occurrence of LTP when one neuron is active during the delay while the other is active during presentation of the next stimulus. For example, it would also occur naturally if the Hebbian source term of the synaptic dynamics described by equation 3.1 depended not on the instantaneous neural activities but rather on their average over some temporal window. In this section we have argued that in a simple and plausible short-term analog dynamics, this type of transition occurs
naturally. In the following sections we will no longer consider the short-term analog synaptic dynamics, but only the resulting stochastic process acting on the two stable synaptic states.

4 Training the Network with a Fixed Set of Stimuli

We consider a set E of a finite number p of stimuli. The initial distribution of excitatory-to-excitatory synaptic bonds is assumed uniform,

P(J_ij = J_1) = g(0)   for all (i, j),

where g(0) is the initial fraction of potentiated synapses. During training, the stimuli shown to the network are limited to the set E. The learning protocol defines the order in which the stimuli are presented to the network. We will now study the following training protocols:

A. Random sequence: At each presentation, the stimulus is chosen randomly out of E.

B. Fixed order: The stimuli are presented in a fixed cyclic order: 1, 2, . . ., p, 1, and so on. We also study the intermediate situation in which at each time step there is a probability x of showing a randomly chosen stimulus in E instead of the predetermined one. For x = 1 one recovers the case of the random sequence.

C. Random pairs: The stimuli in E are organized in p/2 pairs. Each stimulus μ has a paired associate μ̄. The pairs are selected at random. When a pair is chosen, both members are shown successively in a random order. We also study the intermediate situation in which at each time step there is a probability x of showing a randomly chosen stimulus instead of one of the paired associates. Again, for x = 1 the random sequence is recovered.

Protocol B is similar to the protocol of the experiment of Miyashita (1988). In that experiment the sample stimuli are shown in a fixed order, while the match stimuli are chosen to be the sample with probability 0.5, and a random different stimulus otherwise. Thus it would correspond to protocol B with a probability x = 0.5 of showing a random stimulus. Protocol C is similar to the protocol of Sakai and Miyashita (1991). In that experiment the sample is a randomly selected stimulus; then two match stimuli are shown, the paired associate and another randomly chosen stimulus.

We consider the case in which the coding level f is very small, so that fp < 1, but fC_EE, where C_EE is the excitatory-to-excitatory connectivity, is very large. Consider the neurons activated by a specific stimulus μ. A fraction exp[−f(p − 1)] ≈ 1 − f(p − 1) of these neurons is not activated by any other stimulus. Thus, when fp << 1, most selective neurons are activated by only one stimulus. We may therefore consider only these
neurons, and the network can be functionally divided into p + 1 sets of neurons. One set corresponds to the neurons that are not activated by any stimulus; it is denoted by F_0. The other sets correspond to the neurons activated by one of the p stimuli: F_μ is the population of cells activated when stimulus μ is presented, that is, F_μ = {i | ξ_i^μ = 1}. Next we classify the excitatory-to-excitatory synapses. There are four types of synaptic populations in the network:

1. Synapses that connect two neurons activated by the same stimulus. G_μμ is the population of all synapses from F_μ to itself, that is, {(i, j) | ξ_i^μ = 1, ξ_j^μ = 1}.

2. Synapses connecting two neurons activated by two different stimuli. G_μν is the population of synapses from F_ν to F_μ, that is, {(i, j) | ξ_i^μ = 1, ξ_j^ν = 1}.

3. Synapses connecting a neuron activated by a stimulus to a neuron not activated by any stimulus. G_μ0 is the population of synapses from F_0 to F_μ, that is, {(i, j) | ξ_i^μ = 1, ξ_j^ν = 0 for all ν}, and G_0μ is the population of synapses from F_μ to F_0, that is, {(i, j) | ξ_i^ν = 0 for all ν, ξ_j^μ = 1}.

4. Synapses connecting two neurons, neither of which is activated by any stimulus. G_00 is the population of synapses from F_0 to F_0, that is, {(i, j) | ξ_i^ν = ξ_j^ν = 0 for all ν}.
To calculate the probability distribution of the synaptic efficacies in each of these populations, as a function of the learning protocol and of the duration of training, we define two units of time. The first corresponds to the interval between two presentations; time in this unit will be referred to as t. The second measure of time, T = t/p, corresponds to the interval between two successive presentations of the same stimulus, for a fixed cyclic sequence as in protocol B. At a given time t, n_μ(t) is the number of times a given stimulus μ has been presented to the network, while n_μν(t) is the number of times stimulus ν has been presented immediately following the delay activity provoked by stimulus μ. The probability distribution of the efficacies in any population G_μν is completely characterized by the probability of a synapse's being potentiated,

g_μν = P(J_ij = J_1 | (i, j) ∈ G_μν),

since P(J_ij = J_0) = 1 − g_μν for (i, j) ∈ G_μν. The details of the derivation of these probabilities are given in the Appendix.

1. For a synapse in population G_μμ,

g_μμ = 1 − (1 − p_+)^(n_μ) [1 − g(0)],

where g(0) is the initial probability of finding a potentiated synapse. Thus, when n_μ, the number of presentations of stimulus μ, becomes large, we get g_μμ → 1; all synapses become potentiated.
2. For synapses in population G_μν, with μ ≠ ν, the distribution depends not only on n_μ, n_ν, and n_μν, but also on when the neighboring presentations occurred. There are two simple cases in which the distribution can be calculated. The first is when stimuli μ and ν always follow each other. In this case the learning protocol can be divided into two intervals. The first interval corresponds to the absence of delay activity after the presentation of a stimulus; after (n_μ, n_ν) presentations we have

g_μν = (1 − p_−)^(n_μ + n_ν) g(0),

gradually eliminating the potentiated interstimulus synapses contained in the initial distribution. In the second interval, delay activity has developed; when n_μν becomes large, we obtain (see the Appendix for details)

g_μν → α p_+ (1 − p_−) / [α p_+ (1 − p_−) + p_−].

The other limit case is when μ and ν are never presented contiguously. In this case the probability of the synapse's being potentiated is

g_μν = (1 − p_−)^(n_μ + n_ν) g(0),

and therefore vanishes when the number of presentations becomes very large. In the intermediate situation, when joint presentations occur but not systematically, we define the relative frequency of contiguous appearance of the two stimuli,

ρ_μν = 2 n_μν / (n_μ + n_ν).

When the number of presentations becomes very large at fixed ρ_μν, the probability of having a potentiated link goes to

g_μν → ρ_μν α p_+ (1 − p_−) / [ρ_μν α p_+ (1 − p_−) + p_− (2 − ρ_μν)].

3. For synapses in G_μ0 or G_0μ, one has

g_μ0 = g_0μ = (1 − p_−)^(n_μ) g(0),

and thus the probability of having a potentiated synapse goes to zero in the limit of many presentations of stimulus μ.

4. The last population is composed of synapses that never see activity in the learning process. These synapses remain unmodified. We will see that they do not play any role in the dynamics of the network.
We are now able to calculate the parameters g_μν for the learning protocols described at the beginning of this section. For each of these learning protocols, the probability of occurrence of any stimulus is the same, namely 1/p, where p is the number of stimuli. Thus it is convenient to express the parameters g_μν as functions of T = t/p. For G_μμ, G_μ0, G_0μ, and G_00 the distribution is independent of the learning protocol; in particular,

g_00(T) = g(0).

By contrast, the synaptic distributions in the populations G_μν for μ ≠ ν depend rather drastically on the learning protocol: g_μν depends not only on T but also on ρ_μν, the frequency of contiguous presentations of μ and ν connected by a delay activity. The expression for g_μν is

g_μν(T) = (1 − p_−)^((2 − ρ_μν) T) (1 − p_− − α p_+)^(ρ_μν T) g(0)
          + ρ_μν α p_+ (1 − p_−) / [ρ_μν α p_+ (1 − p_−) + p_− (2 − ρ_μν)]
            × [1 − (1 − p_− − α p_+)^(ρ_μν T) (1 − p_−)^((2 − ρ_μν) T)].
Recall that the dependence on the learning protocol arises only when persistent delay activity is present in the network. The next step is therefore to calculate the frequency of contiguous presentation ρ_μν for any pair of stimuli, starting from the time at which persistent delay activity became stable in the network. Since during training all stimuli are presented the same average number of times, delay activity appears at the same stage of the learning protocol for all stimuli. We also suppose p > 2.

Protocol A (random presentation sequence). For all μ ≠ ν one has

ρ_μν = 2 / (p − 1).

Every pair of stimuli has the same frequency of contiguous occurrence.

Protocol B (fixed presentation sequence). One has

ρ_μ,μ±1 = 1,   ρ_μν = 0 for all ν ≠ μ, μ ± 1,

since μ and μ ± 1 always appear contiguously.
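The frequencies ρ_μν can be checked empirically by generating presentation sequences and counting adjacent pairs. A sketch for protocols A and B follows; for protocol A it assumes (as the value 2/(p − 1) implies) that two successive presentations are never the same stimulus:

```python
import random

# Estimate rho_{mu nu} = 2 n_{mu nu} / (n_mu + n_nu) from a presentation
# sequence, counting contiguities in either order (symmetric roles).
def rho_matrix(seq, p):
    n = [0] * p
    n_pair = [[0] * p for _ in range(p)]
    for s in seq:
        n[s] += 1
    for a, b in zip(seq, seq[1:]):        # b follows the delay activity of a
        if a != b:
            n_pair[a][b] += 1
            n_pair[b][a] += 1
    return [[2 * n_pair[i][j] / (n[i] + n[j]) for j in range(p)]
            for i in range(p)]

def random_sequence(p, length, rng):
    """Protocol A: random stimuli, no immediate repeats (assumption)."""
    seq, prev = [], None
    for _ in range(length):
        s = rng.randrange(p)
        while s == prev:
            s = rng.randrange(p)
        seq.append(s)
        prev = s
    return seq

p, T = 10, 20_000
rng = random.Random(3)
seq_A = random_sequence(p, p * T, rng)          # protocol A
seq_B = [t % p for t in range(p * T)]           # protocol B, fixed cyclic order

rho_A = rho_matrix(seq_A, p)
rho_B = rho_matrix(seq_B, p)
```

For protocol B the estimate gives exactly 1 for cyclic neighbors and 0 otherwise; for protocol A it converges to 2/(p − 1) for every pair.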
Note that in this case, when the number of presentations becomes very large, the synaptic matrix becomes very similar to the matrix used in Amit et al. (1994) and Brunel (1994). If there is a probability x of showing a randomly chosen stimulus between two successive stimuli of the cycle, ρ_μ,μ±1 decreases with x while the other ρ_μν become nonzero; the values interpolate between the fixed-order ones above (x = 0) and the random-sequence value 2/(p − 1) (x = 1).
Protocol C (paired associates). In this case

ρ_μμ̄ = 1,

since μ and its paired associate μ̄ always occur contiguously, and

ρ_μν = 1 / (p − 2)   for all ν ≠ μ, μ̄,

since the remaining contiguity, across two successive pairs, falls on a randomly chosen stimulus. Again, when a paired associate is replaced by a randomly chosen stimulus with probability x, ρ_μμ̄ decreases with x and the other ρ_μν adjust accordingly; for x = 1 the random-sequence value 2/(p − 1) is recovered for all pairs.

Thus the different synaptic distributions are now completely determined as functions of the learning stage T and of the learning protocol. They are characterized by the matrix ρ, giving the frequency of mutual contiguous occurrence of any pair of stimuli in the learning set E.

5 Learned Delay Activity Distributions
To monitor the neural dynamics we define the average activity m_μ(t) of the neurons in population F_μ (the neurons driven by stimulus number μ), and the average activity m_0(t) of the neurons that are not active in response to any stimulus. The population-averaged activity in the entire excitatory network is

m_E(t) = m_0(t) + f Σ_μ [m_μ(t) − m_0(t)],

and the population-averaged inhibitory activity is m_I(t). The average recurrent excitatory current impinging on a neuron of a given population F_μ (here μ denotes either a stimulus or 0) and its variance are given by equations 5.1 and 5.2. The dynamics of the excitatory network is described by equations 5.1 and 5.2, together with the equations giving the evolution of the means and variances of the depolarizations at the soma of the excitatory neurons in populations F_μ. From equations 2.1 and 2.2, it follows that the means I_μ and variances σ_μ² obey equations 5.3 and 5.4, analogous to equations 5.5 and 5.6 below for the inhibitory population.
The terms appearing on the right-hand side of equations 5.3 and 5.4 are the decay term, the external contribution, the recurrent excitatory contribution given by equations 5.1 and 5.2, and the inhibitory contribution. The corresponding equations for the inhibitory neurons are

τ_I ∂_t I_I = −I_I + I_I^ext + C_IE J_IE m_E − C_II J_II m_I   (5.5)

and

(τ_I / 2) ∂_t (σ_I²) = −σ_I² + C_IE J_IE² m_E + C_II J_II² m_I.   (5.6)

In equations 5.5 and 5.6, the terms appearing on the right-hand side are, again, the decay term, the external contribution, the recurrent excitatory contribution, and the inhibitory contribution. The average activity in each population is, in turn, given by

m_μ = φ_E(I_μ, σ_μ)   (5.7)

and

m_I = φ_I(I_I, σ_I),   (5.8)
where the transduction functions φ_α (α = E, I) are given by equation 2.3. To obtain the delay activity after presentation of a given stimulus at learning stage T, we proceed as follows:

1. Initially all neurons have their stable spontaneous activity. Only background external currents are present.

2. Stimulus number μ is presented by injecting into the neurons of population F_μ a "selective" external current above the background one. The neurons in this population are driven by the selective currents well above their spontaneous rates. The presentation lasts 100 ms (= 10 τ_E).

3. At the end of the presentation, the "selective" external currents are removed, and only the background external afferents remain. After a short transient, all neurons reach a steady-state delay activity, which persists indefinitely.

We choose the following parameters. The synaptic transition probabilities are p_+ = p_− = 0.2; the neural parameters are as in Section 2. The background synaptic efficacy is J_0 = 0.04 mV, while the potentiated synaptic efficacy is J_1 = 0.15 mV. The synaptic transition probability in the case of contiguous delay activity and stimulus activation, αp_+, is given for two values of α: α = 0.02 and α = 0.05. We use p = 50 stimuli, each stimulus activating a fraction f = 0.01 of the excitatory neurons in the network (Brunel 1994). We have not explored the parameter space; instead we have chosen a particular set of parameters to exhibit a case of good agreement with the experimentally observed delay activities in IT cortex of performing monkeys.

5.1 Protocol A. Stimuli are shown in a random sequence. The upper part of Figure 6 shows the evolution of delay activities as a function of the learning stage (number of presentations per stimulus) for neurons in the population corresponding to the stimulus presented (diamonds) and for neurons in populations corresponding to other stimuli (crosses).
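The steady-state delay activity reached in step 3 is a self-consistent solution of equations 5.5-5.8. A minimal sketch of finding such a fixed point by damped iteration follows; the sigmoidal transduction function and all parameter values are illustrative assumptions standing in for equation 2.3 and the paper's parameters:

```python
import math

# Damped fixed-point iteration for a two-population (E, I) mean-field system
# of the form m = phi(I(m)) at steady state (cf. equations 5.5-5.8).
def phi(I, gain=4.0, theta=0.5):
    """Assumed sigmoidal transduction function; rates normalized to [0, 1]."""
    return 1.0 / (1.0 + math.exp(-gain * (I - theta)))

def steady_state(I_ext_E=0.4, I_ext_I=0.2,
                 J_EE=1.2, J_EI=1.0, J_IE=1.0, J_II=0.5,
                 damp=0.1, tol=1e-10, max_iter=10_000):
    m_E = m_I = 0.0
    for _ in range(max_iter):
        I_E = I_ext_E + J_EE * m_E - J_EI * m_I   # mean currents with dI/dt = 0
        I_I = I_ext_I + J_IE * m_E - J_II * m_I
        new_E, new_I = phi(I_E), phi(I_I)
        if abs(new_E - m_E) < tol and abs(new_I - m_I) < tol:
            break
        m_E += damp * (new_E - m_E)               # damped update for stability
        m_I += damp * (new_I - m_I)
    return m_E, m_I

m_E, m_I = steady_state()
```

The damping factor plays the role of the relaxation produced by the time constants in the dynamical equations: it keeps the iteration contractive even when the recurrent couplings are strong.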
Figure 6 shows that there is a critical learning stage T_c, here T_c = 11 (the minimal number of presentations per stimulus for the creation of an attractor), beyond which selective delay activity appears. This critical learning stage is similar to the critical synaptic parameter of Amit and Brunel (1996). Before T_c, the neurons that are active during the presentation of any stimulus see their spontaneous activity increase slightly with T. This spontaneous activity is of the order of 3-4 s⁻¹. After T_c the neurons representing the shown stimulus have an elevated delay activity of the order of 20-35 s⁻¹. Other excitatory neurons remain at spontaneous activity levels. The critical stage T_c depends on the learning speed, which is controlled by the probabilities p_+ and p_−. The lower part of Figure 6 shows the corresponding evolution of the activity of the inhibitory neurons (crosses), which
Figure 6: Upper figure: delay activity (DA) of neurons coding for the shown stimulus (diamonds) and of neurons coding for other stimuli (crosses), as a function of the learning stage T. Lower figure: delay activity of inhibitory neurons (crosses) and of other excitatory neurons (diamonds). Activity is in units of 1/τ_E, that is, 100 s⁻¹.

also increases slightly with learning, and of the other excitatory neurons, not activated by any stimulus (diamonds), whose activity decreases from 3 to 2 s⁻¹. In this case, delay activities are uncorrelated, since they simply reflect the structure of the uncorrelated stimuli.

5.2 Delay Activities for Protocol B. Stimuli are presented in a fixed order. Before T_c, since there is no delay activity in the system, the neural rates are independent of the order of presentation. Immediately after T_c,
Figure 7: Delay activity of a cell in population F_25, as a function of the serial position of the shown stimulus, for α = 0.05 and three values of the learning stage T, indicated in the figure. The cell is active in the delay following stimulus 25 but also in the delays following the presentation of its neighbors. These data can be compared with Figure 3a of Miyashita (1988).
Figure 8: Same as Figure 7, but with α = 0.02.
uncorrelated attractors develop, as in the case of protocol A. Presentation of a given stimulus μ activates the neurons of the corresponding population, and this activity is maintained after removal of the stimulus, because the synapses connecting these neurons have been sufficiently potentiated. After a while, activity in these neurons also provokes an increase in the activity of the neurons in the populations of the neighboring stimuli, μ + 1 and μ − 1, because the synapses connecting these populations to F_μ, the synapses of G_μ,μ±1, now have an increased average efficacy. This activity can then propagate to further neighbors, μ ± 2 and so on. However, the inhibition controls the overall level of activity in the excitatory network, and therefore the activation spreads to only a few neighbors. This activation is also controlled by the parameter α, which characterizes the strength of the synapses of G_μ,μ±1 relative to those of G_μμ. Depending on this parameter α, there exist two regimes, one of low correlation and the other of high correlation:

1. High correlation (Figure 7, α = 0.05): After T = 15 learning cycles, the activation of a neuron coding for a given stimulus in the delay following the presentation of its neighbors becomes of the order of its activation in the delay following the stimulus itself. As learning proceeds, more neighbors see their neurons increase their delay activity significantly. In this case the correlations between two attractors corresponding to neighboring stimuli are very high.
2. Low correlation (Figure 8, α = 0.02): The activity of neurons in neighboring populations, though increased with respect to the other populations, remains low compared to the activity of the neurons that represent the shown stimulus. Correlations between the representations of neighboring stimuli remain relatively weak.

In the absence of stable spontaneous activity (as was the case in Brunel 1994), the structure of the delay activity is always as in Figure 7 (highly correlated delay activities). The presence of a stable spontaneous activity allows for reverberations in which the neurons coding for stimuli that are neighbors of the presented stimulus remain at low levels of activity (compared with the activation of the neurons coding for the presented stimulus), though significantly higher than their spontaneous activity.

Note that in the high-correlation regime, in addition to the neurons coding for the presented stimulus, those coding for its nearest neighbors will also be significantly active during the delay. This implies that from the learning stage at which such a high nearest-neighbor delay activity appears (T = 15 in Figure 7), learning due to delay activity could occur not only in synapses connecting nearest neighbors, as was assumed in Section 4, but also in synapses connecting next-nearest neighbors, that is, synapses in the populations G_μ,μ±2, though quantitatively the potentiation probability will be weaker for these synapses than for nearest-neighbor ones. In turn, at later learning stages, a high next-nearest-neighbor delay activity could appear, implying learning in the populations of synapses G_μ,μ±3, and so on.
However, if one allows for learning in synapses connecting more distant neighbors from the learning stage at which such significant neighbor delay activity appears, the picture remains qualitatively very similar. The main difference is that, owing to the potentiation of these synapses, more distant neighbors will be activated faster during the delay, enabling the network to reach the attractor in a shorter time, and the delay activities of neurons coding for stimuli more distant than the nearest neighbor will be slightly higher. In any case, inhibition prevents significant delay activation of a large number of neuronal populations. It is easy to calculate correlations as well as rank correlation coefficients between the delay activities provoked by different stimuli (see Brunel 1994). Qualitatively, these correlations are a decreasing function of the distance in serial position between the stimuli that provoked the delay activities. They decay to zero (or to negative values, in the case of rank correlations) at a distance corresponding to the number of populations of cells activated above spontaneous levels in a given attractor. For example, in Figure 7 the correlations would be significant up to a distance of 5 in serial position.

5.3 Protocol C: Paired Associates. In the case of paired associates, the situation is qualitatively similar to protocol B, except that now only the neurons coding for the shown stimulus and its paired associate are activated in the delay period. In this case too we can identify two regimes, with strong or weak correlation between the delay activities corresponding to the paired associates. The main difference is that now, in the strongly correlated regime, the delay activity of the paired-associate neurons is equal to the delay activity of the neurons coding for the shown stimulus. The network has therefore formed attractors that correspond not to the individual pictures but to the pairs of pictures.
This can be seen in Figure 9 (α = 0.05) at learning stage T = 15. By contrast, in Figure 10 the representations of the paired associates become correlated with learning but remain distinct. Note the similarity of this figure with one of the cells shown in Sakai and Miyashita (1991). The comparison is not direct, however: Sakai and Miyashita (1991) give the activity of cells during presentation of the stimulus, and the corresponding delay activity distributions, presented here, are not reported. The analysis predicts that the delay activities provoked by two paired associates should be significantly correlated, or even become equal. Note that the formation of similar pair-coding attractors has also been observed in a model with a fixed synaptic matrix (Parga 1994).

6 Discussion
In this paper we have discussed an explicit, plausible learning process in a recurrent neural network, which, in the presence of delay activity, implements the memory of the context of the learned stimuli in the
1704
Nicolas Brunel
Figure 9: Delay activity of a cell as a function of the serial position of the shown stimulus, for a = 0.05 (panels: learning stages T = 15 and T = 20). The cell is active in the delay following stimulus 25, but also after its paired associate (stimulus 26) is presented.
Hebbian Learning of Context in Recurrent Neural Networks
Figure 10: Same as Figure 9, but for a = 0.02.
synaptic matrix. In the case of stimuli shown in a fixed sequence during training, this synaptic matrix is found to be qualitatively similar to the matrix that was used in Amit et al. (1994) and Brunel (1994). With such a learning process it is possible to determine the statistical properties of the synaptic matrix as a function of the learning stage and the learning protocol. With the network composed of excitatory and inhibitory cells described in Amit and Brunel (1996), whose stable state in the absence of learning is a state in which neurons have a spontaneous activity of the order of 1 spike per second, it is possible to determine the statistical properties of the delay activities, again as a function of the learning stage and the learning protocol. In the only case in which, to our knowledge, experimental data are available (Miyashita 1988), we recover the results of Amit et al. (1994) and Brunel (1994), which are in good agreement with the experiment. Furthermore, the analysis allows prediction of both the evolution of the correlations during learning and the dependence of the correlations on the learning protocol. There are a number of tests of the theory presented in this paper that can in principle be done with visual memory experiments:

1. The time of occurrence of selective delay activity should not depend on the learning protocol (i.e., on the way stimuli are presented).

2. Delay activities corresponding to uncorrelated stimuli should initially be uncorrelated.

3. Correlations between delay activities should depend only on the order of presentation after the appearance of selective delay activity in the network, and not on the order of presentation prior to delay activity. For example, if stimuli are shown in a fixed order before the appearance of selective delay activity but in a random order afterward, the attractors should be uncorrelated.
We turn now to a brief discussion of the elements of the model. Excitatory and inhibitory cells are integrate-and-fire neurons described by the statistics of their input currents and their output firing frequency (Amit and Brunel 1996). This model roughly accounts for the average spontaneous and selective activities observed in the visual memory experiments. Last, though the average delay activities themselves do depend on the details of the model neuron, the correlations between the attractors of the system seem largely independent of the details of the single neuron. Large-scale simulations of networks of integrate-and-fire neurons are currently under way to confirm that these correlations are preserved if one considers networks of spiking neurons rather than neurons described by firing rates. The implementation of temporal correlations between stimuli in the synaptic matrix depends crucially on a mechanism leading to long-term potentiation when delay activity in one neuron connected by a synapse is immediately followed by stimulus-provoked activity in the other neuron connected by that synapse. This simple mechanism leads to the implementation of such correlations. In this paper this mechanism, and the whole synaptic process, was assumed to be symmetric in pre- and postsynaptic neurons. This assumption of symmetry was made for simplicity, but it is not necessary. In fact, experimental data suggest that LTP can be induced when postsynaptic activity follows presynaptic activity by 100 ms (Levy and Steward 1983; Gustafsson et al. 1987); on the other hand, if postsynaptic activity precedes presynaptic activity, LTP does not occur. The formalism developed in this paper can easily be generalized to such an asymmetric situation. This issue will be considered in a future work.
Appendix: Synaptic Distributions

1. For a synapse in population G_μμ: at each presentation of stimulus μ, a synapse that is in its low state has a probability p+ of making a transition to the potentiated state. Thus after n_μ presentations

g_μμ = 1 − (1 − p+)^(n_μ) [1 − g(0)],

where g(0) is the initial probability of finding a potentiated synapse.

2. For synapses in population G_μν with μ ≠ ν, the situation is somewhat more complicated, since the distribution depends not only on n_μ, n_ν, and n_μν but also on when the neighbor presentations were done. There are two simple cases in which the distribution can be calculated. The first is when stimuli μ and ν always follow each other. In this case the learning protocol can be divided into two intervals. The first corresponds to the absence of delay activity after presentation of a stimulus. At each presentation of stimuli μ or ν, potentiated synapses have a probability p− of making a transition to the low state. Thus after (n_μ, n_ν) presentations we have

g_μν = (1 − p−)^(n_μ + n_ν) g(0).

In the second interval, delay activity has developed. When a contiguous presentation of μ and ν occurs, there is a probability a p+ for low synapses of making a transition to the high state. Thus after n_μν occurrences of the contiguous presentation of stimuli μ and ν separated by the delay period, we have

g_μν = (1 − p−)^(n_μ + n_ν − 2n_μν) (1 − p− − a p+)^(n_μν) g(0).

When n_μν becomes large, g_μν approaches the fixed point of the potentiation-depression dynamics, a p+ / (p− + a p+).

Another limit case is when μ and ν are never presented contiguously. In this case the probability of the synapse's being potentiated is

g_μν = (1 − p−)^(n_μ + n_ν) g(0)

and therefore vanishes when the number of presentations increases. In the intermediary situation, when joint presentations occur but not systematically, we use an interpolation in the relative frequency of the contiguous appearance of the two stimuli, ρ_μν = 2n_μν / (n_μ + n_ν). This expression is

g_μν = (1 − p−)^(n_μ + n_ν − 2n_μν) (1 − p− − a p+)^(n_μν) g(0) + [1 − (1 − p− − a p+)^(n_μν)] a p+ ρ_μν / [a p+ ρ_μν + p− (2 − ρ_μν)]

and interpolates between the two preceding limit cases. The probability of having a potentiated link goes, when the number of presentations becomes very large at fixed ρ_μν, to

g_μν → a p+ ρ_μν / [a p+ ρ_μν + p− (2 − ρ_μν)].

3. For synapses in G_0μ and G_μ0, each presentation of stimulus μ causes depression with probability p−, and after n_μ presentations one has

g_μ0 = g_0μ = (1 − p−)^(n_μ) g(0),

and thus the probability of having a potentiated synapse goes to zero in the limit of many presentations of stimulus μ.
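The limiting behaviors derived in this appendix are easy to check numerically. The sketch below implements the closed-form expressions as reconstructed above, with arbitrary illustrative parameter values:

```python
# Closed-form synaptic distributions from the appendix (as reconstructed);
# the parameter values are arbitrary illustrations.

g0 = 0.1       # g(0): initial probability of a potentiated synapse
p_plus = 0.3   # potentiation probability per presentation
p_minus = 0.2  # depression probability per presentation
a = 0.5        # attenuation of delay-driven potentiation

def g_within(n):
    """G_mumu: synapse inside an active population; tends to 1."""
    return 1.0 - (1.0 - p_plus) ** n * (1.0 - g0)

def g_never_contiguous(n_mu, n_nu):
    """G_munu when mu, nu are never contiguous: pure depression, tends to 0."""
    return (1.0 - p_minus) ** (n_mu + n_nu) * g0

def g_limit(rho):
    """Large-n limit at fixed contiguity frequency rho = 2 n_munu / (n_mu + n_nu)."""
    return a * p_plus * rho / (a * p_plus * rho + p_minus * (2.0 - rho))
```

As expected, the within-population probability saturates at 1, the never-contiguous probability decays to 0, and the large-n limit grows monotonically with the contiguity frequency ρ.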
Acknowledgments

I am grateful to Daniel Amit and Stefano Fusi for many discussions and to Daniel Amit and Paolo del Giudice for the many detailed comments on a previous version of this article. I also thank the referees for very useful comments. This work was supported by a fellowship of Programme Cognisciences, CNRS, France.
References

Amit, D. J. 1995. The Hebbian paradigm reintegrated: Local reverberations as internal representations. Behavioral and Brain Sciences 18, 681.

Amit, D. J., and Brunel, N. 1995. Learning internal representations in an attractor neural network. Network 6, 359.
Amit, D. J., and Brunel, N. 1996. Global spontaneous activity and learned local delay activity in cortical conditions. Cerebral Cortex, in press.

Amit, D. J., and Fusi, S. 1994. Dynamic learning in neural networks with material synapses. Neural Computation 6, 957.

Amit, D. J., and Tsodyks, M. V. 1991. Quantitative study of attractor neural network retrieving at low spike rates I: Substrate-spikes, rates and neuronal gain. Network 2, 259.

Amit, D. J., Brunel, N., and Tsodyks, M. V. 1994. Correlations of Hebbian cortical reverberations: Experiment vs. theory. J. Neurosci. 14, 6635.

Atick, J. J. 1992. Could information theory provide an ecological theory of sensory processing? Network 3, 213.

Badoni, D., Bertazzoni, S., Buglioni, S., Salina, G., Amit, D. J., and Fusi, S. 1995. Electronic implementation of an analog attractor neural network with stochastic learning. Network 6, 125.

Barlow, H. B. 1961. Possible principles underlying the transformation of sensory messages. In Sensory Communication, W. A. Rosenblith, ed., p. 219. MIT Press, Cambridge, MA.

Bliss, T. V. P., and Collingridge, G. L. 1993. A synaptic model of memory: Long-term potentiation in the hippocampus. Nature 361, 31.

Braitenberg, V., and Schuz, A. 1991. Anatomy of the Cortex. Springer-Verlag, Berlin.

Brunel, N. 1994. Dynamics of an attractor neural network converting temporal into spatial correlations. Network 5, 449.

Dehaene, S., and Changeux, J. P. 1989. A simple model of prefrontal cortex function in delayed response tasks. J. Cognit. Neurosci. 1, 3.

Fuster, J. M. 1973. Behavioural electrophysiology of the prefrontal cortex. J. Neurophysiol. 36, 61.

Fuster, J. M. 1995. Memory in the Cerebral Cortex. MIT Press, Cambridge, MA.

Fuster, J. M., and Jervey, J. M. 1981. Inferotemporal neurons distinguish and retain behaviorally relevant features of visual stimuli. Science 212, 952.

Goldman-Rakic, P. S. 1987. Circuitry of primate prefrontal cortex and regulation of behaviour by representational knowledge. In Handbook of Physiology, Vol. 5, p. 373. American Physiological Society, Bethesda, MD.

Griniasty, M., Tsodyks, M. V., and Amit, D. J. 1993. Conversion of temporal correlations between stimuli to spatial correlations between attractors. Neural Computation 5, 1.

Gustafsson, B., Wigstrom, H., Abraham, W. C., and Huang, Y. Y. 1987. Long-term potentiation in the hippocampus using depolarizing current pulses as the conditioning stimulus to single volley synaptic potentials. J. Neurosci. 7, 774.

Komatsu, Y., Nakajima, S., Toyama, K., and Fetz, E. 1988. Intracortical connectivity revealed by spike-triggered averaging in slice preparations of cat visual cortex. Brain Res. 442, 359.

Levy, W. B., and Steward, O. 1983. Temporal contiguity requirements for long-term associative potentiation or depression in the hippocampus. Neuroscience 8, 791.
Linsker, R. 1989. An application of the principle of maximum information preservation to linear systems. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.

Mason, A., Nicoll, A., and Stratford, K. 1991. Synaptic transmission between individual pyramidal neurons of the rat visual cortex in vitro. J. Neurosci. 11, 72.

Miyashita, Y. 1988. Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature 335, 817.

Miyashita, Y., and Chang, H. S. 1988. Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature 331, 68.

Niki, H. 1974. Prefrontal unit activity during delay alternation in the monkey. Brain Res. 68, 185.

Parga, N. 1994. Private communication.

Ricciardi, L. M. 1977. Diffusion Processes and Related Topics in Biology. Springer-Verlag, Berlin.

Sakai, K., and Miyashita, Y. 1991. Neural organization for the long-term memory of paired associates. Nature 354, 152.

Tanaka, K. 1992. Inferotemporal cortex and higher visual function. Current Biology 2, 502.

Willshaw, D., Buneman, O. P., and Longuet-Higgins, H. 1969. Non-holographic associative memory. Nature 222, 960.

Wilson, F. A. W., Scalaidhe, S. P. O., and Goldman-Rakic, P. S. 1993. Dissociation of object and spatial processing domains in primate prefrontal cortex. Science 260, 1955.

Zipser, D., Kehoe, B., Littlewort, G., and Fuster, J. 1993. A spiking network model of short-term active memory. J. Neurosci. 13, 3406.
Received June 28, 1995; accepted March 7, 1996.
Communicated by Shun-ichi Amari
Singular Perturbation Analysis of Competitive Neural Networks with Different Time Scales

Anke Meyer-Base
Institute for Flight Mechanics and Control, Petersenstrasse 30, 64287 Darmstadt, Germany
Frank Ohl
Henning Scheich
Federal Institute for Neurobiology, Brenneckestrasse 6, 39118 Magdeburg, Germany
The dynamics of complex neural networks must include the aspects of long- and short-term memory. The behavior of the network is characterized by an equation of neural activity as a fast phenomenon and an equation of synaptic modification as a slow part of the neural system. The main idea of this paper is to apply a stability analysis method for fixed points of the combined activity and weight dynamics of a special class of competitive neural networks. We present a quadratic-type Lyapunov function for the flow of a competitive neural system with fast and slow dynamic variables as a global stability method, and a modality of detecting the local stability behavior around individual equilibrium points.

Neural Computation 8, 1731-1742 (1996) © 1996 Massachusetts Institute of Technology

1 Introduction: The Class of Neural Networks with Different Time Scales

Competitive neural networks with a combined activity and weight dynamics constitute an important class of neural networks. Their capability of storing desired patterns as stable equilibrium points requires stability criteria that include the mutual interference between neuron and learning dynamics. This paper investigates the dynamics of competitive neural networks, modeled by a system of competitive differential equations, from a rigorous analytic standpoint. The networks under study model the dynamics of both the neural activity levels, the short-term memory (STM), and the dynamics of synaptic modifications, the long-term memory (LTM). The actual network models under consideration may be considered extensions of Grossberg's shunting network (Grossberg 1976) or Amari's model for primitive neuronal competition (Amari 1982). These earlier networks are considered pools of mutually inhibitory neurons with fixed
synaptic connections. Our results extend the previous studies to systems where the synapses can be modified by external stimuli. The dynamics of competitive systems may be extremely complex, exhibiting convergence to point attractors and periodic attractors. For networks that model only the dynamics of the neural activity levels, Cohen and Grossberg (1983) found a Lyapunov function as a necessary condition for the convergence behavior to point attractors. Networks where both LTM and STM states are dynamic variables cannot be placed in this form, since the Cohen-Grossberg equations do not model synaptic dynamics. However, a large class of competitive systems has been identified as being "generally" convergent to point attractors even though no Lyapunov functions have been found for their flows. In this paper we apply the results of the theory of Lyapunov functions for singularly perturbed systems to competitive neural networks that have two types of state variables (LTM and STM), describing the slow and fast dynamics of the system, and give a Lyapunov function for the neural multitime scale system. The main finding of this paper is that in the vicinity of a fixed point a weighted sum of individual Lyapunov functions can serve as a Lyapunov function of the combined system.

In this section we introduce the class of laterally inhibited networks, with the lateral inhibition of any form, including the "Mexican hat" type of interaction, and define the network of differential equations characterizing them. We consider a laterally inhibited network with a deterministic signal Hebbian learning law (Hebb 1949) that is similar to the spatiotemporal system of Amari (1983). The general neural network equations describing the temporal evolution of the STM and LTM states for the jth neuron of an N-neuron network are

ẋ_j = −a_j x_j + Σ_{k=1..N} D_jk f(x_k) + B_j Σ_{i=1..N} m_ij y_i   (1.1)

ṁ_ij = −m_ij + y_i f(x_j)   (1.2)

where x_j is the current activity level, a_j is the time constant of the neuron, B_j is the contribution of the external stimulus term, f(x_j) is the neuron's output, y_i is the external stimulus, and m_ij is the synaptic efficiency. The neural network is modeled by a system of deterministic equations with a time-dependent input vector rather than a source emitting input signals with a prescribed probability distribution.¹

¹Our interest is to store patterns as equilibrium points in the N-dimensional space. In fact, Amari (1982) demonstrates the formation of stable one-dimensional cortical maps under the aspect of topological correspondence and under the restriction of a constant probability of the input signal.
Perturbation Analysis of Neural Networks
1733
By introducing the dynamic variable S_j = Σ_i y_i m_ij = yᵀm_j, we get a state space representation of the LTM and STM equations of the system:

ẋ_j = −a_j x_j + Σ_{k=1..N} D_jk f(x_k) + B_j S_j   (1.3)

Ṡ_j = −S_j + f(x_j)   (1.4)

The input stimuli are assumed to be normalized vectors of unit magnitude, |y|² = 1, which reduces the LTM equation 1.2 to equation 1.4. This system is subject to our analysis considerations regarding the stability of its equilibrium points.
2 Asymptotic Stability of Neural Networks with Different Time Scales

It is shown in this section how the global asymptotic and local stability of this class of neural networks can be determined by interpreting them as nonlinear singularly perturbed systems. Singular perturbation methods are engineering simplifications of dynamic models and are used as an approximation method for analyzing nonlinear systems. They reveal multiple-time-scale structures inherent in many practical problems. The solution of the state equation quite often exhibits the phenomenon that some variables move in time faster than others, leading to the classification of variables as "slow" and "fast." In practical examples we deal with dynamic systems, so we have to form a two-stage model by separating the time scales. The reduced model represents the slowest phenomena, while the boundary-layer models evolve in faster time scales and represent deviations from the predicted slow behavior.² Singular perturbation theory embraces a wide variety of dynamic phenomena possessing slow and fast modes, as they are present in many neurodynamic problems.

Competitive neural networks with two-time-scale dynamics can be formulated in a more general form and interpreted as singularly perturbed systems. We adopt the notation of Vidyasagar (1993) and Saberi and Khalil (1984) in introducing the general neural singularly perturbed system. Consider a competitive neural system described by the following system of nonlinear differential equations:

ε ẋ = g(x, S, ε)   (2.1)

Ṡ = f(x, S)   (2.2)

where f: Rᴺ × Rᴺ → Rᴺ and g: Rᴺ × Rᴺ → Rᴺ are continuously differentiable and satisfy f(0,0) = 0 and g(0,0) = 0. Equation 2.1 models the fast system and equation 2.2 the slow system. Both equations are a generalized representation of equations 1.3 and 1.4. This time-scale approach is asymptotic, that is, exact in the limit as the ratio ε of the speeds of the slow versus the fast dynamics tends to zero. When ε is small, approximations are obtained from reduced-order models in separate time scales.³ A reduced system is defined by setting ε = 0 in 2.1 and 2.2 to obtain

0 = g(x, S, 0)   (2.3)

Ṡ = f(x, S)   (2.4)

Assuming that 2.3 has a unique root x = h(S), the reduced system is rewritten as

Ṡ = f(h(S), S) = f_r(S)   (2.5)

A boundary-layer system is defined as

dx/dτ = g(x, S, 0)   (2.6)

where τ = t/ε is a stretched time scale and the vector S ∈ Rᴺ is treated as a fixed unknown parameter. In broad terms, the basic objective of singular perturbation theory is to draw conclusions about the behavior of the original system 2.1 and 2.2 based upon a study of the simplified system Ṡ = f_r(S) obtained from 2.2 by setting 0 = g(x, S, 0). Via the method of singular perturbations we are able to give local and global stability analysis methods for analyzing the behavior of the STM- and LTM-state neural network.

²A similar approximation method is given in Amari (1982) and is generally known in the literature as the "adiabatic" hypothesis method.
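The reduced/boundary-layer construction can be illustrated on a scalar toy system. The sketch below uses its own choice of f and g (not the network equations of this paper) and checks that, for small ε, the slow variable of the full system stays close to the solution of the reduced model Ṡ = f_r(S):

```python
# Toy fast-slow system illustrating equations 2.1-2.6:
#   fast:  eps * dx/dt = -x + S      (root of the fast equation: h(S) = S)
#   slow:        dS/dt = -S + 0.5*x
# Reduced model (eps -> 0): dS/dt = -S + 0.5*S = -0.5*S.
# f and g here are illustrative choices, not the paper's network.

def simulate(eps=0.05, dt=5e-4, t_end=2.0):
    x = s = 1.0          # start on the slow manifold x = h(S)
    s_reduced = 1.0
    for _ in range(int(t_end / dt)):
        x += dt / eps * (-x + s)              # boundary-layer dynamics
        s += dt * (-s + 0.5 * x)              # slow dynamics
        s_reduced += dt * (-0.5 * s_reduced)  # reduced model
    return s, s_reduced

s_full, s_red = simulate()
```

For ε = 0.05 the two slow trajectories agree to roughly O(ε), and the gap shrinks further as ε decreases, which is the content of the asymptotic claim above.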
2.1 The Local Stability Analysis Method. The local stability around an equilibrium point is analyzed according to the following theorem (Vidyasagar 1993):

Theorem 1. Consider the system given by equations 2.1 and 2.2. Define

A11 = [∂f/∂S]|(0,0),  A12 = [∂f/∂x]|(0,0),  A21 = [∂g/∂S]|(0,0),  A22 = [∂g/∂x]|(0,0)

and suppose A22 is nonsingular. Under these conditions, if both A22 and A0 = A11 − A12 A22⁻¹ A21 are Hurwitz, then there is an ε0 such that (0,0) is an asymptotically stable equilibrium of the system 2.1 and 2.2 whenever 0 < ε < ε0.

If the equilibrium under study is not the origin, one can always redefine the coordinates on Rᴺ in such a way that the equilibrium of interest becomes the new origin. The assumption that f(0,0) = g(0,0) = 0 leads to the well-known fact that the function that models the neuron's output passes through the zero point. The Hurwitz criterion and the derivation of the matrices A11, A12, A21, and A22 are given in Appendices A and B (see equations A.1, B.1, B.2, B.3, and B.4). If (0,0) is an asymptotically stable equilibrium of the system 1.3 and 1.4, we have to prove that the matrices A22 and A0 = A11 − A12 A22⁻¹ A21 are Hurwitz.⁴ Let us assume that ∂f(x_i)/∂x_i|(0,0) = k_i. By Gershgorin's theorem (Appendix A), A22 is Hurwitz if the following holds:

D_ii k_i < a_i   (2.7)

and

a_i − D_ii k_i > Σ_{j≠i} |D_ij| k_j   (2.8)

Setting A0 = A11 − A12 A22⁻¹ A21, we get an explicit expression for A0 (equation 2.9), where Ā_ij denotes the algebraic complement of the element A22{i,j} of the matrix A22; if A0 is Hurwitz, then the analogous diagonal-dominance conditions 2.10 and 2.11 must hold.

³The scalar ε represents all the small parameters to be neglected. In many applications, including neural networks, having a single parameter is not a restriction. For example, if a1 and a2 are small neural time constants of the same order of magnitude, O(a1) = O(a2), then one of them can be taken as ε and the other expressed as its multiple.

⁴In the parlance of singular perturbation theory, the matrix A22 is said to represent the fast dynamics, while A0 is said to represent the slow dynamics. Therefore x is often referred to as the fast state variable and S as the slow state variable, whose time evolution is governed by the matrix A0.
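Theorem 1 translates into a simple numerical recipe: evaluate the four block matrices at the equilibrium and test A22 and A0 = A11 − A12 A22⁻¹ A21 for the Hurwitz property. A minimal two-neuron sketch follows; all numerical values are illustrative assumptions, and for a real 2×2 matrix the Hurwitz test reduces to trace < 0 and det > 0:

```python
# Sketch of the Theorem 1 test for a two-neuron case. For a real 2x2
# matrix, "Hurwitz" is equivalent to trace < 0 and det > 0.
# All numerical values below are illustrative assumptions.

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    d = det2(M)
    return [[ M[1][1] / d, -M[0][1] / d],
            [-M[1][0] / d,  M[0][0] / d]]

def mul2(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def sub2(A, B):
    return [[A[i][j] - B[i][j] for j in range(2)] for i in range(2)]

def is_hurwitz2(M):
    return (M[0][0] + M[1][1]) < 0 and det2(M) > 0

# illustrative blocks: A11 = -I, A12 = diag(k_i), A21 = diag(B_i/a_i),
# A22 = -diag(a_i) + D * diag(k_i)  (cf. Appendix B)
A11 = [[-1.0, 0.0], [0.0, -1.0]]
A12 = [[0.5, 0.0], [0.0, 0.5]]
A21 = [[0.5, 0.0], [0.0, 0.5]]
A22 = [[-1.5, -0.25], [-0.25, -1.5]]

A0 = sub2(A11, mul2(A12, mul2(inv2(A22), A21)))
```

With these numbers both A22 and A0 pass the test, so Theorem 1 guarantees asymptotic stability of the origin for sufficiently small ε.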
Splitting the model into fast and slow dynamics reveals the useful and interesting fact that the Hurwitz condition on the fast matrix A22 represents the stability conditions of a neural network with only activity dynamics, like the Hopfield network. The local stability method is a valuable tool for designing stable neural networks for a given equilibrium point (Meyer-Base 1995; Meyer-Base et al. 1994). The following example of a two-neuron network illustrates how our proposed choice of initial conditions can be used to achieve local stability for an equilibrium point.

Example 1: Let N = 2, a_i = A, B_i = B, D_ii = α > 0, D_ij = −β < 0, and let the nonlinearity be a sigmoid function. We construct a stable equilibrium point in the neural network at (x1 = −β/A, x2 = (B + α)/A, S1 = 0, S2 = 1). If the slow and fast matrices for the chosen equilibrium point are stable (Hurwitz matrices), then the inequalities A − α > 0 and B > 0 must hold. A simulation example in Figure 1 shows such a system. Choosing A = 1, β = B = 2, and α = 1, we see that the trajectories of the LTM and STM states converge to the required equilibrium point.

2.2 The Global Stability Analysis Method. Having a modality of detecting the local stability around an equilibrium point, we return to our main problem of finding a Lyapunov function for the multitime scale neural network. In Saberi and Khalil (1984) it is shown that a quadratic-type Lyapunov function establishing asymptotic stability for a singularly perturbed system can be obtained as a weighted sum of the Lyapunov functions of the lower-order reduced and boundary-layer systems, assuming that the perturbation factor is sufficiently small. Theorem 2 (Saberi and Khalil 1984) states this formally:

Theorem 2. Suppose that there exist Lyapunov functions for the reduced and the boundary-layer system and that f and g satisfy certain growth requirements, as shown in Saberi and Khalil (1984). Then the origin (x = S = 0) is an asymptotically stable equilibrium point of the singularly perturbed system 2.1 and 2.2 for all ε < ε*(d). Moreover, for every d ∈ (0, 1),

v(x, S) = (1 − d)V(S) + dW(x, S)   (2.12)

is a Lyapunov function for 2.1 and 2.2 for all ε < ε*(d), where V is the Lyapunov function for the reduced-order system and W that of the boundary-layer system.
Figure 1: Time histories of a two-neuron neural network with (x1 = −β/A, x2 = (B + α)/A, S1 = 0, S2 = 1) as an equilibrium point: (a) STM states; (b) LTM states.

The success of recalling a desired pattern vector from partial information is directly related to the stability boundaries of the corresponding equilibrium point. A common approach to estimating the domain of attraction is through the use of critical level values of the associated Lyapunov functions. Saberi and Khalil (1984) mention an estimation of the domain of attraction:⁵

⁵The symbol B_x indicates a closed sphere in Rᴺ centered at x = 0; B_S is defined in Rᴺ in the same way.
L = {S ∈ B_S, x ∈ B_x | (1 − d)V(S) + dW(x, S) ≤ min[(1 − d)v0, d w0]}   (2.13)

with L_R = {S | V(S) ≤ v0} ⊂ B_S the domain of attraction of the reduced system and L_B = {S, x | W(x, S) ≤ w0} ⊂ B_S × B_x the domain of attraction of the boundary-layer system. The size of the domain of validity is the main advantage of this method compared to a simple linear stability analysis.⁶

To apply Theorem 2 we must determine two Lyapunov functions for our multitime scale neural network 1.3 and 1.4: one for the boundary-layer system and one for the reduced-order system (Meyer-Base et al. 1995). Cohen and Grossberg (1983) give a global Lyapunov function for a competitive neural network 1.1 with only an activation dynamics,

V(x) = −Σ_j ∫_0^{x_j} b_j(ξ) f'(ξ) dξ + (1/2) Σ_{j,k} m_jk f(x_j) f(x_k)   (2.14)

(in the Cohen-Grossberg notation, where b_j is the self-signal term), under the constraints m_ij = m_ji, a_i(x_i) ≥ 0, and f'(x_i) ≥ 0. This Lyapunov function can be adapted to the boundary-layer system, if the LTM contribution S_j is treated as a fixed unknown parameter, yielding the Lyapunov function:
W(x, S) = Σ_j a_j ∫_0^{x_j} ξ f'(ξ) dξ − Σ_j B_j S_j f(x_j) − (1/2) Σ_{j,k} D_jk f(x_j) f(x_k)   (2.15)

For the reduced-order system we can take the Lyapunov function

V(S) = (1/2) Σ_j S_j²   (2.16)

As stated in Theorem 2, the Lyapunov function for the STM and LTM dynamics is the superposition of the two previous Lyapunov functions:

v(x, S) = (1 − d)V(S) + dW(x, S)   (2.17)
The following example of a two-neuron network illustrates how the proposed Lyapunov function can be used to design globally stable neural networks.

Example 2: Let N = 2, a_i = A, B_i = B, D_ii = α > 0, D_ij = −β < 0, and let f(x_i) = x_i. For the boundary-layer system we get

ẋ_j = −A x_j + Σ_{k=1..2} D_jk f(x_k) + B S_j   (2.18)

and for the reduced-order system:

Ṡ_j = −S_j + h_j(S)   (2.19)

where x = h(S) is the root of the fast equation 2.3. Then for the Lyapunov functions we get

V(S) = (1/2)(S1² + S2²)   (2.20)

and

W(x, S) = (A/2)x1² + (A/2)x2² − B S1 x1 − B S2 x2 − (1/2)[α x1² + α x2² − 2β x1 x2]   (2.21)

Finally, for the Lyapunov function for the complete system we get

v(x, S) = (1 − d)V(S) + dW(x, S)
        = (1 − d)(1/2)(S1² + S2²) + d[(A/2)x1² + (A/2)x2² − B S1 x1 − B S2 x2 − (1/2)(α x1² + α x2² − 2β x1 x2)]   (2.22)

The above results imply A − α > 0 > B. These equations can be interpreted as follows: to achieve a stable equilibrium point (0,0), the contribution of the external stimulus term must be negative, and the sum of the excitatory and inhibitory contributions of the neurons must be less than the time constant. An evolution of the trajectories of the STM and LTM states for a two-neuron system is shown in Figure 2. Choosing B = −5, A = 1, and α = 0.5, we get from Saberi and Khalil (1984) the bound ε*(d) of equation 2.23.

⁶For a linear singularly perturbed system we have, as shown in Theorem 1, to prove that the slow matrix A0 as well as the fast matrix A22 are Hurwitz.
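The composite function of equation 2.22 can be monitored along a simulated trajectory. The sketch below takes A = 1, α = 0.5, B = −5 from Example 2 but assumes illustrative values β = 0.25, ε = 0.05, d = 0.5 and an arbitrary initial state; it only checks that v has decayed by the end of the run:

```python
# Monitoring the composite Lyapunov function v = (1 - d)V + dW of
# equation 2.22 along a trajectory with f(x) = x. A, alpha, B follow
# Example 2; beta, eps, d and the initial state are assumptions.

A, alpha, beta, B = 1.0, 0.5, 0.25, -5.0
eps, d = 0.05, 0.5

def V(s):
    return 0.5 * (s[0] ** 2 + s[1] ** 2)

def W(x, s):
    return (A / 2) * (x[0] ** 2 + x[1] ** 2) \
        - B * (s[0] * x[0] + s[1] * x[1]) \
        - 0.5 * (alpha * (x[0] ** 2 + x[1] ** 2) - 2 * beta * x[0] * x[1])

def v(x, s):
    return (1 - d) * V(s) + d * W(x, s)

def simulate(t_end=5.0, dt=1e-3):
    x, s = [0.2, -0.1], [0.1, -0.05]
    D = [[alpha, -beta], [-beta, alpha]]
    hist = [v(x, s)]
    for _ in range(int(t_end / dt)):
        xs = list(x)
        for j in range(2):
            drive = D[j][0] * xs[0] + D[j][1] * xs[1]
            x[j] += dt / eps * (-A * xs[j] + drive + B * s[j])  # fast STM
        for j in range(2):
            s[j] += dt * (-s[j] + xs[j])                        # slow LTM
        hist.append(v(x, s))
    return hist

hist = simulate()
```

The trajectory converges to the origin and v decays to (numerically) zero, consistent with Theorem 2 for sufficiently small ε.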
ε*(d) has a maximum at d = d* = 0.5. The basin of attraction is given by equation 2.13: L = {S ∈ B_S, x ∈ B_x | (1 − d)V(S) + dW(x, S) ≤ 0.5}, which defines the neighborhood of the equilibrium point for which Theorem 2 is valid.

3 Conclusions
Competitive neural networks with a combined activity and weight dynamics can be interpreted as nonlinear singularly perturbed systems. We have presented a global stability analysis method for an equilibrium point representing the stored pattern. The proposed quadratic-type Lyapunov function presupposes a monotonically increasing nonlinearity. This method provides an upper bound on the perturbation parameter,
Anke Meyer-Base, Frank Ohl, and Henning Scheich
and thus an estimation of a maximal positive neural time constant, and an estimation of the domain of attraction of the equilibrium point. We have also given a local stability analysis method for determining the stability around individual equilibrium points, and for analyzing the stability of the fast and slow parts of the neural system.

Figure 2: Time histories of a two-neuron neural network with the origin as an equilibrium point: (a) STM states; (b) LTM states.
Appendix A: Hurwitz Matrix

A square matrix is said to be Hurwitz if all of its eigenvalues have negative real parts. To show that a matrix B is Hurwitz, we use a famous eigenvalue localization theorem, the so-called Geršgorin theorem (Parks and Hahn 1993). This theorem states that the eigenvalues of a real N × N matrix B are contained in the union of the N disks of the complex λ-plane

$$|\lambda - B_{ii}| \le \sum_{j \ne i} |B_{ij}|, \qquad i = 1, \ldots, N,$$

i.e., it suffices that B_ii < 0 and |B_ii| > Σ_{j≠i} |B_ij| to guarantee stability.
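To make the criterion concrete, here is a small numerical sketch (our own illustration, not part of the original article; the function name and the example matrix are ours) of the Geršgorin sufficient condition next to the exact eigenvalue criterion:

```python
import numpy as np

def gersgorin_hurwitz(B):
    """Sufficient condition: every Gersgorin disk lies in the open left
    half-plane, i.e. B_ii < 0 and |B_ii| > sum_{j != i} |B_ij|."""
    B = np.asarray(B, dtype=float)
    diag = np.diag(B)
    radii = np.sum(np.abs(B), axis=1) - np.abs(diag)
    return bool(np.all(diag < 0.0) and np.all(np.abs(diag) > radii))

# A matrix with negative, row-dominant diagonal passes the test ...
A = np.array([[-3.0, 1.0],
              [0.5, -2.0]])
# ... and the exact eigenvalue criterion agrees with it here:
hurwitz_exact = bool(np.all(np.linalg.eigvals(A).real < 0.0))
```

Note that the Geršgorin test is only sufficient: a matrix can be Hurwitz without being diagonally dominant, in which case the test is inconclusive.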
Appendix B: Derivation of Matrices A_11, A_12, A_21, and A_22

Determination of the matrices A_21 and A_22 from Theorem 1:

$$A_{21} = \mathrm{diag}\left[B_1/a_1, \ldots, B_N/a_N\right] \qquad (B.1)$$

Assume that ∂f(x_i)/∂x_i = k_i at the equilibrium; this yields A_22. For A_11 and A_12:

$$A_{11}\{i, j\} = \begin{cases} -1, & i = j \\ 0, & \text{otherwise} \end{cases}$$

This means that A_11 = −I.
Acknowledgments

We thank the reviewers for their comments and suggestions, which have helped us to improve the contents and presentation of the work.
References

Amari, S. 1982. Competitive and cooperative aspects in dynamics of neural excitation and self-organization. In Competition and Cooperation in Neural Networks, S. Amari and M. A. Arbib, eds., Vol. 45, pp. 1-28. Springer Lecture Notes in Biomathematics.

Amari, S. 1983. Field theory of self-organizing neural nets. IEEE Transact. Syst. Man Cybern. SMC-13, 815-826.

Cohen, M. A., and Grossberg, S. 1983. Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transact. Syst. Man Cybern. SMC-13, 815-826.

Grossberg, S. 1976. Adaptive pattern classification and universal recoding. Biol. Cybern. 23, 121-134.

Hebb, D. O. 1949. The Organization of Behavior. John Wiley, New York.

Meyer-Base, A. 1995. Neural networks with dynamic properties and radial basis networks with an application in speech recognition. Dissertationsschrift, Institut für Datentechnik.

Meyer-Base, A., Ohl, F., and Scheich, H. 1991. Matrix perturbation theory as an analysis method of the stability of laterally inhibited neural networks. IEEE Int. Symp. Artificial Neural Networks Taiwan 1, 356-364.

Meyer-Base, A., Ohl, F., and Scheich, H. 1995. Stability analysis techniques for competitive neural networks with different time-scales. IEEE Joint Int. Conf. Neural Networks in Perth, 12.

Parks, P. C., and Hahn, V. 1993. Stability Theory. Prentice Hall, Englewood Cliffs, NJ.

Saberi, A., and Khalil, H. 1984. Quadratic-type Lyapunov functions for singularly perturbed systems. IEEE Transact. Automatic Control, 542-550.

Vidyasagar, M. 1993. Nonlinear Systems Analysis. Prentice Hall, Englewood Cliffs, NJ.
Received April 18, 1995; accepted February 29, 1996.
Communicated by Todd Leen
How Dependencies between Successive Examples Affect On-Line Learning

Wim Wiegerinck  Tom Heskes

RWCP* Novel Functions SNN† Laboratory, Department of Medical Physics and Biophysics, University of Nijmegen, Geert Grooteplein 21, 6525 EZ Nijmegen, The Netherlands
We study the dynamics of on-line learning for a large class of neural networks and learning rules, including backpropagation for multilayer perceptrons. In this paper, we focus on the case where successive examples are dependent, and we analyze how these dependencies affect the learning process. We define the representation error and the prediction error. The representation error measures how well the environment is represented by the network after learning. The prediction error is the average error that a continually learning network makes on the next example. In the neighborhood of a local minimum of the error surface, we calculate these errors. We find that the more predictable the example presentation, the higher the representation error, i.e., the less accurate the asymptotic representation of the whole environment. Furthermore we study the learning process in the presence of a plateau. Plateaus are flat spots on the error surface, which can severely slow down the learning process. In particular, they are notorious in applications with multilayer perceptrons. Our results, which are confirmed by simulations of a multilayer perceptron learning a chaotic time series using backpropagation, explain how dependencies between examples can help the learning process to escape from a plateau.

1 Introduction

The ability to learn from examples is an essential feature in many neural network applications (Hertz et al. 1991; Haykin 1994). Learning from examples enables the network to adapt its parameters or weights to its environment without the need for explicit knowledge of that environment. This paper focuses on a popular learning procedure called on-line learning. In this learning procedure examples from the environment are continually presented to the network at distinct time steps. At each time

*RWCP: Real World Computing Partnership. †SNN: Dutch Foundation for Neural Networks.
Neural Computation 8, 1743-1765 (1996) © 1996 Massachusetts Institute of Technology
Wim Wiegerinck and Tom Heskes
step a small adjustment of the network's weights is made on the basis of the currently presented example. This procedure is iterated as long as the network learns. The idea is that on a larger time scale the small adjustments sum up to a continuous adaptation of the network to the whole environment. In many applications the network has to be trained with a training set consisting of a finite number of examples. In these applications a strategy is often used where at each step a randomly selected example from the training set is presented. In particular, with large training sets and complex environments successful results have been obtained with this strategy (Brunak et al. 1990; Barnard 1992). Characteristic of this kind of learning is that successive examples are independent, i.e., that the probability to select an example at a certain time step is independent of its predecessors. Of course, successive examples in on-line learning do not need to be independent. For example, one can think of an application where the examples are obtained by on-line measurements of an environment. If these examples are directly fed into the neural network, it is likely that successive examples are correlated with each other. A related example is the use of neural networks for time-series prediction (Lapedes and Farber 1988; Weigend et al. 1990; Wong 1991; Weigend and Gershenfeld 1993; Hoptroff 1993). Essentially, the task of these networks is, given the last k data points of the time series, to predict the next data point of the time series. Each example consists of a data point and its k predecessors. There are two obvious ways to train a network "on-line" with these examples. In what we call "randomized learning," successively presented examples are drawn from the time series at arbitrary, randomly chosen times. This makes successively presented examples independent.
In the other type of learning, which we call "natural learning," the examples are presented in their natural order, keeping their natural dependencies. In Mpitsos and Burton (1992) and Hondou and Sawada (1994), both types of example presentation are compared for the learning of a one-dimensional chaotic mapping. In their simulations natural learning performs significantly better than randomized learning. This phenomenon, and, more generally, how the presentation order of examples affects the process of on-line learning, are the subject of this paper. Understanding these issues is not only interesting from a theoretical point of view, but it may also help to devise better learning strategies. In this paper we study the dynamics of on-line learning with dependent examples from a general point of view. In Section 2, we define the class of learning rules and the types of stochastic, yet dependent, example presentation which are analyzed in the rest of the paper. Because of the stochasticity in the presentation of examples, on-line learning is a stochastic process. However, since the weight changes at each time step are assumed to be small (in this paper the weight changes scale linearly with a small constant η, the so-called learning parameter), it is possible to give approximate deterministic descriptions of the learning process on
On-Line Learning
a larger time scale. In lowest order, the learning process can be described by an ordinary differential equation (ODE). The fluctuations, i.e., the differences between the stochastic trajectory of the weights and the ODE, are of order √η. These fluctuations are described by a covariance matrix. Besides a heuristic (re)derivation of the ODE and the equation for the fluctuations [a more rigorous derivation can be found in Benveniste et al. (1987) and Kuan and White (1994)], Section 3 also derives in the same heuristic framework an equation for a systematic bias. This bias, which is of order η, describes the lowest order difference between the mean value of the weights and their ODE trajectory. One could interpret the bias as a first-order correction to the ODE. With these equations, we will study the effect of dependencies in the examples on the learning process. In Section 4 we use these results to calculate how the presentation of examples affects asymptotic performance measures like the representation error and the prediction error. The representation error measures how well the environment is represented by the network after learning. The prediction error is the average error that a continually learning network makes on the next example. In Section 5 we use the results of Section 3 to study the effect of dependencies when the learning process is stuck on a so-called plateau in the error surface. Plateaus are frequently present in the error surface of multilayer perceptrons (Hush et al. 1992). Using the results in this section, the remarkable difference between randomized learning and natural learning, which has been mentioned in the previous paragraph, is explained. The last section gives a brief summary and discussion.

2 The Framework
In many on-line learning processes the weight change at learning step n can be written in the general form

$$\Delta w(n) = w(n+1) - w(n) = \eta\, f[w(n), x(n)] \qquad (2.1)$$

with w(n) the network weights and x(n) the presented example at iteration step n. η is the learning parameter, which is assumed to be constant in this paper, and f(·,·) the learning rule. Examples satisfying equation 2.1 can be found in supervised learning, such as backpropagation for multilayer perceptrons (Werbos 1974; Rumelhart et al. 1986), where the examples x(n) are combinations of input vectors [x_1(n), ..., x_k(n)] and desired output vectors [y_1(n), ..., y_l(n)], as well as in unsupervised learning, such as Kohonen's self-organizing rule for topological feature maps (Kohonen 1982), where x(n) stands for the input vector [x_1(n), ..., x_k(n)]. On-line learning in the general form of equation 2.1 has been studied extensively (Amari 1967; Ritter and Schulten 1988; White 1989; Heskes and Kappen 1991; Leen and Moody 1992; Orr and Leen 1992; Hansen et al. 1993; Radons 1993; Finnoff 1994). Many papers on this subject have been restricted to independent presentation of examples; i.e., the probability
p(x, n) to present an example x at iteration step n is given by a probability distribution p(x), independent of its predecessors. Dependencies between successive examples have been studied in Benveniste et al. (1987, and references therein) and recently in Kuan and White (1994) and Wiegerinck and Heskes (1994). In this paper dependencies between examples are incorporated by assuming that the probability to present an example x depends on its predecessor x′ through a transition probability τ(x|x′), i.e., that p(x, n) follows a first-order stationary Markov process

$$p(x, n+1) = \int dx'\, \tau(x|x')\, p(x', n) \qquad (2.2)$$
Learning with independent examples is a special case with τ(x|x′) = p(x). The limitation to first-order Markov processes is not as severe as it might seem at first sight, since stationary Markov processes of any finite order k can be incorporated in the formalism by redefining the vectors x to include the last k examples (Wiegerinck and Heskes 1994). The Markov process is assumed to have a unique asymptotic or stationary distribution p(x), i.e., we assume that we can take limits like

$$\lim_{N\to\infty} \frac{1}{N} \sum_{n=1}^{N} o(x(n)) = \int dx\, p(x)\, o(x)$$

in which o(x) is some function of the patterns. So p(x) describes the (asymptotic) relative frequency of patterns. A randomized learning strategy therefore will select its independent examples from this stationary distribution. In this paper we will denote these long-time averages with brackets ⟨·⟩_x,

$$\bigl\langle o(x) \bigr\rangle_x \equiv \int dx\, p(x)\, o(x),$$

and sometimes we use capitals, i.e., we define quantities like O ≡ ⟨o(x)⟩_x. Many neural network algorithms, including backpropagation, perform gradient descent on a "local" cost or error function e(w, x),

$$f[w(n), x(n)] = -\nabla_w\, e[w(n), x(n)] \qquad (2.3)$$
The idea of this learning rule is that with a small learning parameter, the stochastic gradient descent (equations 2.1 and 2.3) approximates deterministic gradient descent on the "global" error potential

$$E(w) = \bigl\langle e(w, x) \bigr\rangle_x = \int dx\, p(x)\, e(w, x) \qquad (2.4)$$
We restrict ourselves to learning with a cost function in order to compare performances between several types of pattern presentation (with equal stationary distributions), in particular in Sections 4 and 5. However, most derivations and results in this paper can be easily generalized to the general rule in equation 2.1.
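As an illustration of the framework, the following sketch (our own toy instance; the AR(1) example stream, the target rule, and all parameter values are assumptions, not taken from the paper) implements the update rule of equations 2.1 and 2.3 for a one-dimensional weight, with a dependent example stream of the Markov form of equation 2.2:

```python
import random

# On-line gradient descent on the local cost e(w, x) = (y - w*u)^2 / 2,
# where each example x = (u, y) depends on its predecessor through a
# first-order Markov (AR(1)) process: u(n+1) = 0.9*u(n) + noise, y = 2*u.
rng = random.Random(0)
eta = 0.005          # learning parameter, kept constant as in the paper
w, u = 0.0, 0.0
for n in range(100_000):
    u = 0.9 * u + rng.gauss(0.0, 1.0)   # dependent example stream (eq. 2.2)
    y = 2.0 * u                          # target rule to be learned
    w += eta * (y - w * u) * u           # Delta w = -eta * grad_w e(w, x)
# w approaches w* = 2, the minimum of the global error E(w) = <e(w, x)>_x
```

Although successive examples here are strongly correlated, the weight still converges toward the minimizer of the global error potential, in line with the lowest-order ODE picture derived in the next section.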
3 ODE Approximation and Beyond
The update rule for the weights in equation 2.1 and the Markov process governing the presentation of examples in equation 2.2 can be combined into one evolution equation for the joint probability P(w, x, n) that at step n example x is presented to the network with weight vector w. This probability obeys the Markov process

$$P(w, x, n+1) = \int dw'\, dx'\, \tau(x|x')\, \delta\bigl(w - w' - \eta f(w', x')\bigr)\, P(w', x', n). \qquad (3.1)$$

We are interested in the learning process, i.e., in the evolution of the probability distribution of weights

$$P(w, n) = \int dx\, P(w, x, n).$$
With dependent examples, it is not possible to derive a self-supporting equation for the evolution of P(w, n) by direct integration over x in equation 3.1. However, in Wiegerinck and Heskes (1994) it is shown that the evolution equation of P(w, n) can be expanded systematically in the small learning parameter η. The basic assumption for this expansion is that the dynamics of the weights, with typical time scale 1/η, is much slower than the typical time scale of the examples. In the following, we present a slightly different approach to approximate the evolution of the probability distribution of weights. This approach, based on van Kampen (1992), assumes that the distribution of weights, with initial form P(w, 0) = δ[w − w(0)], remains sharply peaked as n increases. We follow the heuristic treatment in Benveniste et al. (1987) and average the learning rule over a "mesoscopic" time scale (Hansen et al. 1993), which is much larger than the typical time scale of the example dynamics yet much smaller than the time scale on which the weights can change significantly. With the averaged learning rule we can directly calculate approximate equations for the mean w̄(n) and the covariance matrix Σ²(n), which describe the position and the width of the peak P(w, n), respectively. We iterate the learning step from equation 2.1 M times, where M is a mesoscopic time scale, i.e., 1 ≪ M ≪ 1/η, and obtain

$$w(n+M) - w(n) = \eta \sum_{m=0}^{M-1} f[w(n+m), x(n+m)]. \qquad (3.2)$$
For the average w̄(n) ≡ ⟨w(n)⟩ (brackets ⟨·⟩ stand for averaging over the combined process in equation 3.1), we have the exact identity

$$\bar{w}(n+M) - \bar{w}(n) = \eta \sum_{m=0}^{M-1} \bigl\langle f[w(n+m), x(n+m)] \bigr\rangle. \qquad (3.3)$$
On the one hand, the mesoscopic time scale is much smaller than the time scale on which the probability distribution P(w, n) can change appreciably. Therefore, if the probability distribution P(w, n) is very sharply peaked, we can expand equation 3.3 around the mean w̄(n),

$$\bar{w}(n+M) - \bar{w}(n) \approx \eta \sum_{m=0}^{M-1} \bigl\langle f[\bar{w}(n), x(n+m)] \bigr\rangle.$$

On the other hand, the mesoscopic time scale is much larger than the typical time scale of the Markov process governing the presentation of examples. Therefore we can approximate the sum by the stationary average

$$\frac{1}{M} \sum_{m=0}^{M-1} \bigl\langle f[\bar{w}(n), x(n+m)] \bigr\rangle \approx \bigl\langle f[\bar{w}(n), x] \bigr\rangle_x \equiv F[\bar{w}(n)]. \qquad (3.4)$$
Thus, in lowest order, the stochastic equation 3.3 can be approximated by the deterministic difference equation

$$\bar{w}(n+M) - \bar{w}(n) = \eta M\, F[\bar{w}(n)].$$

For small ηM, the difference equation for the position of the peak turns into an ordinary differential equation (ODE). In terms of the rescaled continuous time t, with t_n ≡ ηn [we will use both notations w̄(n) and w̄(t)], we obtain that the learning process is approximated by the ODE

$$\frac{d\bar{w}(t)}{dt} = F[\bar{w}(t)] = -\nabla E[\bar{w}(t)] \qquad (3.5)$$

In this equation E(w) is the global error potential defined in equation 2.4. In lowest order the weights do indeed follow the gradient of the global error potential. Dependencies in successively presented examples have no influence on the ODE (equation 3.5): this equation depends only on the stationary distribution p(x) of the examples. Corrections to the ODE arise when we expand (equation 3.2)
$$w(n+M) - w(n) = \eta \sum_{m=0}^{M-1} f[w(n), x(n+m)] - \eta^2 \sum_{m=0}^{M-1} h[w(n), x(n+m)] \sum_{l=0}^{m-1} f[w(n), x(n+l)] + \cdots \qquad (3.6)$$

with the "local Hessian" h(w, x) = ∇_w ∇_w^T e(w, x). Using the separation between time scales, we approximate this expansion by
$$w(n+M) = w(n) + \eta M \Bigl\{ F[w(n)] + \eta B[w(n)] - \frac{1}{2}\eta M\, H[w(n)]\, F[w(n)] \Bigr\} + \cdots \qquad (3.7)$$

with the "Hessian"

$$H(w) = \bigl\langle h(w, x) \bigr\rangle_x = \nabla_w \nabla_w^T E(w) \qquad (3.8)$$

and

$$B(w) = \lim_{N\to\infty} \sum_{n=1}^{N-1} \Bigl[ \bigl\langle h[w, x(n)]\, f[w, x(0)] \bigr\rangle - H(w)\, F(w) \Bigr] \qquad (3.9)$$
Note that B(w) is zero with independent examples. Later on we will see that the term containing H[w(n)]F[w(n)] will vanish by the transformation to continuous time. Averaging equation 3.7 yields an equation for w̄(n+M) − w̄(n), and by expansion of the right-hand side around the mean w̄(n) we obtain

$$\bar{w}(n+M) - \bar{w}(n) = \eta M \Bigl\{ F[\bar{w}(n)] - \frac{1}{2} Q[\bar{w}(n)] : \bigl\langle [w(n) - \bar{w}(n)]\,[w(n) - \bar{w}(n)]^T \bigr\rangle + \eta B[\bar{w}(n)] - \frac{1}{2}\eta M\, H[\bar{w}(n)]\, F[\bar{w}(n)] \Bigr\} + \cdots \qquad (3.10)$$
in which

$$\Sigma^2(n) \equiv \bigl\langle [w(n) - \bar{w}(n)]\,[w(n) - \bar{w}(n)]^T \bigr\rangle \qquad (3.11)$$

and

$$(Q : \Sigma^2)_i = \sum_{jk} Q_{ijk}\, \Sigma^2_{jk}, \qquad \text{with } Q_{ijk}(w) = \nabla_{w_j} \nabla_{w_k} F_i(w). \qquad (3.12)$$
Transformation to continuous time finally yields a first approximation beyond the ODE in equation 3.5

$$\frac{d\bar{w}(t)}{dt} = F[\bar{w}(t)] - \frac{1}{2} Q[\bar{w}(t)] : \Sigma^2(t) + \eta B[\bar{w}(t)] \qquad (3.13)$$

Unlike equation 3.5 this is no longer a self-supporting equation for w̄ alone; higher moments enter as well. The evolution of the mean w̄ in the course of time is therefore not determined by w̄ itself, but is influenced by the fluctuations around this average through their covariance Σ². It is clear that for the existence of the ODE approximation and of its higher order approximations, it is necessary that the fluctuations are small. In the derivation of equation 3.13 we have used in foresight that these fluctuations are of order √η and therefore their covariance Σ² is of order η. In fact, similarly to the derivations of equations 3.5 and 3.13, a lowest order approximation for the fluctuations can be derived,

$$\frac{d\Sigma^2(t)}{dt} = -H[\bar{w}(t)]\,\Sigma^2(t) - \Sigma^2(t)\,H[\bar{w}(t)] + \eta D[\bar{w}(t)] \qquad (3.14)$$
with the "diffusion" matrix

$$D(w) = \lim_{N\to\infty} \frac{1}{N} \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} \Bigl\langle \{f[w, x(n)] - F(w)\}\,\{f[w, x(m)] - F(w)\}^T \Bigr\rangle. \qquad (3.15)$$
From equation 3.14, we can see that Σ²(t) remains bounded if H is positive definite. In this case Σ²(t) = O(η), which makes equation 3.14 with equation 3.13 a valid approximation (van Kampen 1992). In other words, since η is small, this justifies a posteriori the assumption that P(w, n) is sharply peaked. In other cases where the fluctuations do not remain bounded, the approximation is applicable only during a short period. The diffusion D(w) can be expressed as the sum of an independent and a dependent part:

$$D(w) = C_0(w) + C_+(w) + C_+^T(w), \qquad C_+(w) \equiv \sum_{m=1}^{\infty} C_m(w), \qquad (3.16)$$
where we have defined the auto-correlation matrices

$$C_m(w) = \Bigl\langle \{f[w, x(m)] - F(w)\}\,\{f[w, x(0)] - F(w)\}^T \Bigr\rangle. \qquad (3.17)$$
For on-line learning with random sampling, there are no dependencies between subsequent weight changes, so C₊(w) = 0 and consequently the diffusion D(w) reduces to C₀(w) (see, e.g., Heskes 1994). The set of equations 3.13 and 3.14 for w̄ and Σ² forms a self-supporting first approximation beyond the ODE approximation in equation 3.5. It is not necessary to solve equations 3.13 and 3.14 simultaneously. Since the covariance Σ² appears in equation 3.13 as a correction, it suffices to compute Σ² from equation 3.14 using the ODE approximation for w̄. Following van Kampen (1992) we set w̄ = w^ODE + u, and solve

$$\frac{dw^{ODE}(t)}{dt} = F[w^{ODE}(t)] \qquad (3.18)$$

$$\frac{d\Sigma^2(t)}{dt} = -H[w^{ODE}(t)]\,\Sigma^2(t) - \Sigma^2(t)\,H[w^{ODE}(t)] + \eta D[w^{ODE}(t)] \qquad (3.19)$$

$$\frac{du(t)}{dt} = -H[w^{ODE}(t)]\,u(t) - \frac{1}{2} Q[w^{ODE}(t)] : \Sigma^2(t) - \eta B[w^{ODE}(t)] \qquad (3.20)$$
Equations 3.18 and 3.19 are equivalent to results that one can find in the literature (Benveniste et al. 1987; Kuan and White 1994; Wiegerinck and Heskes 1994). The ODE in equation 3.18 approximates in lowest order the dynamics of the weights. The covariance matrix Σ²(t), which obeys equation 3.19, describes the stochastic deviations w(n) − w^ODE(t_n) between the weights and the ODE approximation. These fluctuations are typically of order √η. [Their "square" Σ²(t) is of order η.] In Benveniste et al. (1987) and Kuan and White (1994) a Wiener process is rigorously derived to describe these fluctuations. In Wiegerinck and Heskes (1994) a Fokker-Planck equation that describes these fluctuations is derived. In the next section we will study how these fluctuations affect some asymptotic error measures. Equation 3.20 describes a bias u between the mean w̄ and the ODE approximation w^ODE. The dynamics of the bias consists of two driving terms. The first one is the interaction between the nonlinearity of the learning rule Q and the fluctuations described by Σ². This term can be understood in the following way: If a random fluctuation into one direction in weight space does not result in the same restoring effect as a random fluctuation into the opposite direction, then random fluctuations will obviously result in a net bias effect. The other driving term in equation 3.20 is B (see equation 3.9). This term is only due to the dependencies of the examples. Since the two driving terms are typically of order η, the bias term is also typically of order η. The bias is typically an order √η smaller than the fluctuations and is therefore neglected in regular situations. However, in Section 5 it will be shown (and this will be supported by simulations) that there are situations where this bias term is of crucial importance.
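The scaling of the fluctuations can be checked numerically. The sketch below (our own code; the parameter values and the tolerance are ours) uses the two-state flip example analyzed in Section 4, for which H = 1 and D = (1 − q)/q, so the fixed point of equation 3.19 predicts a stationary variance Σ² = η(1 − q)/(2q), i.e., fluctuations of order √η:

```python
import random

def stationary_var(q, eta, steps=400_000, seed=0):
    """Stationary variance of w for the flip example: x in {-1, +1} flips
    sign with probability q, and the learning rule is dw = eta*(x - w)."""
    rng = random.Random(seed)
    x, w = 1.0, 0.0
    s1 = s2 = 0.0
    for _ in range(steps):
        if rng.random() < q:
            x = -x                      # Markov example dynamics
        w += eta * (x - w)              # learning step (equation 2.1)
        s1 += w
        s2 += w * w
    mean = s1 / steps
    return s2 / steps - mean * mean

q = 0.25
# fixed point of eq. 3.19 predicts Var = eta*(1-q)/(2*q) = 1.5*eta here
var_theory = {eta: eta * (1 - q) / (2 * q) for eta in (0.04, 0.01)}
var_sim = {eta: stationary_var(q, eta) for eta in (0.04, 0.01)}
```

Reducing η by a factor of 4 should reduce the measured variance by roughly the same factor, confirming the O(η) scaling of Σ² (and hence the O(√η) scaling of the fluctuations themselves).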
As an approximation, the set of coupled equations 3.18-3.20 is equally valid as the coupled set 3.13 and 3.14. However, in 3.18-3.20 the hierarchical structure of the approximations (ODE approximation, fluctuations, bias) is clearer. The influence of the example presentation on the evolution of the weight distribution is twofold. On the one side, dependencies between examples affect the covariance Σ² through the diffusion term D (see equation 3.15). On the other side, they affect the mean value through the vector B, and indirectly through the covariance Σ². For independent examples, D reduces to C₀ (see equation 3.17) and B = 0 exactly. Finally we want to stress that the essential assumption for the validity of equations 3.18-3.20 is that the weights can be described by their average value with small superimposed fluctuations. In other words, the approximation 3.18-3.20 is locally valid. This is the case if the Hessian H is positive definite. In other cases the approximation is valid only for short times (van Kampen 1992). In the analysis of the next two sections we tacitly assume this local validity.

4 Representation Error and Prediction Error
In this section we show how dependencies between successive examples influence the asymptotic performance of the network. In the asymptotic situation, the weights are assumed to remain concentrated around a minimum w* of the global error E(w). We consider two measures of network performance: the "representation error" E_repr and the "prediction error" E_pred. The meaning of this terminology differs slightly from its usual meaning in most neural network literature. The representation error

$$E_{\mathrm{repr}} \equiv \lim_{n\to\infty} \bigl\langle E[w(n)] \bigr\rangle \qquad (4.1)$$

is the expectation of the asymptotic global error E[w(n)] (cf. equation 2.4) made by the network. It is a useful measure to compare different example presentation techniques if the goal is minimization of the local cost function e(w, x) in an environment given by a probability distribution p(x). In the context of time series, E_repr measures how well the asymptotic network state is expected to represent the whole time series. The prediction error

$$E_{\mathrm{pred}} \equiv \lim_{n\to\infty} \bigl\langle e[w(n), x(n)] \bigr\rangle \qquad (4.2)$$
is the error that the network in its final stage of learning is expected to make on the next example of the time series. E_pred measures the error locally in time, in contrast to the more global measure E_repr. The weights are assumed to be concentrated around a minimum w* of the global error E(w). This implies
$$\nabla E(w^*) = 0$$
The fluctuations Σ² and the bias u = ⟨w(∞)⟩ − w* satisfy in lowest order the fixed-point equations of 3.19 and 3.20

$$H(w^*)\,\Sigma^2 + \Sigma^2\,H(w^*) = \eta D(w^*)$$

$$H(w^*)\,u = -\frac{1}{2} Q(w^*) : \Sigma^2 - \eta B(w^*).$$
With the techniques used in the previous section we calculate the two error measures up to O(η). To obtain the representation error in equation 4.1 we expand E(w) around its minimum w*,

$$E_{\mathrm{repr}} = \lim_{n\to\infty} \bigl\langle E[w(n)] \bigr\rangle = E(w^*) + \frac{1}{2}\,\mathrm{Tr}\bigl[H(w^*)\,\Sigma^2\bigr] + \cdots = E(w^*) + \frac{\eta}{4}\,\mathrm{Tr}\bigl[D(w^*)\bigr] + \cdots.$$
To calculate the prediction error in equation 4.2, we apply a time-averaging procedure similar to the one used in Section 3. Given weight vector w(n) before the first learning step, the local error over the next M examples is

$$\frac{1}{M} \sum_{m=0}^{M-1} e[w(n+m), x(n+m)] = \frac{1}{M} \sum_{m=0}^{M-1} e[w(n), x(n+m)] - \frac{\eta}{M} \sum_{m=0}^{M-1} \sum_{l=0}^{m-1} f^T[w(n), x(n+m)]\, f[w(n), x(n+l)] + \cdots.$$

For a mesoscopic time scale M we obtain, using the definitions from equations 3.16 and 3.17,
$$E_{\mathrm{pred}} = \lim_{n\to\infty} \frac{1}{M} \sum_{m=0}^{M-1} \bigl\langle e[w(n+m), x(n+m)] \bigr\rangle = E_{\mathrm{repr}} - \frac{\eta}{2}\,\mathrm{Tr}\bigl[C_+(w^*) + C_+^T(w^*)\bigr] + \cdots.$$
For randomized learning, the prediction error and representation error are equal: E_pred = E_repr = E_ran. Using D(w*) = C₀(w*), we obtain

$$E_{\mathrm{ran}} = E(w^*) + \frac{\eta}{4}\,\mathrm{Tr}\bigl[C_0(w^*)\bigr] + \cdots.$$
If we compare the representation and prediction error with dependent examples to the error with independent examples (assuming that the weight distribution is concentrated around the same minimum w*), we see that, up to order η, the profit in prediction exactly cancels the loss in representation and vice versa:

$$\frac{E_{\mathrm{pred}} + E_{\mathrm{repr}}}{2} = E_{\mathrm{ran}} + \cdots.$$
In the context of strategies to select examples, this implies that a strategy that yields a larger prediction error will most likely lead to a smaller representation error. Depending on whether successive weight changes are, roughly speaking, positively or negatively correlated, the prediction error is smaller or larger than the representation error, respectively. This is nicely illustrated by the following simple example. We consider a process where the examples can take two values x = ±1 with transition probability

$$\tau(x|x') = (1 - q)\,\delta_{x,x'} + q\,\delta_{x,-x'},$$
i.e., there is a probability 0 < q ≤ 1 to flip the sign of the input. The stationary distribution p(x) is given by

$$p(x) = \frac{1}{2}\bigl(\delta_{x,-1} + \delta_{x,+1}\bigr) \qquad (4.3)$$

A one-dimensional "weight" w tries to minimize the squared distance between the presented example and the weight; i.e., the local error is

$$e(w, x) = \frac{1}{2}(w - x)^2 \qquad (4.4)$$

and the corresponding update rule [cf. equations 2.1 and 2.3] is

$$\Delta w = \eta\,(x - w).$$
The global error E(w) (cf. equation 2.4) is obtained by averaging the local error (cf. equation 4.4) over the stationary distribution (cf. equation 4.3),

$$E(w) = \frac{1}{2}\bigl(1 + w^2\bigr),$$

and has a minimum E(w*) = 1/2 for w* = 0. To compute the performance measures from equations 4.1 and 4.2 for our simple unsupervised example as a function of the flip probability q, we first calculate the autocorrelations C_m(w*) (cf. equation 3.17) in the minimum w* = 0:

$$C_m(0) = \bigl\langle x(m)\,x(0) \bigr\rangle = (1 - 2q)^m, \qquad C_0 = 1,$$

and thus

$$C_+ = \sum_{m=1}^{\infty} C_m = \frac{1 - 2q}{2q}.$$
Up to O(η) we obtain

$$E_{\mathrm{pred}} = \frac{1}{2} + \frac{3q - 1}{4q}\,\eta \qquad\text{and}\qquad E_{\mathrm{repr}} = \frac{1}{2} + \frac{1 - q}{4q}\,\eta.$$

For flip probability q < 1/2 we have better prediction than representation; for q > 1/2 better representation than prediction (q = 1/2 corresponds to randomized learning). This is what we could expect: The larger the flip probability, the better the overall sampling of the input space for the problem at hand (finding the average input) and thus the better the representation. However, the larger the flip probability, the more difficult it is to predict the next example for the network that has just been adapted to the current example.
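These O(η) formulas can be checked directly by simulating the flip example. The sketch below is our own code; the parameter values and tolerances are assumptions, not taken from the paper:

```python
import random

def estimate_errors(q, eta, steps=400_000, seed=0):
    """Estimate E_pred and E_repr for the flip example with rule
    Delta w = eta*(x - w), where x in {-1, +1} flips with probability q."""
    rng = random.Random(seed)
    x, w = 1.0, 0.0
    e_pred = e_repr = 0.0
    for _ in range(steps):
        if rng.random() < q:
            x = -x                       # draw the next example (eq. 2.2)
        e_pred += 0.5 * (x - w) ** 2     # error on the next example (eq. 4.2)
        e_repr += 0.5 * (1.0 + w * w)    # global error E(w) (eq. 4.1)
        w += eta * (x - w)               # learning step (eq. 2.1)
    return e_pred / steps, e_repr / steps

eta, q = 0.02, 0.1
ep, er = estimate_errors(q, eta)
ep_theory = 0.5 + eta * (3 * q - 1) / (4 * q)   # = 0.465
er_theory = 0.5 + eta * (1 - q) / (4 * q)       # = 0.545
```

With q = 0.1 < 1/2, the simulation should reproduce the predicted asymmetry: prediction better than randomized learning, representation worse, with the two effects cancelling on average up to O(η).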
5 Plateaus
In comparing the asymptotic performance of networks trained with dependent and independent examples in the previous section, we assumed that with small η, both types of learning lead to the same (local) minimum of the global error E(w) (see equation 2.4). This is not unreasonable if the learning process is initiated in the neighborhood of this minimum. A minimum is a stable equilibrium point of the ODE dynamics (cf. equation 3.5), i.e., the eigenvalues of the Hessian H(w) (see equation 3.8) are strictly positive. In the neighborhood of a minimum, the ODE force F(w) (see equation 3.4) is the dominating factor in the dynamics. Perturbations due to the higher order corrections are immediately restored by the ODE force. In this section, however, we will consider so-called "plateaus." Plateaus are flat spots on the global-error surface. They are often the cause of the extremely long learning times and/or the bad convergence results in multilayer perceptron applications with the backpropagation algorithm (Hush et al. 1992). On a plateau, the gradient of E is negligible and H has some positive eigenvalues but also some zero eigenvalues. Plateaus can be viewed as indifferent equilibrium points of the ODE dynamics. Even with small η, the higher order terms can make the weight vector move around in the subspace of eigenvectors of H with zero eigenvalue without being restored by F. In other words, in these directions the higher order terms (in the first place the fluctuations, which are of order √η, and in the second place the bias, which is of order η) may give a larger contribution to the dynamics than the ODE term. Since the higher order terms are related to dependencies between the examples, on plateaus the presentation order of examples might significantly affect the learning process. The effect of different example presentations in learning on a plateau will be illustrated by the following example. We consider the tent map
y(x) = 2(1/2 − |x − 1/2|),   0 ≤ x ≤ 1,
which we view as a dynamic system producing a chaotic time series x(n + 1) = y[x(n)] (Schuster 1989). To model this system we use a two-layered perceptron with one input unit, two hidden units, and a linear output,
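As a small sketch (ours, not the authors' code), the tent map and its iterated time series can be written as:

```python
def tent_map(x):
    """The tent map y(x) = 2(1/2 - |x - 1/2|) on [0, 1]."""
    return 2.0 * (0.5 - abs(x - 0.5))

def tent_series(x0, n):
    """Chaotic time series x(n+1) = y(x(n)) of length n.
    Caveat (ours): in double precision a long orbit eventually collapses
    onto the fixed point at 0, since each step doubles the binary fraction;
    long simulations commonly add a tiny amount of noise per step."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(tent_map(xs[-1]))
    return xs
```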
z(w, x) = v₀ + Σ_{β=1}^{2} v_β tanh(w_{β1} x + w_{β0}).
We train the network with input-output pairs {x, y(x)} by on-line backpropagation (Rumelhart et al. 1986),

Δw = −η ∇_w e(w, x),
Wim Wiegerinck and Tom Heskes
1756
with the usual squared error cost function e(w, x) = [y(x) − z(w, x)]²/2.
We compare two types of example presentation. With natural learning, examples are presented according to the sequence generated by the tent map, i.e., x(0) = {x(0), y(x(0))}, x(1) = {x(1), y(x(1))}, ..., x(n) = {x(n), y(x(n))}, with x(n + 1) = y[x(n)] and x(0) randomly drawn from the interval [0, 1]. With randomized learning, at each iteration step an input x is drawn according to the stationary distribution ρ(x), i.e., homogeneously from the interval [0, 1] (Schuster 1989), the corresponding output y(x) is computed, and the pair {x, y(x)} is presented to the network. In both cases we initialize with the same small random weights, −ε < w_i < ε. Small random weights are often recommended to prevent early saturation of the weights (Lee et al. 1991). As reported earlier (Hondou and Sawada 1994), simulations show a dramatic difference between the two learning strategies in their performance learning the tent map (cf. Figs. 1 and 2). To understand this difference, we will study the weight dynamics by local linearizations. In the neighborhood of a point w′ in weight space, the ODE from equation 3.5 can be approximated by a linearization around w′ (equation 5.1).
The weights are initialized at w(0) = O(ε), with ε ≥ 0 small. The linearization (cf. equation 5.1) around w′ = w^(0) = 0 yields an approximation of the weight dynamics during the initial stage of learning (equation 5.2), with β = 1, 2. From equation 5.2, we see that v̄₀ quickly converges to v̄₀ = 1/2 on a time scale where the other weights hardly change (cf. Fig. 3). In other words, during this stage the network just learns the average value of the target function. This is a well-known phenomenon: backpropagation tends to select the gross structures of its environment first. After the initial stage, equation 5.2 no longer provides a good approximation. The linearization (cf. equation 5.1) of the ODE around the new point w′ = w^(1) = (v₀ = 1/2, v_β = 0, w_{βi} = 0), with β = 1, 2 and i = 0, 1, describes the dynamics of the weights during the next stage,
Figure 1: Typical global error E of natural learning (full curve) and of randomized learning (dashed curve). Simulation performed with a single network. Learning parameter η = 0.1. Weight initialization: ε = 10⁻⁴. Data points are plotted every 10⁴ iterations.
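As an aside, the two example-presentation schemes whose learning curves Figure 1 compares can be sketched as follows (illustrative code, not the authors'; the tiny jitter is our addition, needed only because the double-precision tent-map orbit would otherwise collapse onto the fixed point at 0):

```python
import random

def tent(x):
    return 2.0 * (0.5 - abs(x - 0.5))

def natural_examples(n, seed=0):
    """Examples {x, y(x)} following the tent-map trajectory x(n+1) = y[x(n)],
    with x(0) drawn randomly from [0, 1]."""
    rng = random.Random(seed)
    x = rng.random()
    for _ in range(n):
        y = tent(x)
        yield x, y
        # Tiny jitter (our addition) keeps the finite-precision orbit chaotic.
        x = min(max(y + rng.uniform(-1e-9, 1e-9), 0.0), 1.0)

def randomized_examples(n, seed=0):
    """Independent examples: x drawn homogeneously from [0, 1] at each step."""
    rng = random.Random(seed)
    for _ in range(n):
        x = rng.random()
        yield x, tent(x)
```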
with β = 1, 2. At this stage, F = 0, while the Hessian H has one positive eigenvalue (λ = 1) and otherwise only zero eigenvalues. In other words, at w^(1) the weights are stuck on a plateau. To find out whether the weights can escape the plateau, we have to consider the contributions of the higher order η corrections to the weight dynamics from equations 3.13 and 3.14. Linearization of this set of equations around w^(1) yields

dw̄(t)/dt = −H(w^(1)) [w̄(t) − w^(1)] + (higher order η corrections involving the bias B and the covariance matrix Σ²),   (5.3)

together with a corresponding linearized equation (5.4) for the covariance matrix Σ².
Figure 2: Typical network result after 10⁶ iteration steps of natural learning (full curve) and randomized learning (dashed curve). The target function is the tent map (dotted curve). For simulation details, see the caption of Figure 1.
At w^(1), the (v₀, v₀) component is the only nonzero component of both the Hessian H and the diffusion D (for randomized learning as well as for natural learning). From equation 5.4, it thus follows that Σ²_{(v₀,v₀)} is the only nonzero component of the covariance matrix. So there will be fluctuations only in this direction. However, these fluctuations will be restored, due to the positive (v₀, v₀) component of the Hessian. Moreover, since Q (see equation 3.11) and its derivatives vanish, the covariance matrix Σ² does not couple with the (linearized) weight dynamics, and equation 5.3 reduces to the autonomous equation

dw̄(t)/dt = −H(w^(1)) [w̄(t) − w^(1)] − η {B(w^(1)) + ∇B(w^(1)) [w̄(t) − w^(1)]}.
With natural learning, straightforward calculations yield B(w^(1)) = 0 and ∇B(w^(1)) = 0, except for the components
Figure 3: Weights obtained by simulations for natural learning (solid curves) and randomized learning (dashed curves) as functions of the number of iterations. Averaged over 100 iterations and an ensemble of 20 networks. The theoretical predictions computed with equation 5.5 are plotted as dotted curves.

with β = 1, 2. Concentrating on the dynamics of v̄_β and w̄_{β1}, we thus obtain the linear system
(5.5) with β = 1, 2. This system has one negative eigenvalue λ₋ and one positive eigenvalue λ₊ of order η. Along the direction of the eigenvector (1, −1) corresponding to the positive eigenvalue, the weights will move away from w^(1) (cf. Fig. 3). Thus, natural learning escapes from the plateau and reaches the global minimum (cf. Figs. 1 and 2). On the other hand, for randomized learning B = 0 identically. This means that the weights of a randomized learning network are not helped by the higher order η corrections and therefore cannot escape the plateau (cf. Figs. 1 and 2). Figure 3 shows that the predictions computed with the theory agree well with the simulations of the neural network learning the tent map, and we therefore conclude that the difference in performance of the two learning strategies is well explained by the theory. The analysis of this section, supported by the simulations, shows that if the learning process suffers from a plateau, then dependencies can help learning by a nonzero B term (cf. equation 3.9) with some positive eigenvalues. Of course, the magnitude of these eigenvalues and the direction of the corresponding eigenvectors depend strongly on the problem at hand, i.e., B is probably not directed toward a global minimum for every problem. But the fact remains that a nonzero B term can make the weights move away from the plateau, which facilitates an escape, resulting in a lower error. On the other hand, if a nonzero B does not make the weights wander away, or if it does not lead to an escape, the performance of dependent learning is still not worse than the performance of randomized learning, which would also get stuck on the plateau. Another likely situation occurs if randomized learning does escape from the plateau, e.g., as a result of the fluctuations in the direction of a zero eigenvalue of the Hessian. In this situation the bias terms, which are a factor √η smaller than the fluctuations, can be neglected. In such a case dependent pattern presentation probably does not harm either, since similar fluctuations would also enhance escaping from the plateau with dependent patterns, unless the presentation order reduces the fluctuations too much. For instance, in the example of Section 4 the fluctuations are reduced if the examples are negatively correlated (q > 1/2). As a more realistic example, consider a problem with a fixed training set of P examples. A commonly used incremental learning strategy presents each example only once in each epoch of P learning steps (Haykin 1994). In other words, the patterns are arranged in a randomly ordered sequence [x(1), ..., x(P)]. It is obvious that this sequence-based or cyclic learning introduces dependencies between the examples. Moreover, the subsequent examples are negatively correlated. This follows from the fact that the probability of finding identical subsequent examples is on average at least a factor of order P smaller in cyclic learning than in randomized learning.
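The order-P difference in the probability of immediate repeats between cyclic and randomized presentation can be illustrated with a small sketch (hypothetical code, not from the paper):

```python
import random

def count_immediate_repeats(order):
    """Number of positions where the same example index appears twice in a row."""
    return sum(1 for a, b in zip(order, order[1:]) if a == b)

def cyclic_order(P, epochs, rng):
    """Cyclic learning: each epoch presents every example exactly once, freshly shuffled."""
    seq = []
    for _ in range(epochs):
        perm = list(range(P))
        rng.shuffle(perm)
        seq.extend(perm)
    return seq

def randomized_order(P, epochs, rng):
    """Randomized learning: each step draws an example independently and uniformly."""
    return [rng.randrange(P) for _ in range(P * epochs)]
```

With P = 50 examples, immediate repeats occur roughly once per P steps under randomized presentation but only across epoch boundaries under cyclic presentation, i.e., a factor of order P less often.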
Indeed, it can be shown analytically that the leading term of the fluctuations completely vanishes in cyclic learning (Heskes and Wiegerinck 1996). As a consequence, randomized learning has a much larger chance to escape from a plateau than cyclic learning. In conclusion, we recommend natural learning (with positive correlations) if the problem at hand suffers from a plateau. However, artificial dependencies introduced to reduce fluctuations are in such a case not advisable.

6 Summary and Discussion
This paper presents a quantitative analysis of on-line learning with dependent examples in a very general form. The analysis is based on two essential ingredients. One is the separation between the time scales of the example presentation and the weight dynamics: on the time scale needed for a representative sampling of the environment, the weight changes must be negligible. A separation of time scales, which can be achieved using a small learning parameter, is essential in on-line learning to prevent overspecialization on single examples. The other essential ingredient is the assumption that the weights can be described by their average value with small superimposed fluctuations. In other words, the theory is locally valid, and may therefore not be suited for quantitative computations of global properties of the learning process, such as the stationary distribution of weights or the escape time out of a local minimum. However, even a local theory can be useful to understand some aspects of global properties (Finnoff 1994). Our study of learning on plateaus is an example of a local analysis of on-line learning that accounts for huge, nonlocal effects (Section 5). In Section 3 we heuristically derived the first terms in a hierarchy of deterministic differential equations approximating the stochastic learning process. The leading term, the ODE term, contains only information about the stationary distribution of the examples. Dependencies between successive examples do not enter until the first correction to the ODE term. This implies that, in general, when the ODE term is dominant, learning with dependent examples and learning with randomized examples are alike. The dependencies between examples merely act as corrections on the learning process, both in the fluctuations and in the bias. A rigorous derivation of the leading ODE term and of the Wiener process describing the fluctuations can be found in Benveniste et al. (1987) and Kuan and White (1994). To our knowledge, a rigorous derivation of higher order terms, such as the bias term in equation 3.20, has not been given before. In Section 4 we focused on the asymptotic convergence of the learning process in terms of representation error and prediction error. The representation error is the expected average error of the network with respect to the whole environment. It measures how well the environment is represented by the network after learning.
The prediction error is the network's expected average error with respect to the next presented example. It can be viewed as a measure of the irregularity of the example presentation. A remarkable relation between representation and prediction error is that the more predictable the examples, the larger the representation error. In Section 5 we studied on-line learning with a plateau. Plateaus are flat spots on the error surface that can severely slow down the learning process. In particular, backpropagation for multilayer perceptrons often suffers from plateaus. On a plateau the ODE contribution vanishes. The higher order terms, which contain the dependencies, therefore dominate the learning process. Simulations of a multilayer perceptron with backpropagation learning the tent map demonstrate that dependencies between successive examples can dramatically improve the final learning result. This phenomenon is explained by our analysis, which shows that randomized learning gets stuck on a plateau, whereas the dependencies in natural learning cause the escape from the plateau. Predictions computed with the theory agree well with the simulations. At the end of
this section we motivated our conjecture that if backpropagation suffers from plateaus, then dependencies (with positive correlations) in example presentation can be helpful, and at least will not do any harm. At this point, we remark that this paper focuses only on the learning process. In practical cases one often has access to a limited number of training data. In such a case, at the global minimum the network model might overfit the data, and this training optimum may therefore not be optimal for generalization purposes (Chauvin 1990). Actually, Hochreiter and Schmidhuber (1995) present an algorithm that searches for flat minima to achieve better generalization. However, issues of generalization and overfitting on a limited set of training examples, though important and interesting, are beyond the scope of the current paper. For convenience, the paper has been restricted to learning with a constant learning parameter in a stationary environment, i.e., the transition probability τ(x|x′) between successive examples is independent of time. The theory can be extended straightforwardly to learning with time-dependent learning parameters η(t) in a changing environment (Benveniste et al. 1987; Heskes and Kappen 1992), i.e., with a time-dependent transition probability τ(x|x′; t), as long as the time scales of the learning parameter and the changing environment are large compared to the time scale of the learning process, and as long as this last time scale remains large compared to the time scale of the example presentation. As a consequence, time-dependent example selection techniques (Munro 1992; Cachin 1994; Ludik and Cloete 1994), possibly combined with a time-dependent learning parameter, may be devised and evaluated analytically. For instance, one can think of a scheme starting with dependencies designed to avoid plateaus and continuing in a later stage with dependencies designed for the fine tuning around minima.
Perhaps such schemes will relate to common sense, like the pedagogical idea that the presentation of examples should start simple and gradually increase in complexity. In fact, as long as the three previously mentioned time scales remain separated, the theory may also include weight-dependent transition probabilities τ(x|x′; w, t) (Benveniste et al. 1987). The vector x does not necessarily represent an example. It may have components describing other fast variables. For instance, fast variables have been utilized to study learning with momentum (Wiegerinck et al. 1994), where the adaptation rule does not satisfy equation 2.1. Other obvious candidates for fast variables in neural network theory are rapidly changing neuron states in recurrent networks. Thus, our framework may be applied to the analysis of the joint dynamics of neurons and weights (Penney et al. 1993). In conclusion, the techniques for the local approximation of stochastic processes with separate time scales prove to be powerful tools for the analysis of on-line learning in neural networks.
Acknowledgments

We thank the referees for their useful suggestions.
References

Amari, S. 1967. A theory of adaptive pattern classifiers. IEEE Transact. Electronic Comput. 16, 299-307.
Barnard, E. 1992. Optimization for training neural nets. IEEE Transact. Neural Networks 3, 232-240.
Benveniste, A., Métivier, M., and Priouret, P. 1987. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, Berlin.
Brunak, S., Engelbrecht, J., and Knudsen, S. 1990. Cleaning up gene databases. Nature (London) 343, 123.
Cachin, C. 1994. Pedagogical pattern selection strategies. Neural Networks 7, 175-181.
Chauvin, Y. 1990. Generalization performance of overtrained back-propagation networks. In Lecture Notes in Computer Science, Vol. 412, L. Almeida and C. Wellekens, eds., pp. 46-55. Springer-Verlag, Berlin.
Finnoff, W. 1994. Diffusion approximations for the constant learning rate backpropagation algorithm and resistance to local minima. Neural Comp. 6, 285-295.
Hansen, L., Pathria, R., and Salamon, P. 1993. Stochastic dynamics of supervised learning. J. Phys. A 26, 63-71.
Haykin, S. 1994. Neural Networks: A Comprehensive Foundation. Macmillan, Hamilton, Ontario.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Heskes, T. 1994. On Fokker-Planck approximations of on-line learning processes. J. Phys. A 27, 5145-5160.
Heskes, T., and Kappen, B. 1991. Learning processes in neural networks. Phys. Rev. A 44, 2718-2726.
Heskes, T., and Kappen, B. 1992. Learning-parameter adjustment in neural networks. Phys. Rev. A 45, 8885-8893.
Heskes, T., and Wiegerinck, W. 1996. A theoretical comparison of batch-mode, on-line, cyclic, and almost cyclic learning. IEEE Trans. Neural Networks 7(4).
Hochreiter, S., and Schmidhuber, J. 1995. Simplifying neural nets by discovering flat minima. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. Touretzky, and T. Leen, eds. Morgan Kaufmann, San Mateo, CA.
Hondou, T., and Sawada, Y. 1994. Analysis of learning processes of chaotic time series by neural networks. Prog. Theoret. Phys. 91, 397-402.
Hoptroff, R. 1993. The principles and practice of time series forecasting and business modelling using neural nets. Neural Comput. Appl. 1, 59-66.
Hush, D., Horne, B., and Salas, J. 1992. Error surfaces for multilayer perceptrons. IEEE Transact. Syst. Man Cybern. 22, 1152-1161.
Kohonen, T. 1982. Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59-69.
Kuan, C., and White, H. 1994. Artificial neural networks: An econometric perspective. Econometric Rev. 13.
Lapedes, A., and Farber, R. 1988. How neural networks work. In Neural Information Processing Systems, D. Anderson, ed., pp. 442-456. American Institute of Physics, New York.
Lee, Y., Oh, S., and Kim, M. 1991. The effect of initial weights on premature saturation in backpropagation learning. Int. Joint Conf. Neural Networks, pp. 765-770. IEEE.
Leen, T., and Moody, J. 1992. Weight space probability densities in stochastic learning: I. Dynamics and equilibria. In Advances in Neural Information Processing Systems 5, S. Hanson, J. Cowan, and L. Giles, eds., pp. 451-458. Morgan Kaufmann, San Mateo, CA.
Ludik, J., and Cloete, I. 1994. Incremental increased complexity training. In Proceedings of the European Symposium on Artificial Neural Networks '94, M. Verleysen, ed., pp. 161-165. D Facto, Brussels.
Mpitsos, G., and Burton, M. 1992. Convergence and divergence in neural networks: Processing of chaos and biological analogy. Neural Networks 5, 605-625.
Munro, P. 1992. Repeat until bored: A pattern selection strategy. In Advances in Neural Information Processing Systems 4, J. Moody, S. Hanson, and R. Lippmann, eds., pp. 1001-1008. Morgan Kaufmann, San Mateo, CA.
Orr, G., and Leen, T. 1992. Weight space probability densities in stochastic learning: II. Transients and basin hopping times. In Advances in Neural Information Processing Systems 5, S. Hanson, J. Cowan, and L. Giles, eds., pp. 507-514. Morgan Kaufmann, San Mateo, CA.
Penney, W., Coolen, A., and Sherrington, D. 1993. Coupled dynamics of fast spins and slow interactions in neural networks and spin systems. J. Phys. A 26, 3681-3695.
Radons, G. 1993. On stochastic dynamics of supervised learning. J. Phys. A 26, 3455-3461.
Ritter, H., and Schulten, K. 1988. Convergence properties of Kohonen's topology conserving maps: Fluctuations, stability, and dimension selection. Biol. Cybern. 60, 59-71.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536.
Schuster, H. 1989. Deterministic Chaos, 2nd rev. ed. VCH, Weinheim.
van Kampen, N. 1992. Stochastic Processes in Physics and Chemistry. North-Holland, Amsterdam.
Weigend, A., and Gershenfeld, N., eds. 1993. Predicting the Future and Understanding the Past: A Comparison of Approaches. Addison-Wesley, Reading, MA.
Weigend, A., Huberman, B., and Rumelhart, D. 1990. Predicting the future: A connectionist approach. Int. J. Neural Syst. 1, 193-209.
Werbos, P. 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University.
White, H. 1989. Some asymptotic results for learning in single hidden-layer feedforward network models. J. Amer. Stat. Assoc. 84, 1003-1013.
Wiegerinck, W., and Heskes, T. 1994. On-line learning with time-correlated patterns. Europhys. Lett. 28, 451-455.
Wiegerinck, W., Komoda, A., and Heskes, T. 1994. Stochastic dynamics of learning with momentum in neural networks. J. Phys. A 27, 4425-4437.
Wong, F. 1991. Time series forecasting using backpropagation networks. Neurocomputing, 147-159.
Received January 27, 1995; accepted February 15, 1996.
Communicated by Drew Van Camp
Autonomous Design of Artificial Neural Networks by Neurex

François Michaud and Ruben Gonzalez Rubio
Department of Electrical and Computer Engineering, Université de Sherbrooke, Sherbrooke (Québec), Canada J1K 2R1

Artificial neural networks (ANN) have been demonstrated to be increasingly useful for complex problems difficult to solve with conventional methods. With their learning abilities, they avoid having to develop a mathematical model or to acquire the appropriate knowledge to solve a task. The difficulty now lies in the ANN design process. Many choices must be made to design an ANN, and there are no available design rules to make these choices directly for a particular problem. Therefore, the design of an ANN demands a certain number of iterations, mainly guided by the expertise and the intuition of the developer. To automate the ANN design process, we have developed Neurex, composed of an expert system and an ANN simulator. Neurex autonomously guides the iterative ANN design process. Its structure tries to reproduce the design steps done by a human expert in conceiving an ANN. As a whole, the Neurex structure serves as a framework to implement this expertise for different learning paradigms. This article presents the system's general characteristics and its use in designing ANN using the standard backpropagation learning law.

1 Introduction
In the past decade, artificial neural networks (ANN) have demonstrated their growing abilities in solving complex, nonlinear applications that are difficult to handle with more conventional mathematical or procedural models. Their principal advantage is their learning ability, which makes it possible to acquire the appropriate knowledge directly from data to solve a given task. It is then unnecessary to develop a mathematical model or to extract the appropriate knowledge from experts. But this difficulty is transferred to the ANN design process. There are no available ANN design rules that allow us to determine directly the best learning law and its associated design parameters for any given task. Therefore, the design of ANN for a particular application demands a certain number of iterations to find good design parameters that can be used to train an ANN capable of achieving the specified performance objectives dictated

Neural Computation 8, 1767-1786 (1996) © 1996 Massachusetts Institute of Technology
by the application. This empirical method is greatly influenced by the expertise and the intuition of the developer. For an unskilled user, or even for an expert confronted with a new application, finding an adequate ANN by following this iterative process can be a laborious task. Therefore, it would be interesting to develop a system that could automatically and intelligently control the design phases of ANN, alleviating the developer's task. In that regard, we present in this article an autonomous system capable of designing ANN by reproducing the skills of an expert developer. It is composed of an expert system coupled with an ANN simulator. The useful design expertise for developing ANN is implemented in these two constituents. The expert system is responsible for choosing the design parameters based on its analysis of simulation results. According to the expert system's demands, the ANN simulator trains and validates ANN, tries to detect harmful conditions during the training phase, and gives back useful information to the expert system (Michaud 1993). The overall system was named Neurex (NEUral EXpert). To validate the underlying Neurex principles, we have restricted our study to the standard backpropagation learning rule with momentum descent. This paper presents the general characteristics of Neurex and its use in developing standard backpropagation ANN. Section 2 presents the Neurex architecture and its two major constituents used to supervise the ANN design process autonomously. Sections 3 and 4 describe the expertise and characteristics implemented in Neurex for standard backpropagation ANN design, the first learning law incorporated in it. Results are presented in Section 5. Finally, conclusions and future research directions are drawn in Section 6.

2 Autonomous ANN Design Process Supervised by Neurex
Given suitable learning conditions, ANNs have the ability to learn the mapping function of an application directly from data. However, ANN flexibility makes it difficult to know a priori the appropriate design choices (Caudill 1991). There is a great variety of ANN models, learning paradigms, and analysis tools, such as the ones implemented in commercial simulators like Neuralware (Neuralware 1993) or MacBrain. The ANN designer must then find suitable design conditions through iterative experimentation and proper analysis of simulation results. Figure 1 represents the ANN development methodology commonly followed by ANN designers. This design process is made difficult because each decision step has unknown influences on the others. Design choices must then progress according to the experimentation results, making it difficult to develop mathematical ANN design formulas to determine directly the design choices for any given task. The design process is even more difficult for inexperienced users, who must familiarize themselves with the technique of artificial neural networks. Users must also learn, through experimentation, the design choice interdependencies and their influences on ANN performance. Even if commercial simulators like Neuralware and MacBrain facilitate access to ANN with their user-friendly interfaces and design tools, these simulators cannot make the design choices for the designer. In addition, some researchers have noted the lack of standardization in the ANN design process and the need to automate it (Lendaris 1990) for faster development and greater accessibility. To our knowledge, two other systems tackle the problem of an autonomous ANN design process. First, Wah and Kriplani (1990) use an expert system to automate the process of standard backpropagation design. The expert system is used as a search mechanism to find the best learning configuration from the set of possible design parameters given by the user. Design configurations are simulated simultaneously and are evaluated after each time quantum in order to select the most promising network configurations.

[Figure 1 is a flowchart: Task Specification; Collection of Training and Testing Data; Choice of the Learning Law and ANN Model; Result Analysis (steps controlled by Neurex); Appropriate ANN Configuration.]

Figure 1: ANN design methodology.
[Figure 2 is a block diagram: the user provides application information to the expert system; the expert system exchanges configurations and results with the ANN simulator; the optimal configuration is returned to the user.]

Figure 2: Neurex architecture.
The other system uses a genetic algorithm as a general and systematic method for ANN design (Harp et al. 1990). The ANN design parameter configurations are chosen by genetic operations according to simulation results and performance criteria for the given application. This approach aims at model-independent ANN design but is also validated only for the standard backpropagation learning rule. It seems to work well, but one of its limitations is that its efficiency greatly depends on the initial population of design configurations. These systems do not benefit from the expertise commonly used by ANN designers. In an attempt to do so, we developed Neurex. According to the classification of neurosymbolic systems from Hilario (1995), Neurex falls in the category of a functional hybrid system with metareasoning coupling. As shown in Figure 2, Neurex is composed of an expert system and an ANN simulator. The expert system is implemented using the flex expert system shell¹ from Quintus, while the ANN simulator is written in C. Control exchange is realized simply by a function call from the expert system to the ANN simulator, and information exchange is made using text files. By using these two constituents, Neurex is responsible for the design choices unrelated to the task, as illustrated in Figure 1. It does so by trying to reproduce two types of expertise used in the design of ANN. The first type is expertise for ANN design. This expertise can be implemented by rules in the expert system in a three-step process:

¹ Implemented on top of Prolog.
1. Get application characteristics from the user. These characteristics, such as type, training and testing success criteria,² and data set information, influence the following choices.

2. Select the learning law, the ANN model, and the initial design parameters. These choices can be made directly by the user, allowing Neurex to benefit from the user's expertise. The expert system can also search its database of previous cases for applications that show some similarity to the current task and use their design parameters to make these choices. A third option is to let the expert system propose default values for a learning paradigm based on the general characteristics of the task.

3. Activate the design heuristic for the chosen learning paradigm. The design heuristic supervises the design process iterations by analyzing simulation results and selecting the appropriate design parameters based on performance tendencies observed during the iterative design choices. The ANN simulator is activated by the expert system design heuristic to simulate each new ANN design configuration.

When the iterative heuristic-controlled design process is completed, the expert system displays the best configuration found and memorizes the application characteristics and results in its database for future reference.
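The heuristic-driven iteration described in these steps can be caricatured by the following sketch; all names (`train_and_validate`, `propose_next`, `good_enough`) are hypothetical stand-ins, not Neurex's actual interfaces:

```python
def design_loop(initial_params, train_and_validate, propose_next,
                good_enough, max_iterations=20):
    """Iterate: simulate a configuration, analyze the result, and let a
    design heuristic propose the next parameters, keeping the best so far."""
    params = initial_params
    best_params, best_score = None, float("-inf")
    for _ in range(max_iterations):
        score, diagnostics = train_and_validate(params)   # ANN simulator
        if score > best_score:
            best_params, best_score = params, score
        if good_enough(score):                            # success criteria met
            break
        params = propose_next(params, diagnostics)        # design heuristic
    return best_params, best_score
```

As a toy usage example, `design_loop(0, lambda p: (-(p - 7) ** 2, None), lambda p, d: p + 1, lambda s: s >= 0)` walks a scalar "parameter" upward until the score criterion is met.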
Training expertise is the second type. ANN design time is mainly spent on ANN training. Therefore, to optimize the design process time, it is essential to train ANN efficiently. To do so, the human ANN designer can analyze factors during training to detect harmful situations.³ The Neurex ANN simulator is responsible for performing these kinds of analysis and stopping the training process when problem situations are detected. It also provides the relevant information needed for the expert system's design supervision. Unlike common ANN simulation software, our ANN simulator does not have interactive software tools to help the human developer; rather, it supervises the simulations by itself. It is developed for autonomous ANN simulation according to the expertise implemented in its supervision modes (Michaud et al. 1995).

3 Standard Backpropagation Design Characteristics Considered by Neurex
To validate the usefulness of Neurex, it is important to verify the possibility of extracting and implementing design expertise for a learning law that presents some difficulties in the design process. We have chosen to use the standard backpropagation learning law (Rumelhart et al. 1986) with momentum descent and batch processing, using a fixed learning rate and momentum.

² Such as the error tolerance or the root mean square (RMS) error.
³ Like the occurrence of oscillations or plateaus in the RMS learning curve.

1772
François Michaud and Ruben Gonzalez Rubio
Standard backpropagation is a simple learning law that has proved useful in many different types of applications. It must, however, cope with many difficulties such as local minima (Hecht-Nielsen 1989) and slow convergence. Using backpropagation learning makes it necessary to determine many design parameters, like the number of hidden units (H), the range of the random initial weights (±r), the learning rate (ε), and the momentum (α). In addition, it seems practically impossible to establish general scaling relations to adjust these parameters as the size and the complexity of the learning task grow (Kung and Hwang 1988; Plaut and Nowlan 1986; Tesauro and Janssens 1988). Finally, all of these factors influence the ANN generalization ability, which cannot be known precisely in advance. Variations of the backpropagation learning law try to adjust these parameters dynamically (Drago and Ridella 1992; Fahlman and Lebiere 1989; Jacobs 1987). These kinds of learning paradigms simplify the ANN design process. Therefore, for validating the concept of autonomous ANN design according to the underlying principles of Neurex, it is preferable to use a learning paradigm with many design parameters that are not dynamically modified. In doing so, we could demonstrate the usefulness of the approach for difficult and varied ANN design conditions. Consequently, because of its popularity, advantages, and problems, this learning law has revealed itself to be a great starting point for the development of Neurex. For the other ANN characteristics, we have considered only feedforward, fully connected ANNs with bias units. The activation function for hidden and output neurons is the conventional sigmoid function varying between 0 and 1. Also, only three layers (input, hidden, and output) of units are used.
This restriction has been made only to simplify the design process by not having to determine the number of layers, and because it has been proved by Hecht-Nielsen (1989) that three-layer ANNs can solve the same kinds of problems as multihidden-layer ANNs. All of these design choices must not be interpreted as an evaluation concerning the best learning paradigm to use for ANN training. Our main objective is rather to demonstrate the possibility of intelligently and automatically supervising the ANN design process by implementing useful design expertise in Neurex. The design parameters associated with backpropagation ANNs are those that must be optimized during the design process. In addition to the H, r, ε, and α design parameters,⁴ two other parameters are used for training specifications. First, the maximum number of training epochs (NE) must be set to stop the learning phase when it seems impossible for the ANN to fulfill the training success criterion. Second, because the weights are initialized randomly, one trial cannot prove the validity of the simulation results (Fahlman 1988). To ensure this, a number of trials (NT) must be set for each design configuration.

⁴ See the Appendix for a list of symbols used in this article.
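For reference, the fixed learning rate ε and momentum α enter training through the standard momentum update Δw = −ε∇E + αΔw_prev, and the other design parameters (H, r, NE, NT) govern the shape, initialization, and duration of the runs. The toy example below is a generic sketch of that update on a trivial objective, not Neurex code; the particular ε and α values are arbitrary.

```python
import numpy as np

def momentum_step(weights, grad, velocity, eps, alpha):
    """One batch update with momentum descent:
    delta_w = -eps * grad + alpha * previous_delta_w."""
    velocity = -eps * grad + alpha * velocity
    return weights + velocity, velocity

# Where the other design parameters act (toy objective: minimize ||w||^2):
rng = np.random.default_rng(0)
r = 0.5                                  # initial weights drawn from [-r, r]
w = rng.uniform(-r, r, size=4)           # H would set the true layer sizes
v = np.zeros_like(w)
for epoch in range(100):                 # NE caps this loop; NT repeats it
    grad = 2.0 * w                       # gradient of the toy objective
    w, v = momentum_step(w, grad, v, eps=0.1, alpha=0.5)
```

In a real run, each of the NT trials would redraw the initial weights from [−r, r] before repeating this loop.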
4 Design Expertise Used by Neurex for Standard Backpropagation
The design expertise for a learning law can be taken from two sources: the scientific literature concerning the learning paradigms, or directly from exhaustive experimentation on different typical tasks. To our knowledge, only a few research studies have analyzed the interdependencies among the backpropagation paradigm parameters regarding their tuning and scaling for different applications. Some parameters are kept constant, while relationships among the others, representing their influences on learning, are established (Fahlman 1988; Kamiyama et al. 1992; Sundararajan et al. 1993; Tesauro and Janssens 1988). Only a fraction of the paradigm parameters are then considered in these relationships. This partial view cannot be used to establish efficient and general influences among the parameters. Nevertheless, general relations representing all the parameter dependencies can be enlightening to guide the iterations in the ANN design process for standard backpropagation. That is why we have chosen to carry out our own empirical study to develop the design expertise for this learning law (Michaud et al. 1993). To establish these relations, we studied four simple cases, each used to explore different aspects of standard backpropagation learning with momentum descent. We began by studying two binary encoder-decoder problems (Fahlman 1988). The first is of dimension 10 and uses an error tolerance of 40% as training and testing success criteria. The other is a complement encoder problem of dimension 8, using an error tolerance of 30% as training and testing success criteria; this is a more complex problem than the first, using more active units and a more severe success criterion. The other two tasks are pixel classification problems. The first has to determine whether an (x, y) coordinate is in a diamond-shaped region.
The diamond-shaped recognition task has been exploited to verify the general characteristics extracted from the analysis of the encoder problems for an application using more training patterns. The second classification task is a two-spiral problem (Lang and Witbrock 1988) using double-loop spirals. This problem has been used to explore the learning process with a more complex task. Using simple cases made it possible to simulate a multitude of configurations in a reasonable time where we extracted useful information on design parameter dependencies and on harmful learning behavior reflected in the learning curve. But although these four simple cases may not represent all ANN design conditions for standard backpropagation learning with momentum descent, care was taken to ensure the generality of the expertise extracted. This was done by considering performance tendencies during training or during design choices instead of evaluating performance according to preestablished or fixed criteria. Results presented in Section 5 demonstrate this ability of the expertise extracted from the four simple cases to supervise ANN design for more complex tasks.
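The pixel classification tasks above are easy to reproduce. Here is a minimal, hypothetical data generator for the diamond-shaped region task; the region of the form |x| + |y| ≤ c, the threshold c, and the number of training points are all assumptions, since the article does not specify them.

```python
import numpy as np

def diamond_labels(points, c=0.5):
    """Label each (x, y) point 1 if it lies inside the diamond |x| + |y| <= c,
    else 0. The threshold c is an assumption; the article does not give it."""
    return (np.abs(points[:, 0]) + np.abs(points[:, 1]) <= c).astype(int)

rng = np.random.default_rng(1)
pts = rng.uniform(-1.0, 1.0, size=(200, 2))   # random (x, y) coordinates
labels = diamond_labels(pts)
# Each coordinate pair with its label forms one training vector for the ANN.
```

The two-spiral task would be generated analogously, with points sampled along two interleaved spiral arms instead of a uniform square.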
4.1 Criteria for Evaluation of the Learning Behavior and Performances. During our experimentation, we noticed that analysis of simulation results, for configurations with convergent or non-convergent trials, can give important information useful in making the design choices. In order to compare configurations having convergent trials, we established a list of important performance characteristics: minimizing the number of training epochs for convergence and the number of hidden units, maximizing the number of convergent trials and the testing performance, and favoring training similarities among trials of the same configuration. These characteristics allowed us to define empirical factors to evaluate the best configuration obtained. These factors are general enough to represent the various situations occurring during the ANN design process of any task. The first factor examines the compromise between the mean number of training epochs (MNEC) and the average training success ratio (%NC) for convergent trials. We named it the factorized mean (FM), defined by equation 4.1:
FM = MNEC / %NC    (4.1)

The training success ratio is the number of convergent trials divided by the number of trials NT. This factor can be used to compare configuration performances and to determine the appropriate value of NT by increasing it until the FM factor stabilizes (within a 5% margin) between succeeding trials. The second factor can be used to select the best configuration and to optimize the number of training epochs, the training success ratio, and convergence stability.⁵ We named it the comparison factor (CF), defined by equation 4.2, where VNEC corresponds to the standard deviation of the number of training epochs for convergent trials:

CF = (MNEC / NE + VNEC / MNEC) (1 + 1 / %NC)    (4.2)
This factor can be used to choose the best configuration among those with the same number of hidden units by taking the one with the minimum value. By dividing MNEC by NE, CF normalizes the training time according to a common parameter for all configurations compared, adapting the formula to any application. The ratio of VNEC to MNEC reflects the need for training stability between trials, even when rapid convergence is observed. Finally, these two ratios are multiplied by a term favoring configurations with a high number of convergent trials. The multiplication is used to let the compromise between the minimization of the training epochs and the maximization of the number of convergent trials be reflected in the number of training epochs for convergence and in the standard deviation.

⁵ I.e., a small variation of the number of training epochs for convergence.
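Assuming the reconstructed forms FM = MNEC/%NC and CF = (MNEC/NE + VNEC/MNEC)(1 + 1/%NC) read from the surrounding definitions, both factors are direct to compute. As a sanity check, the sketch below reproduces the FM and CF values reported for configuration spir92 in Figure 4.

```python
def factorized_mean(mnec, pct_nc):
    """FM (equation 4.1): mean training epochs for convergent trials
    divided by the average training success ratio (a fraction in (0, 1])."""
    return mnec / pct_nc

def comparison_factor(mnec, vnec, ne, pct_nc):
    """CF (equation 4.2): normalized training time plus a stability term,
    weighted to favor a high number of convergent trials."""
    return (mnec / ne + vnec / mnec) * (1.0 + 1.0 / pct_nc)

# Sanity check against configuration spir92 of Figure 4: MNEC = 1835.5,
# VNEC = 57.5, NE = 30720, and 2 convergent trials out of NT = 12.
cf_spir92 = comparison_factor(1835.5, 57.5, 30720, 2 / 12)   # reported CF = 0.637532
fm_spir92 = factorized_mean(1835.5, 2 / 12)                  # reported FM = 11013.0
```

That the reported CF and FM values come out of these formulas to within rounding supports the reconstruction of equations 4.1 and 4.2.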
The normalized RMS error (RMSnorm), defined as the sum of the squared error normalized by the number of training patterns and the number of output units, is useful for analyzing the training process in non-convergent configurations and for allowing similar comparisons between applications. These analyses make it possible to detect harmful situations like oscillations, plateaus, and divergence during training. These are the conditions detected by the ANN simulator supervision mode. The slope of the RMSnorm can also be useful in selecting an adequate value for the maximum number of training epochs (NE). All of these analytical aspects represent the part of the design expertise implemented in the ANN simulator. The factors FM and CF are part of the results given by the simulator. The detection of harmful learning situations and the determination of the appropriate NE and NT values are implemented as supervision modes in the ANN simulator (Michaud et al. 1995). The four simple cases used to extract useful information for the ANN design process are not suited for the study of the generalization ability of the ANN. To consider this characteristic in the design expertise, we consulted scientific research studies on generalization with backpropagation learning and found that ANN generalization depends primarily on the training sets (Burrascano 1992), the number of training epochs (which is influenced by the training success criterion), and the number of hidden units. These three aspects are all interdependent. However, in view of the ANN generalization ability, the only criterion that can be optimized by Neurex is the number of hidden units, because the training set and the training success criterion are fixed by the user at the start of the design process.
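The harmful-situation analyses described above can be sketched as simple tests on the tail of the RMSnorm learning curve. The window size and thresholds below are illustrative assumptions, not Neurex's actual supervision-mode criteria.

```python
def detect_harmful(rms_curve, window=10, plateau_tol=1e-4):
    """Classify the tail of a normalized-RMS learning curve.

    Returns 'oscillation' if the error keeps alternating direction in the
    window, 'plateau' if it is nearly flat, otherwise None (training may
    continue). Window and tolerances are illustrative only.
    """
    tail = rms_curve[-window:]
    diffs = [b - a for a, b in zip(tail, tail[1:])]
    # Count direction reversals between consecutive error changes.
    sign_changes = sum(1 for d1, d2 in zip(diffs, diffs[1:]) if d1 * d2 < 0)
    if sign_changes >= len(diffs) // 2:
        return "oscillation"
    if max(tail) - min(tail) < plateau_tol:
        return "plateau"
    return None
```

A simulator built around such a test can abandon a hopeless trial after a few epochs instead of running it to the NE limit, which is the source of the training-time savings discussed in Section 5.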
Therefore, we developed our design heuristic to find the minimal number of hidden units, giving good generalization ability to the ANN (Lendaris 1990) while allowing it to represent all the training vectors according to the training success criterion. This condition is restrictive because an ANN can have good generalization ability without being able to learn the whole training set, because of a small number of difficult training vectors. The performance of our design heuristic may then be affected by the training success criterion and the training vectors chosen by the Neurex user. To evaluate ANN generalization ability, a third factor has been defined. It is named the global factor (GF), defined by equation 4.3, where %TS is the average testing success ratio:⁶
GF = (CF + H / Hmin) (1 + 1 / %TS)    (4.3)
This formula is for minimizing H and CF while maximizing the testing success ratio associated with generalization. This factor is evaluated by the expert system because it uses the Hmin value found by the heuristic.
⁶ The percentage of test patterns that respect the testing success criterion when presented to the ANN during testing.
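Assuming the reconstructed form GF = (CF + H/Hmin)(1 + 1/%TS), the global factor can be computed as follows; the example values are arbitrary, not taken from the article's experiments.

```python
def global_factor(cf, h, h_min, pct_ts):
    """GF (equation 4.3): small for configurations that combine a low CF,
    few hidden units beyond H_min, and a high testing success ratio %TS
    (a fraction in (0, 1])."""
    return (cf + h / h_min) * (1.0 + 1.0 / pct_ts)

# Adding hidden units beyond H_min worsens (raises) GF when CF and %TS
# do not improve:
baseline = global_factor(0.5, 4, 4, 1.0)    # (0.5 + 1) * 2 = 3.0
larger_net = global_factor(0.5, 8, 4, 1.0)
```

This matches the heuristic's stopping rule: once GF stops improving as H grows, larger networks are not worth simulating.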
Figure 3: Design heuristic process. [Diagram: initial parameter selection produces configurations to be simulated; the ANN simulator returns results, whose analysis selects the next configurations.]
4.2 Design Heuristic for Standard Backpropagation. The process followed by the design heuristic is presented in Figure 3. To start the search process, the design heuristic must have an initial parameter configuration, as explained in Section 2. The initial configuration is only a starting point of the design iterations: initial design parameters close to the optimal ones are not a prerequisite for Neurex's success, but could help reduce design time. Parameter configurations are then given to the ANN simulator. When the simulations are finished, results are analyzed and the design parameters are modified for the simulation of new configurations, until good ANN design parameters for the task are found. For the supervision of the design process and in order to find the design parameters efficiently, the standard backpropagation design heuristic has been divided into two phases: the exploratory search phase and the intensive search phase.
4.2.1 Exploratory Search Phase. This first phase tries to determine Hmin, NE for this value of H, NT, and the appropriate search ranges for both the learning rate and the random initial weights parameters. During this search phase, the momentum is kept at zero because it can have unsuitable effects on training if it is initialized too large in relation to ε, r, and H. To find Hmin, if convergent trials exist in the configurations simulated, the heuristic decreases the number of hidden units and continues the exploratory search phase. The value NE is also increased to ensure enough training time for the convergence of ANNs with a smaller number of hidden units, if possible. If there are only non-convergent configurations,
rules are used to analyze simulation results in order to propose new configurations to simulate. Three overlapping condition loops are used:

1. Learning rate conditions. The goal is to find the best value of ε for fixed r and H. With these rules, configuration comparisons are done in threes based on their learning rates, which are separated by a common f_ε factor initialized at the start of the search.⁷ Oscillations,⁸ plateaus, divergence, and performance trends⁹ for the compared configurations can be taken into consideration to help determine the correct modification of the learning rate. In addition, if the training stops with high RMSnorm slopes, the heuristic may call the NE supervision mode of the ANN simulator to let it find a better value for the maximum number of training epochs.

2. Random initial weights conditions, with fixed H. A greater or lower value of r is determined using an f_r factor¹⁰ according to the past explorations, and the learning rate conditions are again verified. This process is repeated until the MPT factor stops improving.

3. Hidden unit condition. This happens when it is not possible to find convergent trials for various ε and r values. In this case, the number of hidden units must be increased according to the complexity of the task and the best MPT found. The modification conditions for ε and r are then reactivated.

Before starting the intensive search phase, the heuristic requests a simulation with the NT supervision mode to let the ANN simulator find the appropriate value for this design parameter. In all, 13 rules are used for the exploratory search phase.
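Condition loop 1 above can be sketched as follows. The three-rate comparison uses the f_ε factor from the text, but the decision rules shown (shrink ε when all three configurations oscillate, otherwise keep the trend winner) are simplified, hypothetical stand-ins for the article's 13-rule base.

```python
def next_learning_rates(eps_mid, f_eps=2.0):
    """Three learning rates separated by the common factor f_eps (loop 1)."""
    return [eps_mid / f_eps, eps_mid, eps_mid * f_eps]

def adjust_eps(results):
    """Pick the next centre eps from three simulated configurations.
    `results` maps eps -> dict with an 'oscillated' flag and an 'mpt' score.
    Simplified, hypothetical stand-in for Neurex's rule base."""
    if all(r["oscillated"] for r in results.values()):
        return min(results) / 2.0          # everything oscillates: decrease eps
    # otherwise follow the performance trend: keep the best-scoring eps
    return max(results, key=lambda e: results[e]["mpt"])

# Toy comparison: the smallest rate scores the highest MPT, so it is kept.
trial = {e: {"oscillated": e > 0.5, "mpt": 90 - 40 * abs(e - 0.25)}
         for e in next_learning_rates(0.5)}
best_eps = adjust_eps(trial)
```

In Neurex, this triple-wise comparison is interleaved with the r and H condition loops rather than run in isolation.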
4.2.2 Intensive Search Phase. The intensive search phase tries to find the best design configurations for the given task. It first tries to find the momentum, followed by the learning rate and the random initial weights parameters. This process uses overlapping condition loops similar to those of the exploratory search phase. Momentum modifications are done to minimize the factorized mean (FM) (equation 4.1), while modifications of ε and r try to minimize the comparison factor (CF) (equation 4.2). The intensive search phase starts with Hmin and increases the number of hidden units until the global factor (GF) (equation 4.3) stabilizes after the modifications of α, ε, and r are completed. The search starts with the best configuration found by the exploratory search phase. The learning rate increment and the random initial weights increment are determined according to the search ranges from the exploratory search phase. In all, 14 rules are used by the intensive search phase.

⁷ f_ε = 2 is used.
⁸ For example, oscillations suggest decreasing the learning rate.
⁹ For example, if the maximum percentage of training vectors reaching the training success criterion (MPT) for the configuration with the smallest learning rate is greater than for the other two configurations, then the learning rate must be decreased.
¹⁰ f_r = 2 is used.

RESULTS CONFIG spir92: H = 8, r = 2.0, ε = 0.25, α = 0.9, NE = 30720, NT = 12
CONVERGENT TRIALS: Total = 2, MNEC = 1835.5, VNEC = 57.5, %TS = 100.0, CF = 0.637532, FM = 11013.0
NON-CONVERGENT TRIALS: Total = 10, Oscillations = 7, Plateaus = 3, AverageTrainingEpoch = 5120.7, MPT = 89.0625%

RESULTS CONFIG spir2: H = 6, r = 0.5, ε = 1.0, α = 0, NE = 30720, NT = 5
NON-CONVERGENT TRIALS: Total = 5, Oscillations = 5, Plateaus = 0, AverageTrainingEpoch = 122.0, MPT = 50.06%

Figure 4: Example of results obtained by the Neurex ANN simulator.

5 Results of Autonomous Design of ANN Using Standard Backpropagation with Neurex
The design expertise for standard backpropagation learning with momentum descent implemented in Neurex has been tested on nine different applications using a RISC 6000 model 550. First, Neurex has been validated with the four simple experimentation cases used to develop this expertise. For the encoder-decoder tasks and the diamond-shaped recognition task, Neurex found correct configurations according to the ranges observed during our "manual" experimentation. But it is for the double-spiral task that Neurex proved most useful. The best configuration found by Neurex uses H = 8, r = 2, ε = 0.25, α = 0.7, and NT = 12. The average convergence time of this configuration is 5477 epochs. During the exploratory search phase, the value of Hmin found is 7. Finally, the minimum average convergence time found during the intensive search phase is 1835.5 epochs, for the configuration spir92 of Figure 4. These results are better than the ones found during our experimentation without Neurex. Figure 4 also shows the important saving of training time when oscillation detection is used by the ANN simulator. Configuration spir2 has five oscillatory trials detected after an average of 122 epochs. This represents a saving of 99.6% relative to the NE parameter fixed at 30,720 epochs, or a saving of 93.4% relative to the minimum average convergence time found for configuration spir92 (Michaud 1993). We have also validated the efficiency of the design expertise for other applications. First, Neurex has been used in a feasibility study to assess the applicability of the standard backpropagation with momentum descent paradigm for establishing radiance-reflectance spectrum correspondence and classification (Royer et al. 1989). Design parameters for these two applications were completely unknown. Neurex has been able to
Table 1: Experimentation with Radiance-Reflectance Spectrums
                           Radiance spectrum       Radiance-reflectance
                           classification          spectrums correspondence
Inputs                     73                      73
Outputs                    11                      49
Training vectors           17                      13
Error tolerance            30%                     5%
Initial parameters         (9, 0.5, 0.5, 0.5,      (9, 0.5, 0.5, 0.5,
(H, r, ε, α, NE, NT)        12240, 5)               9360, 5)
Final parameters           (10, 1.83, 0.0625,      (12, 0.5, 0.03125,
(H, r, ε, α, NE, NT)        0.2, 12240, 4)          0.9, 9360, 4)
                           MNEC = 10143            MNEC = 917.75
Number of configurations   84                      104
find, autonomously, appropriate design parameters for these applications, thereby demonstrating its usefulness and efficiency in rapidly establishing the possibility of using the standard backpropagation with momentum descent paradigm for unknown design tasks. Table 1 presents the application characteristics and results obtained by Neurex. Neurex took approximately 22 hours of CPU time to find the design parameters for each of these problems. We have also validated Neurex for applications where the generalization ability of the ANN is of concern during the design process. In the following cases, the training and the testing data sets are different. First, two number recognition applications were submitted to Neurex. Application characteristics and Neurex design results are presented in Table 2. During the intensive search phase of the printed number task, Neurex selected the best configurations for H between 3 and 5 and concluded that the configuration with the best generalization ability (according to the GF factor) is found for H = 4. For the handwritten number recognition problem, our preliminary experimentation without Neurex led us to believe that Hmin = 7; but Neurex found Hmin = 3, demonstrating again its efficient search heuristic. Neurex took approximately 7 hours of CPU time to find the appropriate design configuration for the handwritten number recognition task. To demonstrate the exploratory search phase process described in Section 4.2, Figure 5 shows charts of the first design configurations selected by the expert system during this search phase for the printed number recognition problem. For these configurations, α is equal to 0, and NT is equal to 5. The first chart represents H and the number of convergent trials found for each configuration. The second chart shows the r and ε used for the configurations. The third chart compares NE with the average training epochs used for simulating non-convergent trials.
The fourth chart describes the type of non-convergent trials: oscillations, plateaus, or trials stopped because of the NE limit. Finally, the fifth chart shows MPT, the maximum percentage of training vectors reaching the training success criterion for non-convergent trials, which is used to evaluate the performance of non-convergent trials.

Table 2: Experimentation with Number Recognition Applications

                           Printed number          Handwritten number
                           recognition             recognition
Inputs                     35                      50
Outputs                    10                      10
Vectors                    20 (training),          70 (training),
                           10 (testing)            30 (testing)
Error tolerance            30%                     30%
Initial parameters         (3, 0.5, 0.5, 0.5,      (6, 0.5, 0.5, 0.5,
(H, r, ε, α, NE, NT)        3600, 5)                7000, 5)
Final parameters           (4, 0.79, 0.6375,       (5, 0.25, 0.03125,
(H, r, ε, α, NE, NT)        0.2, 3600, 4)           0.8, 10034, 4)
                           MNEC = 155.2,           MNEC = 1029.5,
                           Hmin = 3                Hmin = 3
Number of configurations   140                     89
%TS (max, mean, min)       90%, 76%, 60%           100%, 82.5%, 73.3%

During this exploratory search phase, three configurations were first simulated (configurations 0-2). For these configurations, H, r, and NE remained constant; only their ε values were different. Because a configuration having convergent trials was found (configuration 0), H was decreased from 3 to 2, and three other configurations were simulated (configurations 3-5). The same r and ε as in the first three configurations were used, but NE was increased to allow more training time.¹¹ Only non-convergent trials with oscillations were found for these three configurations. As we can see, increasing ε did not improve the ANN performances: the average training epochs for non-convergent trials decreased,¹² as did the MPT. Then, because configuration 3 was the best configuration found, the heuristic chose to decrease ε (configuration 6) to see if a smaller learning rate would give better results. Although oscillations were not the principal cause of non-convergence for configuration 6, it seemed unnecessary to pursue decreasing ε. A maximum was observed for the MPT based on the variation of ε (configuration 3), and the design heuristic modified r to start a new search for an appropriate learning rate value (configurations 7-9). The learning rates used for these new configurations were adjusted based on the results obtained for configurations 3 to 6. The second application considering the generalization ability of the ANN is the sonar classification task. This commonly known benchmark was introduced by Gorman and Sejnowski (1988) to analyze the effect

¹¹ Because the number of hidden units was decreased.
¹² This indicates that more oscillations were detected by the ANN simulator for the configurations with greater ε, allowing the simulation to be stopped earlier by the ANN simulator.
Figure 5: Characteristics and simulation results obtained for the first 10 configurations simulated in the exploratory search phase for the printed number recognition application.

of different numbers of hidden units on the ANN performance, with all other design parameters fixed.¹³ The learning law used is a modified version of the standard backpropagation paradigm, allowing backpropagation only when the error is greater than 0.2.¹⁴ Because of these differences, our study of this application was mainly oriented toward finding a good design parameter configuration for this task using the standard backpropagation with momentum descent paradigm, by allowing the modification of all design parameters. We used the aspect-angle dependent series for the training and testing sets, each composed of 104 vectors of 60 inputs and 2 outputs. The training and the testing success criteria have an error tolerance of 40%.

¹³ With r = 0.3, ε = 2.0, α = 0, NE = 300, and NT = 10.
¹⁴ This is done to avoid overlearning and to favor generalization and fast convergence.
Table 3: Experimentation with the Sonar Classification Task
Design parameters                          MNEC   %TS    FM     CF    GF
H = 3, r = 1.05, ε = 0.03125, α = 0.4      911    81.1   1215   1.5   5.6
H = 4, r = 0.3,  ε = 0.03125, α = 0.7      582    84.1   1164   0.9   4.9
H = 5, r = 0.3,  ε = 0.03125, α = 0.9ᵃ     197    85.9   263    0.3   4.3
H = 6, r = 1.05, ε = 0.03125, α = 0.7      356    80.8   711    0.6   5.9

ᵃ Best overall result obtained for this application.
We used the same initial conditions as Gorman and Sejnowski did to start the design process, except for NE, which was set to 2000 to see if configurations could converge given more time to train. During the exploratory search phase, Neurex found that Hmin is 3. The best configurations found for each H are presented in Table 3, where NT is equal to 4 (as chosen by Neurex during the exploratory search phase), NE is equal to 2000, and results are presented for the trials that completely learned the training set. Differences between MNEC and FM indicate that some trials did not converge. Except for these non-convergent trials, only one training vector was not learned according to the training success criteria, so that the average performance on training sets varied between 99.52% and 99.76% for these configurations. The best configuration was obtained for H = 5: it has the best learning time and the best testing success ratio, as expressed by the FM, CF, and GF factors. The standard deviation for the testing success ratio of this configuration was 1.2%. Because the configuration performance (according to GF) stopped improving after H = 5, Neurex stopped simulating after H = 6. The final configuration is given after the simulation of 100 configurations and 5 hours of processing time. These results indicate that an ANN is able to learn all the training vectors using a much smaller number of hidden units than the H = 24 indicated by Gorman and Sejnowski (1988), only by adjusting the other design parameters.

6 Conclusion
In this article, we presented the general characteristics of Neurex, which autonomously supervises the iterative ANN design process. Neurex tries to reproduce the design expertise used by a human developer in three ways: training supervision by the ANN simulator, a design heuristic based on parameter interdependencies for the learning paradigm and implemented in the expert system, and a database of known design applications to help start the design process for unknown design tasks. These three aspects are not covered by the other autonomous ANN design systems.
The only design expertise implemented in Neurex thus far is for the standard backpropagation learning law. This expertise is represented using evaluation criteria, supervision modes, and rules, which have been briefly presented in this article. We must emphasize that this expertise has been developed empirically, by trying to exploit useful design parameter interdependencies. It should not be interpreted as the best design expertise for this learning paradigm. Other ANN developers may use different methods (internal analysis of the ANN, cross-validation [Hecht-Nielsen 1989], optimization of other criteria, statistical analysis, etc.), which could lead to new design expertise in Neurex. But the expertise implemented in Neurex has proven to be very efficient and useful for autonomous ANN design with the standard backpropagation with momentum descent paradigm. Furthermore, the design heuristic presented tries to find the quasi-optimal parameter configuration for a given application; it does not merely try to find a configuration with convergent trials. A certain number of configurations must then be simulated before reaching this goal. One advantage is that all of this is done autonomously by Neurex, without the need for human supervision. Using this design heuristic, the art of finding the appropriate number of hidden units (Santini 1992) is now systematized in a set of general design rules that find the correct combination of the other design parameters giving the best design performance for every H. Finally, Neurex can be useful to unskilled users and to experienced developers confronting a new application, as shown for the radiance-reflectance spectrums applications. It can also be used for establishing fair comparisons between learning paradigms, by finding the best design parameters for each one on a given benchmark and then comparing ANN performance results. All in all, using Neurex results in a smaller ANN design effort.
The strength of Neurex lies in the number of different learning laws that could be controlled for solving any kind of problem, and in the effectiveness of the design heuristics supervising their development. Design expertise for other learning paradigms will have to be implemented in Neurex to allow the selection of the best learning paradigms and to gather the different learning laws into an environment correctly establishing their aims and design principles. It will also give Neurex the versatility to supervise the ANN design process for different types of tasks.
Appendix

Symbol     Description
α          Momentum
ε          Learning rate
CF         Comparison factor
f_ε        Learning rate factor
f_r        Random initial weights factor
FM         Factorized mean
GF         Global factor
H          Number of hidden units
Hmin       Minimum number of hidden units for convergence
MNEC       Mean number of training epochs for convergent trials
MPT        Maximum percentage of training vectors reaching the training success criterion for non-convergent trials
%NC        Average training success ratio
NE         Maximum number of training epochs
NT         Number of trials
r          Range of the random initial weights
RMS        Root mean square
RMSnorm    Normalized root mean square error
%TS        Average testing success ratio
VNEC       Standard deviation of the number of training epochs for convergent trials
Acknowledgments
Support from the Natural Sciences and Engineering Research Council of Canada (NSERC) was highly appreciated. We also thank the anonymous referees for their helpful comments.
References
Burrascano, P. 1992. Network topology, training set size and generalization ability in MLP's project. In Artificial Neural Networks 2, pp. 113-116. Elsevier Science.
Caudill, M. 1991. Expert networks. BYTE, 108-116.
Drago, G. P., and Ridella, S. 1992. SCAWI: An algorithm for weight initialization of a sigmoidal neural network. In Artificial Neural Networks 2, pp. 983-986. Elsevier Science.
Fahlman, S. E. 1988. Faster-learning variations on back-propagation: An empirical study. In Proc. Connectionist Models Summer School, pp. 38-51. San Mateo, CA.
Fahlman, S. E., and Lebiere, C. 1989. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems, pp. 524-532. Morgan Kaufmann, San Mateo, CA.
Gorman, R. P., and Sejnowski, T. J. 1988. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1, 75-89.
Harp, S. A., Samad, T., and Guha, A. 1990. Designing application-specific neural networks using genetic algorithms. In Advances in Neural Information Processing Systems, Vol. 2, pp. 447-454. Morgan Kaufmann, San Mateo, CA.
Hecht-Nielsen, R. 1989. Neurocomputing. Addison-Wesley, Reading, MA.
Hilario, M. 1995. An overview of strategies for neurosymbolic integration. In Workshop on Connectionist-Symbolic Integration: From Unified to Hybrid Approaches, Int'l Joint Conf. on Artificial Intelligence, pp. 1-6. Montreal, Quebec, Canada.
Jacobs, R. A. 1987. Increased rates of convergence through learning rate adaptation. COINS Tech. Rep. 87-117, Department of Computer and Information Science, University of Massachusetts.
Kamiyama, N., Iijima, N., Taguchi, A., Mitsui, H., Yoshida, Y., and Sone, M. 1992. Tuning of learning rate and momentum on back-propagation. In Communications on the Move, pp. 528-532. Singapore.
Kung, S. Y., and Hwang, J. N. 1988. An algebraic projection analysis for optimal hidden units size and learning rates in back-propagation learning. In Proc. Int'l Joint Conf. on Neural Networks, Vol. 1, pp. 363-370. San Diego, CA.
Lang, K. J., and Witbrock, M. J. 1988. Learning to tell two spirals apart. In Proc. Connectionist Models Summer School, pp. 52-59. San Mateo, CA.
Lendaris, G. G. 1990. A proposal for indicating quality of generalization when evaluating ANNs. In Proc. Int'l Joint Conf. on Neural Networks, Vol. 1, pp. 709-713. San Diego, CA.
Michaud, F. 1993. Réseau expert pour gérer le processus itératif de conception de réseaux de neurones artificiels utilisant la rétropropagation standard. Master's thesis (in French), Department of Electrical and Computer Engineering, Université de Sherbrooke, Quebec, Canada.
Michaud, F., Rubio, R. G., and Berkane, A. 1993. Empirical study of parameter interdependencies for backpropagation learning. Tech. Rep. GEI-1993-001, Department of Electrical and Computer Engineering, Université de Sherbrooke, Quebec, Canada.
Michaud, F., Rubio, R. G., Dalle, D., and Ward, S. 1995. Simulateur de réseaux de neurones artificiels intégrant une supervision de leur entraînement. Canadian Journal of Electrical and Computer Engineering 20(1), 25-34.
Neuralware. 1993. Neural Computing: A Technology Handbook for Professional II/Plus and NeuralWorks Explorer. Neuralware, Inc.
Plaut, D. C., and Nowlan, S. J. 1986. Experiments on learning by back-propagation. Tech. Rep. CMU-CS-86-126, Computer Science Department, Carnegie-Mellon University.
Royer, A., O'Neill, N. T., Williams, D., Cliche, P., and Verreault, R. 1989. Système de mesures de réflectances pour les spectromètres imageurs. In Int'l Geoscience and Remote Sensing Symposium (IGARSS), pp. 2097-2100.
Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press, Cambridge, MA.
Santini, S. 1992. The bearable lightness of being: Reducing the number of weights in backprop. In Artificial Neural Networks 2, pp. 139-142. Elsevier Science.
Sundararajan, N., Chin, L., and San, Y. K. 1993. Selection of network and learning parameters for an adaptive neural robotic control scheme. Mechatronics 3(6), 747-766.
Tesauro, G., and Janssens, B. 1988. Scaling relationships in back-propagation learning. Complex Systems 2, 39-44.
Wah, B. W., and Kriplani, H. 1990. Resource constrained design of artificial neural networks. In Proc. Int'l Joint Conf. on Neural Networks, Vol. 3, pp. 269-279. San Diego, CA.
Received July 14, 1995; accepted March 27, 1996.
Index
Volume 8 By Author
Abarbanel, Henry D. I.; Huerta, R.; Rabinovich, M. I.; Rulkov, N. F.; Rowat, P. F.; and Selverston, A. I. Synchronized Action of Synaptically Coupled Chaotic Model Neurons (Article) 8(8):1567-1602
Abbott, L. F.-See Blum, Kenneth I.
Amari, S.-See Muller, K.-R.
An, Guozhong The Effects of Adding Noise During Backpropagation Training on a Generalization Performance (Letter)
8(3):643-674
Arrington, Karl Frederick Directional Filling-in (Letter)
8(2):300-318
Atick, Joseph J.; Griffin, Paul A.; and Redlich, A. Norman Statistical Approach to Shape from Shading: Reconstruction of Three-Dimensional Face Surfaces from Single Two-Dimensional Images (Letter)
8(6):1321-1340
August, David A. and Levy, William B A Simple Spike Train Decoder Inspired by the Sampling Theorem (Letter)
8(1):67-84
Bair, Wyeth and Koch, Christof Temporal Precision of Spike Trains in Extrastriate Cortex of the Behaving Macaque Monkey (Letter) 8(6):1185-1202
Bakker, Paul-See
Wiles, Janet
Baldi, Pierre and Chauvin, Yves Hybrid modeling, HMM/NN architectures, and protein applications (Letter) 8(7):1541-1565
Barber, David and Saad, David Does Extra Knowledge Necessarily Improve Generalization? (Letter) 8(1):202-214
Barrow, Harry G.; Bray, Alistair J.; and Budd, Julian M. L. A Self-organizing Model of “Color Blob” Formation (Letter)
8(7):1427-1448
Bartlett, Peter L. and Williamson, Robert C. The VC Dimension and Pseudodimension of Two-Layer Neural Networks with Discrete Inputs (Letter)
8(3):625-628
Bauer, H.-U.; Der, R.; and Herrmann M. Controlling the Magnification Factor of Self-organizing Feature Maps (Letter)
8(4):757-771
Baxter, Robert A.-See Levy, William B
Beaufays, Françoise-See Wan, Eric A.
Bishop, Christopher M. and Nabney, Ian T. Modeling Conditional Probability Distributions for Periodic Variables (Letter) 8(5):1123-1133
Blum, Kenneth I. and Abbott, L. F. A Model of Spatial Map Formation in the Hippocampus of the Rat (Letter) 8(1):85-93
Boers, Egberg J. W.-See Sprinkhuizen-Kuyper, Ida G.
Bray, Alistair J.-See Barrow, Harry G.
Brunel, Nicolas Hebbian Learning of Context in Recurrent Neural Networks (Letter) 8(8):1677-1710
Budd, Julian M. L.-See Barrow, Harry G.
Budinich, Marco A Self-organizing Neural Network for the Traveling Salesman Problem That Is Competitive with Simulated Annealing (Letter)
8(2):416-424
Cartling, Bo Response Characteristics of a Low-Dimensional Model Neuron (Letter)
8(8):1643-1652
Casey, Mike The Dynamics of Discrete-Time Computation, with Application to Recurrent Neural Networks and Finite State Machine Extraction (Article)
8(6):1135-1178
Chauvin, Yves-See
Baldi, Pierre
Chay, Teresa Ree Modeling Slowly Bursting Neurons via Calcium Store and Voltage-Independent Calcium Current (Letter)
8(5):951-978
Chen, Yinong and Reggia, James A. Alignment of Coexisting Cortical Maps in a Motor Control Model (Letter)
8(4):731-755
Chover, Joshua Neural Correlation via Random Connections (Letter) 8(8):1711-1729
Cooper, Leon N-See Shouval, Harel
Cowan, Jack D.-See Gerstner, Wulfram
DasGupta, Bhaskar and Schnitger, Georg Analog versus Discrete Neural Networks (Letter) 8(4):805-818
Deco, Gustavo-See Parra, Lucas
Der, R.-See Bauer, H.-U.
Dundar, G.; Hsu, F-C.; and Rose, K. Effects of Nonlinear Synapses on the Performance of Multilayer Neural Networks (Letter) 8(5):939-949
Eldracher, Martin-See Schmidhuber, Jurgen
Elias, John G.-See Northmore, David P. M.
Ermentrout, Bard Type I Membranes, Phase Resetting Curves, and Synchrony (Letter) 8(5):979-1001
Feng, Jianfeng; Pan, Hong; Roychowdhury, Vwani P. On Neurodynamics with Limiter Function and Linsker's Developmental Model (Letter) 8(5):1003-1019
Fiesler, Emile-See Thimm, Georg
Fine, Terrence L.-See Mukherjee, Sayandev
Finke, M.-See Muller, K.-R.
Fissell, Kate-See Jenison, Rick L.
Foltin, Bernhard-See Schmidhuber, Jurgen
Gabbiani, Fabrizio and Koch, Christof Coding of Time-Varying Signals in Spike Trains of Integrate-and-Fire Neurons with Random Threshold (Letter) 8(1):44-66
Gerstner, Wulfram; van Hemmen, J. Leo; and Cowan, Jack D. What Matters in Neuronal Locking? (Letter)
8(8):1653-1676
Giles, C. Lee-See Omlin, Christian W.
Girosi, Federico-See Niyogi, Partha
Gold, Steven; Rangarajan, Anand; and Mjolsness, Eric Learning with Preknowledge: Clustering with Point and Graph Matching Distance Measures (Letter) 8(4):787-804
Gold, Steven-See Rangarajan, Anand
Griffin, Paul A.-See Atick, Joseph J.
Harvey, Inman and Stone, James V. Unicycling Helps Your French: Spontaneous Recovery of Associations by Learning Unrelated Tasks (Note) 8(4):697-704
Hatsopoulos, Nicholas G. Coupling the Neural and Physical Dynamics in Rhythmic Movements (Letter)
8(3):567-581
Herrmann, M.-See Bauer, H.-U.
Heskes, Tom-See
Wiegerinck, Wim
Hinzer, Karin-See
Longtin, Andre
Hole, Arne Vapnik-Chervonenkis Generalization Bounds for Real Valued Neural Networks (Letter)
8(6):1277-1299
Horn, David; Levy, Nir; and Ruppin, Eytan Neuronal-Based Synaptic Compensation: A Computational Study in Alzheimer's Disease (Letter)
8(6):1227-1243
Horn, David and Opher, Irit Temporal Segmentation in a Neural Dynamic System (Letter)
8(2):373-389
Hsu, F-C.-See Dundar, G.
Huerta, R.-See Abarbanel, Henry D. I.
Huq, Shaheedul-See Stevenson, Maryhelen
Huuhtanen, Pentti-See Lehtokangas, Mikko
Intrator, Nathan-See Shouval, Harel
Jenison, Rick L. and Fissell, Kate A Spherical Basis Function Neural Network for Modeling Auditory Space (Letter) 8(1):115-128
Jordan, Michael I.-See Xu, Lei
Kaski, Kimmo-See Lehtokangas, Mikko
Kehagias, Athanasios-See Petridis, Vassilios
Kirby, Michael J. and Miranda, Rick Circular Nodes in Neural Networks (Letter) 8(2):390-402
Koch, Christof-See Bair, Wyeth
Koch, Christof-See Gabbiani, Fabrizio
Kohlmorgen, Jens-See Pawelzik, Klaus
Lappe, Markus Functional Consequences of an Integration of Motion and Stereopsis in Area MT of Monkey Extrastriate Visual Cortex (Letter) 8(7):1449-1462
Law, C. Charles-See Shouval, Harel
Lee, Christopher W. and Olshausen, Bruno A. A Nonlinear Hebbian Network that Learns to Detect Disparity in Random-Dot Stereograms (Letter)
8(3):545-566
Lehtokangas, Mikko; Saarinen, Jukka; Huuhtanen, Pentti; and Kaski, Kimmo Predictive Minimum Description Length Criterion for Time Series Modeling with Neural Networks (Letter)
8(3):583-593
Levy, Nir-See
Horn, David
Levy, William B and Baxter, Robert A. Energy Efficient Neural Codes (Letter) 8(3):531-543
Levy, William B-See August, David A.
Li, Zhaoping A Theory of the Visual Motion Coding in the Primary Visual Cortex (Letter)
8(4):705-730
Linster, Christiane and Masson, Claudine A Neural Model of Olfactory Sensory Memory in the Honeybee's Antennal Lobe (Letter)
8(1):94-114
Littmann, Enno and Ritter, Helge Learning and Generalization in Cascade Network Architectures (Letter) 8(7):1521-1540
Longtin, Andre and Hinzer, Karin Encoding with Bursting, Subthreshold Oscillations, and Noise in Mammalian Cold Receptors (Article) 8(2):215-255
Lynton, Adam-See Wiles, Janet
Lytton, William W. Optimizing Synaptic Conductance Calculation for Network Simulations (Note) 8(3):501-509
Maass, Wolfgang Lower Bounds for the Computational Power of Networks of Spiking Neurons (Article) 8(1):1-40
MacKay, David J. C. Equivalence of Linear Boltzmann Chains and Hidden Markov Models (Letter) 8(1):178-181
Masry, Elias-See Modha, Dharmendra S.
Masson, Claudine-See
Linster, Christiane
Mato, G. and Sompolinsky, H. Neural Network Models of Perceptual Learning of Angle Discrimination (Letter) 8(2):270-299
Meyer-Base, Anke; Ohl, Frank; and Scheich, Henning Singular Perturbation Analysis of Competitive Neural Networks with Different Time Scales (Letter) 8(8):1731-1742
Mhaskar, H. N. Neural Networks for Optimal Approximation of Smooth and Analytic Functions (Letter)
8(1):164-177
Michaels, Ronald Associative Memory with Uncorrelated Inputs (Note)
8(2):256-259
Michaud, François and Rubio, Ruben Gonzalez Autonomous Design of Artificial Neural Networks by Neurex (Letter) 8(8):1767-1786
Miesbach, Stefan-See
Parra, Lucas
Miller, David and Rose, Kenneth Hierarchical, Unsupervised Learning with Growing via Phase Transitions (Letter) 8(2):425-450
Miranda, Rick-See Kirby, Michael J.
Mjolsness, Eric-See Gold, Steven
Mjolsness, Eric-See Rangarajan, Anand
Modha, Dharmendra S. and Masry, Elias Rate of Convergence in Density Estimation Using Neural Networks (Letter) 8(5):1107-1122
Moerland, Perry-See Thimm, Georg
Molina, Christophe and Niranjan, Mahesan Pruning with Replacement on Limited Resource Allocating Networks by F-Projections (Letter)
8(4):855-868
Moody, John-See Wu, Lizhong
Morciniec, Michal-See Rohwer, Richard
Mukherjee, Sayandev and Fine, Terrence L. Online Steepest Descent Yields Weights with Nonnormal Limiting Distribution (Letter)
8(5):1075-1084
Muller, K.-R.; Finke, M.; Murata, N.; Schulten, K.; and Amari, S. A Numerical Study on Learning Curves in Stochastic Multilayer Feedforward Networks (Letter)
8(5):1085-1106
Muller, Klaus-Robert-See Pawelzik, Klaus
Murata, N.-See Muller, K.-R.
Nabney, Ian T.-See
Bishop, Christopher M.
Nakano, Ryohei-See Saito, Kazumi
Niranjan, Mahesan-See Molina, Christophe
Niyogi, Partha and Girosi, Federico On the Relationship between Generalization Error, Hypothesis Complexity, and Sample Complexity for Radial Basis Functions (Letter) 8(4):819-842
Norris, Michael-See Wiles, Janet
Northmore, David P. M. and Elias, John G. Spike Train Processing by a Silicon Neuromorph: The Role of Sublinear Summation in Dendrites (Letter) 8(6):1245-1265
Ohl, Frank-See Meyer-Base, Anke
Olshausen, Bruno A.-See
Lee, Christopher W.
Omlin, Christian W. and Giles, C. Lee Stable Encoding of Large Finite-State Automata in Recurrent Neural Networks with Sigmoid Discriminants (Article) 8(4):675-696
Opara, Ralf and Worgotter, Florentin Using Visual Latencies to Improve Image Segmentation (Letter) 8(7):1493-1520
Opher, Irit-See Horn, David
O'Reilly, Randall C. Biologically Plausible Error-Driven Learning Using Local Activation Differences: The Generalized Recirculation Algorithm (Article)
8(5):895-938
Orponen, Pekka The Computational Power of Discrete Hopfield Nets with Hidden Units (Letter)
8(2):403-415
Pan, Hong-See
Feng, Jianfeng
Parkinson, Sean-See
Wiles, Janet
Parra, Lucas; Deco, Gustavo; and Miesbach, Stefan Statistical Independence and Novelty Detection with Information Preserving Nonlinear Maps (Note)
8(2):260-269
Partridge, D. and Yates, W. B. Engineering Multiversion Neural-Net Systems (Letter)
8(4):869-893
Pawelzik, Klaus; Kohlmorgen, Jens; and Muller, Klaus-Robert Annealed Competition of Experts for a Segmentation and Classification of Switching Dynamics (Letter)
8(2):340-356
Pearlmutter, Barak A.-See
Zador, Anthony M.
Petridis, Vassilios and Kehagias, Athanasios A Recurrent Network Implementation of Time Series Classification (Letter) 8(2):357-372
Qian, Ning-See Zhu, Yu-Dong
Rabinovich, M. I.-See
Abarbanel, Henry D. I.
Rangarajan, Anand; Gold, Steven; and Mjolsness, Eric A Novel Optimizing Network Architecture with Applications (Letter) 8(5):1041-1060
Rangarajan, Anand-See Gold, Steven
Redlich, A. Norman-See Atick, Joseph J.
Reggia, James A.-See Chen, Yinong
Ritter, Helge-See Littmann, Enno
Rohwer, Richard and Morciniec, Michal A Theoretical and Experimental Account of n-Tuple Classifier Performance (Letter)
8(3):629-642
Rohwer, Richard and van der Rest, John C. Minimum Description Length, Regularization and Multimodal Data (Letter)
8(3):595-609
Rohwer, Richard-See
Zhu, Huaiyu
Rojas, Raúl A Short Proof of the Posterior Probability Property of Classifier Neural Networks (Note) 8(1):41-43
Rose, K.-See Dundar, G.
Rose, Kenneth-See Miller, David
Rowat, P. F.-See Abarbanel, Henry D. I.
Roychowdhury, Vwani P.-See Feng, Jianfeng
Rubio, Ruben Gonzalez-See Michaud, François
Rulkov, N. F.-See Abarbanel, Henry D. I.
Ruppin, Eytan-See Horn, David
Saad, David-See Barber, David
Saarinen, Jukka-See Lehtokangas, Mikko
Scheich, Henning-See Meyer-Base, Anke
Schmidhuber, Jurgen; Eldracher, Martin; and Foltin, Bernhard Semilinear Predictability Minimization Produces Well-Known Feature Detectors (Letter) 8(4):773-786
Schmidhuber, Jurgen-See Hochreiter, Sepp
Schnitger, Georg-See DasGupta, Bhaskar
Schulten, K.-See Muller, K.-R.
Selverston, A. I.-See Abarbanel, Henry D. I.
Shouval, Harel; Intrator, Nathan; Law, C. Charles; and Cooper, Leon N Effect of Binocular Cortical Misalignment on Ocular Dominance and Orientation Selectivity (Letter) 8(5):1021-1040
Snippe, Herman P. Parameter Extraction from Population Codes: A Critical Assessment (Letter) 8(3):511-529
Sompolinsky, H.-See Mato, G.
Sprinkhuizen-Kuyper, Ida G. and Boers, Egberg J. W. The Error Surface of the Simplest XOR Network Has Only Global Minima (Letter) 8(6):1301-1320
Staples, Mark-See Wiles, Janet
Stevenson, Maryhelen and Huq, Shaheedul On the Capacity of Threshold Adalines with Limited-Precision Weights (Note)
8(8):1603-1610
Stone, James V. Learning Perceptually Salient Visual Parameters Using Spatiotemporal Smoothness Constraints (Letter)
8(7):1463-1492
Stone, James V.-See
Harvey, Inman
Sum, John P. F. and Tam, Peter K. S. Note on the Maxnet Dynamics (Note) 8(3):491-499
Tam, Peter K. S.-See Sum, John P. F.
Thimm, Georg; Moerland, Perry; and Fiesler, Emile The Interchangeability of Learning Rate and Gain in Backpropagation Neural Networks (Letter)
8(2):451-460
Tibshirani, Robert A Comparison of Some Error Estimates for Neural Network Models (Letter)
8(1):152-163
Urahama, Kiichi Gradient Projection Network: Analog Solver for Linearly Constrained Nonlinear Programming (Letter)
8(5):1061-1073
Urbanczik, R. A Large Committee Machine Learning Noisy Rules (Letter)
8(6):1267-1276
van der Rest, John C.-See
Rohwer, Richard
van Hemmen, J. Leo-See
Gerstner, Wulfram
Wan, Eric A. and Beaufays, Françoise Diagrammatic Derivation of Gradient Algorithms for Neural Networks (Letter)
8(1):182-201
Wang, Wei-Ping Binary-Oscillator Networks: Bridging a Gap between Experimental and Abstract Modeling of Neural Networks (Letter)
8(2):319-339
Whiteside, Alan-See
Wiles, Janet
Wiegerinck, Wim and Heskes, Tom How Dependencies between Successive Examples Affect On-Line Learning (Letter)
8(8):1743-1765
Wiles, Janet; Bakker, Paul; Lynton, Adam; Norris, Michael; Parkinson, Sean; Staples, Mark; and Whiteside, Alan Using Bottlenecks in Feedforward Networks as a Dimension Reduction Technique: An Application to Optimization Tasks (Note) 8(6):1179-1183
Williams, Peter M. Using Neural Networks to Model Conditional Multivariate Densities (Letter) 8(4):843-854
Williamson, James R. Neural Network for Dynamic Binding with Graph Representation: Form, Linking, and Depth-from-Occlusion (Letter) 8(6):1203-1225
Williamson, Robert C.-See Bartlett, Peter L.
Wolpert, David H. The Existence of a Priori Distinctions Between Learning Algorithms (Article)
8(7):1391-1420
Wolpert, David H. The Lack of a Priori Distinctions Between Learning Algorithms (Article)
8(7):1341-1390
Worgotter, Florentin-See
Opara, Ralf
Wu, Lizhong and Moody, John A Smoothing Regularizer for Feedforward and Recurrent Neural Networks (Article)
8(3):461-489
Xu, Lei and Jordan, Michael I. On Convergence Properties of the EM Algorithm for Gaussian Mixtures (Letter) 8(1):129-151
Yates, W. B.-See Partridge, D.
Zador, Anthony M. and Pearlmutter, Barak A. VC Dimension of an Integrate-and-Fire Neuron Model (Letter)
8(3):611-624
Zhu, Huaiyu and Rohwer, Richard No Free Lunch for Cross Validation (Note)
8(7):1421-1426
Zhu, Yu-Dong and Qian, Ning Binocular Receptive Field Models, Disparity Tuning, and Characteristic Disparity (Letter)
8(8):1611-1641